In large-scale ML, data size becomes a critical variable, especially in large companies where models already exist and are hard to change or fine-tune. Time to market and model quality are essential metrics, so finding ways to select, prune, and augment the input data while treating the model as a black box can speed up the path from raw data to productionized model.
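As a rough illustration of the black-box framing, the sketch below keeps the model fixed and only varies which training rows it sees, comparing validation accuracy across subsets of increasing size. The dataset, model choice, and subset fractions are assumptions made for the example, not taken from the text.

```python
# Minimal sketch: treat the model as a black box and vary only the data subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large tabular dataset (illustrative only).
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

def evaluate_subset(idx):
    """Train the fixed (black-box) model on a row subset and score it on validation data."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    return accuracy_score(y_val, model.predict(X_val))

rng = np.random.default_rng(0)
for frac in (0.1, 0.25, 0.5, 1.0):
    n = int(frac * len(X_train))
    idx = rng.choice(len(X_train), size=n, replace=False)
    print(f"{frac:.0%} of training data -> val accuracy {evaluate_subset(idx):.3f}")
```

If accuracy plateaus well before the full dataset is used, the remaining samples are candidates for pruning without touching the model itself.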
Datasets can have thousands of features and many redundant or duplicate samples, for various business-logic reasons. In some ML flows, only a subset of those features and samples may contribute most of the final accuracy. Insights into which data points are the most meaningful can also help engineers collect more relevant samples, or focus their attention on specific parts of the data distribution.
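One possible sketch of this idea, assuming a pandas DataFrame with a `label` column: exact duplicate samples are dropped, and features are ranked by permutation importance, which only needs model predictions and therefore still treats the model as a black box. The column names, model, and demo data are illustrative assumptions.

```python
# Sketch: drop duplicate samples, then rank features with a black-box importance measure.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def prune_and_rank(df: pd.DataFrame, label: str = "label"):
    # Remove exact duplicate rows before training.
    df = df.drop_duplicates()
    X, y = df.drop(columns=[label]), df[label]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                      random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    # Permutation importance relies only on predictions, so any trained model works here.
    result = permutation_importance(model, X_val, y_val, n_repeats=5,
                                    random_state=0)
    ranking = pd.Series(result.importances_mean, index=X.columns)
    return df, ranking.sort_values(ascending=False)

if __name__ == "__main__":
    # Tiny synthetic demo (purely illustrative).
    from sklearn.datasets import make_classification
    Xs, ys = make_classification(n_samples=500, n_features=8, random_state=0)
    demo = pd.DataFrame(Xs, columns=[f"f{i}" for i in range(8)])
    demo["label"] = ys
    deduped, ranked = prune_and_rank(demo)
    print(ranked.head())
```

The ranking points engineers to the features (and, by extension, the regions of the data distribution) that matter most, which can guide both pruning and targeted data collection.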