On the utility of metadata to optimize machine learning workflows


Abstract

Over the last two decades, the machine learning (ML) field has witnessed a dramatic expansion, propelled by burgeoning data volumes and the advancement of computational technologies. Deep learning (DL) in particular has demonstrated remarkable success across a wide range of domains, including healthcare, mobility, life sciences, and energy systems. This success has been further accelerated by the availability and efficiency of open-source ML frameworks like TensorFlow and PyTorch, making ML methodologies more accessible than ever.

However, this rapid growth has brought its own set of challenges. The proliferation of ML models and related artifacts, such as datasets, generates abundant information throughout the ML lifecycle. The descriptive and property information about these artifacts is referred to as metadata. Yet current practices, such as model cards used in public model zoos and tools that track metadata within scripts, cannot fully capture the metadata of these artifacts, let alone provide a standardized approach for managing and accessing it. In addition, the prevailing practice of managing ML/DL scripts in traditional software repositories, while adequate for software engineering, falls short of the unique needs of ML workflows, such as model reuse and comparative analysis. These practices hinder structured, comprehensive metadata representation and its effective use. This disconnect points to a pressing need for improved methodologies and tools in the ML field.

In response to these challenges, this thesis investigates the development and exploitation of structured metadata representations within ML model zoos. In Chapter 2, we first propose a metamodel that represents different types of metadata, thus transforming the metadata from being merely descriptive to being queryable and machine-readable. The structured nature of our metamodel allows for more efficient querying and retrieval of information, which is a substantial improvement over traditional, text-based descriptions.
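
As a rough illustration of what "queryable and machine-readable" can mean in practice, the sketch below stores model metadata as typed records and filters them by field value. It is not the metamodel proposed in Chapter 2: the `ModelMetadata` class, its fields, the `query` helper, and all names and numbers in `ZOO` are hypothetical placeholders chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetadata:
    """Hypothetical structured record for one model in a zoo."""
    name: str
    task: str
    framework: str
    dataset: str
    accuracy: float    # placeholder value, not a measured result
    latency_ms: float  # placeholder value, not a measured result


def query(zoo, **conditions):
    """Return the records whose fields equal every given condition."""
    return [
        record for record in zoo
        if all(getattr(record, key) == value for key, value in conditions.items())
    ]


# Illustrative zoo; every entry is made up for this sketch.
ZOO = [
    ModelMetadata("model-a", "image-classification", "PyTorch", "dataset-x", 0.91, 12.0),
    ModelMetadata("model-b", "image-classification", "TensorFlow", "dataset-x", 0.88, 7.5),
    ModelMetadata("model-c", "object-detection", "PyTorch", "dataset-y", 0.79, 30.0),
]

pytorch_classifiers = query(ZOO, task="image-classification", framework="PyTorch")
print([record.name for record in pytorch_classifiers])
```

With a free-text description, the same question would require reading or parsing each model card; with typed fields, it reduces to a field comparison.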

Additionally, the thesis explores the use of metadata to optimize various ML processes, particularly the selection of appropriate models for two specific tasks: model inference and fine-tuning. In Chapter 3, we investigate the optimization of ML inference queries in heterogeneous model zoos using a Mixed-Integer-Programming-based optimizer. This optimizer, which considers multiple objectives such as accuracy and inference speed, provides a robust framework for model selection and execution planning. In Chapter 4, the research extends to model selection for fine-tuning. We investigate how to predict model performance, particularly accuracy, in scenarios where data domains shift, thus avoiding the need to fine-tune every candidate model. By selecting only the most promising candidates, this method substantially lowers the computational burden and associated costs of extensive model fine-tuning.
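
To make the multi-objective selection problem concrete, the sketch below picks one model per task so that a weighted accuracy-versus-latency score is maximized under a total latency budget. It is not the optimizer from Chapter 3: instead of a Mixed-Integer-Programming solver it enumerates every assignment exhaustively, which explores the same discrete decision space but only scales to tiny examples. The task names, candidate models, numbers, and weights are all hypothetical.

```python
from itertools import product

# Hypothetical candidates per task: (model name, accuracy, latency in ms).
CANDIDATES = {
    "detect": [("det-small", 0.80, 10.0), ("det-large", 0.90, 40.0)],
    "classify": [("cls-small", 0.85, 5.0), ("cls-large", 0.95, 30.0)],
}


def select_models(candidates, latency_budget_ms,
                  accuracy_weight=1.0, latency_weight=0.001):
    """Choose one model per task by exhaustive search.

    Maximizes sum(accuracy_weight * accuracy - latency_weight * latency)
    subject to total latency <= latency_budget_ms.
    Returns a {task: model name} dict, or None if no assignment is feasible.
    """
    tasks = list(candidates)
    best_plan, best_score = None, float("-inf")
    for combo in product(*(candidates[task] for task in tasks)):
        total_latency = sum(latency for _, _, latency in combo)
        if total_latency > latency_budget_ms:
            continue  # violates the latency constraint
        score = sum(accuracy_weight * accuracy - latency_weight * latency
                    for _, accuracy, latency in combo)
        if score > best_score:
            best_score = score
            best_plan = {task: name for task, (name, _, _) in zip(tasks, combo)}
    return best_plan


print(select_models(CANDIDATES, latency_budget_ms=50.0))
```

In a MIP formulation, each (task, model) pair would typically become a binary decision variable, with the budget expressed as a linear constraint; the enumeration above is only a stand-in that keeps the example dependency-free.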

Overall, this thesis investigates the representation and application of metadata. The insights and methodologies presented not only improve the efficiency and effectiveness of ML workflows but also pave the way for further work on integrating metadata into ML practices.