W. Sun
6 records found
1
Amalur
The Convergence of Data Integration and Machine Learning
Machine learning (ML) training data is often scattered across disparate collections of datasets, called <italic>data silos</italic>. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy constraints, data often cannot leave the premises of data silos; hence model training should proceed in a decentralized manner. In this work, we present a vision of bridging traditional data integration (DI) techniques with the requirements of modern machine learning systems. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness, efficiency, and privacy of ML models. Towards this direction, we analyze ML training and inference over data silos. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning, and federated learning.
@enThe rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing, and model predictions often operate in separate execution environments, leading to redundant engineering and computations. Additionally, the diverging mathematical foundations of data processing and machine learning hinder cross-optimizations by combining these two components, thereby overlooking potential opportunities to expedite predictive pipelines. In this paper, we propose an operator fusing method based on GPU-accelerated linear algebraic evaluation of relational queries. Our method leverages linear algebra computation properties to merge operators in machine learning predictions and data processing, significantly accelerating predictive pipelines by up to 317x. We perform a complexity analysis to deliver quantitative insights into the advantages of operator fusion, considering various data and model dimensions. Furthermore, we extensively evaluate matrix multiplication query processing utilizing the widely-used Star Schema Benchmark. Through comprehensive evaluations, we demonstrate the effectiveness and potential of our approach in improving the efficiency of data processing and machine learning workloads on modern hardware.
@enAmalur
Data Integration Meets Machine Learning
Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos, feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning.
@enRecent advances in Graphic Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPUs. With the proliferation of machine learning, more databases have started to provide native support for the basic building blocks of ML algorithms, i.e., linear algebra operators such as matrix multiplication (MM). Despite the recent increasing interest in processing relational joins using matrix multiplication (MM-join), two crucial questions still remain open: i) how efficient are current MM-join implementations compared to the GPU-based join algorithms; ii) how should practitioners choose among MM-join and conventional GPU-based joins given different data characteristics.In this paper, we compare the execution time, and memory I/O of MM-join against multiple GPU hash joins. An empirical analysis of our experimental results reveals that the state-of-the-art hash join implementation shows substantial scalability for various data characteristics. In contrast, MM-join outperforms the SOTA hash join in low join selectivity and low table cardinality but shows unsatisfactory scalability due to synchronous data movement and computation.
@enCommissioning studies of the CMS hadron calorimeter have identified sporadic uncharacteristic noise and a small number of malfunctioning calorimeter channels. Algorithms have been developed to identify and address these problems in the data. The methods have been tested on cosmic ray muon data, calorimeter noise data, and single beam data collected with CMS in 2008. The noise rejection algorithms can be applied to LHC collision data at the trigger level or in the offline analysis. The application of the algorithms at the trigger level is shown to remove 90% of noise events with fake missing transverse energy above 100 GeV, which is sufficient for the CMS physics trigger operation.
@en