Architecture of Data Lakes

Abstract

This chapter introduces the most important features of data lake systems and, building on them, outlines an architecture for such systems. The vision for a data lake system is a generic, extensible architecture with a unified data model that facilitates ingestion, storage, and metadata management over heterogeneous data sources. The chapter also introduces Constance, a real-life data lake system that handles sophisticated metadata management over raw data extracted from heterogeneous data sources. With embedded query rewriting engines that support structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. Big Data has undoubtedly become one of the most important challenges in database research. A metadata management system for data lakes should provide means to handle metadata in different data models (relational, XML, JSON, RDF) and should be able to represent mappings between metadata entries.
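
To make the last requirement concrete, the following is a minimal sketch of a metadata repository that stores entries from different data models and records mappings between them. It is an illustration under stated assumptions, not the design of Constance or of any existing system: the names `DataModel`, `MetadataEntry`, `MetadataRepository`, the `path` addressing scheme, and the example sources `crm_db` and `web_logs` are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataModel(Enum):
    """Data models named in the abstract."""
    RELATIONAL = "relational"
    XML = "xml"
    JSON = "json"
    RDF = "rdf"


@dataclass(frozen=True)
class MetadataEntry:
    """One schema element of a source (hypothetical structure)."""
    source: str       # name of the data source, e.g. a database or a file
    model: DataModel  # data model the source uses
    path: str         # assumed addressing scheme, e.g. "customers.email"


@dataclass
class MetadataRepository:
    """Holds metadata entries and undirected mappings between them."""
    entries: set = field(default_factory=set)
    mappings: list = field(default_factory=list)

    def register(self, entry):
        """Add an entry to the repository and return it."""
        self.entries.add(entry)
        return entry

    def add_mapping(self, left, right):
        """Record that two registered entries describe the same concept."""
        if left not in self.entries or right not in self.entries:
            raise KeyError("both entries must be registered before mapping")
        self.mappings.append((left, right))

    def mapped_to(self, entry):
        """Return every entry that is mapped to the given one."""
        result = []
        for left, right in self.mappings:
            if left == entry:
                result.append(right)
            elif right == entry:
                result.append(left)
        return result


# Usage: a relational column and a JSON field that hold the same attribute.
repo = MetadataRepository()
col = repo.register(MetadataEntry("crm_db", DataModel.RELATIONAL, "customers.email"))
fld = repo.register(MetadataEntry("web_logs", DataModel.JSON, "$.user.email"))
repo.add_mapping(col, fld)
```

In this sketch a mapping is a plain pair of entries. A real system would likely attach more to it, such as a transformation or a confidence score, but the pair is enough to show how entries from different data models can be related in one repository.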