Architecture of Data Lakes

Chihoub, Houssem; Madera, Cédrine; Quix, Christoph; Hai, R.

Book chapter (2020)

Authors

Houssem Chihoub

Cédrine Madera

Christoph Quix

R. Hai

Affiliation

External organisation

RDF, Big data, XML, JSON, Constance, Data lake system, Metadata management system, Query rewriting engine, Unified data model

To reference this document use:

http://resolver.tudelft.nl/uuid:af7e63e7-cb63-4e6c-b995-91ea019b19ef

Published Date

2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This chapter introduces the most important features of data lake systems and, from these, outlines an architecture for such systems. The vision for a data lake system is a generic and extensible architecture with a unified data model that facilitates ingestion, storage and metadata management over heterogeneous data sources. The chapter also presents Constance, a real-life data lake system that supports sophisticated metadata management over raw data extracted from heterogeneous data sources. With embedded query rewriting engines for structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. Big data has undoubtedly become one of the most important challenges in database research. A metadata management system for data lakes should provide means to handle metadata in different data models (relational, XML, JSON, RDF) and should be able to represent mappings between the metadata entries.
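
The requirement that a metadata management system hold entries from different data models and represent mappings between them can be illustrated with a minimal sketch. This is not Constance's actual design or API: the class names (`MetadataEntry`, `MetadataCatalog`), the source names (`crm_db`, `web_logs`) and the attribute paths are all hypothetical, and the "rewriting" shown is only a lookup from a unified attribute name to the source-level entries mapped to it.

```python
from dataclasses import dataclass, field
from enum import Enum


class DataModel(Enum):
    """Data models named in the abstract."""
    RELATIONAL = "relational"
    XML = "xml"
    JSON = "json"
    RDF = "rdf"


@dataclass(frozen=True)
class MetadataEntry:
    """One attribute of one source (hypothetical structure)."""
    source: str       # name of the ingested source
    path: str         # attribute path in the source's own notation
    model: DataModel  # data model the source uses


@dataclass
class MetadataCatalog:
    """Maps source-level entries to attribute names of a unified model."""
    mappings: dict = field(default_factory=dict)  # MetadataEntry -> unified name

    def register(self, entry: MetadataEntry, unified_name: str) -> None:
        self.mappings[entry] = unified_name

    def rewrite(self, unified_name: str) -> list:
        """Return the source entries a query over unified_name would touch."""
        matches = [e for e, name in self.mappings.items() if name == unified_name]
        return sorted(matches, key=lambda e: (e.source, e.path))


# Hypothetical sources: a relational table and a JSON log stream.
catalog = MetadataCatalog()
catalog.register(MetadataEntry("crm_db", "customer.email", DataModel.RELATIONAL), "email")
catalog.register(MetadataEntry("web_logs", "$.user.mail", DataModel.JSON), "email")
catalog.register(MetadataEntry("web_logs", "$.user.id", DataModel.JSON), "user_id")

targets = catalog.rewrite("email")
for entry in targets:
    print(entry.source, entry.path, entry.model.value)
```

A query phrased against the unified attribute `email` resolves here to one relational entry and one JSON entry, which is the kind of cross-model mapping the abstract asks the metadata layer to represent. A real system would also need schema matching and per-model query translation, which this sketch leaves out.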