Circular Image

A. Katsifodimos

168 records found

Authored

BlockJoin

Efficient Matrix Partitioning Through Joins

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to- end ML pipelines often consist of (i) relational operator ...

In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded da ...

Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this programming paradigm has found its way in the core APIs of parallel dataflow eng ...

Apache Flink

Stream Analytics at Scale

Cutty

Aggregate sharing for user-defined windows

Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows were extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to ...

Bridging the Gap

Towards optimization across linear and relational Algebra

Advanced data analysis typically requires some form of preprocessing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink ...

Emma in action

Declarative Dataflows for scalable data analysis

Parallel dataow APIs based on second-order functions were originally seen as a exible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient executi ...

Apache Flink™

Stream and Batch Processing in a Single Engine

Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms ...
Over the past years, parallel dataflow systems have been employed for advanced analytics in the field of data mining where many algorithms are iterative. These systems typically provide fault tolerance by periodically checkpointing the algorithm's state and, in case of failure, r ...
The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataow APIs that are tigh ...

Delta

Scalable data dissemination under capacity constraints

In content-based publish-subscribe (pub/sub) systems, users express their interests as queries over a stream of publications. Scaling up content-based pub/sub to very large numbers of subscriptions is challenging: users are interested in low latency, that is, getting subscription ...

The efficient processing of XQuery still poses significant challenges. A particularly effective technique to improve XQuery processing performance consists of using materialized views to answer queries. In this work, we consider the problem of choosing the best views to materi ...

ViP2P

Efficient XML management in DHT networks

We consider the problem of efficiently sharing large volumes of XML data based on distributed hash table overlay networks. Over the last three years, we have built ViP2P (standing for Views in Peer-to-Peer), a platform for the distributed, parallel dissemination of XML data among ...

Minersoft

Software retrieval in grid and cloud computing infrastructures

One of the main goals of Cloud and Grid infrastructures is to make their services easily accessible and attractive to end-users. In this article we investigate the problem of supporting keyword-based searching for the discovery of software files that are installed on the nodes of ...

LiquidXML

Adaptive XML content redistribution

We propose to demonstrate LiquidXML, a platform for managing large corpora of XML documents in large-scale P2P networks. All LiquidXML peers may publish XML documents to be shared with all the network peers. The challenge then is to efficiently (re-)distribute the published conte ...
Several large-scale Grid infrastructures are currently in operation around the world, federating an impressive collection of computational resources, a wide variety of application software, and hundreds of user communities. To better serve the current and prospective users of Gri ...
Grid infrastructures are in operation around the world, federating an impressive collection of computational resources and a wide variety of application software. In this context, it is important to establish advanced software discovery services that could help end-users locate s ...
In this paper, we investigate the problem of supporting keyword-based searching for the discovery of software resources that are installed on the nodes of largescale, federated Grid computing infrastructures. We address a number of challenges that arise from the unstructured natu ...