GDTS: GAN-Based Distributed Tabular Synthesizer

Zhao, Z.; Birke, Robert; Chen, Lydia Y.

doi:10.1109/CLOUD60044.2023.00078

Conference paper (2023)

Authors

Z. Zhao, Data-Intensive Systems

Robert Birke, University of Turin

Lydia Y. Chen, Data-Intensive Systems

Research Group

Data-Intensive Systems

DOI: https://doi.org/10.1109/CLOUD60044.2023.00078

Keywords: Federated learning, Tabular GAN, Tabular data, Non-IID

To reference this document use:

http://resolver.tudelft.nl/uuid:4b16cc71-f6b3-4adb-bcfc-5e444cb7bab2

More Info

Published Date

2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Generative Adversarial Networks (GANs) are typically trained to synthesize data, from images to, more recently, tabular data, under the assumption that the training data is directly accessible. While learning image GANs on Federated Learning (FL) and Multi-Discriminator (MD) systems has recently been demonstrated, it is unknown whether tabular GANs can be learned from decentralized data sources. Unlike image GANs, state-of-the-art tabular GANs require prior knowledge of the data distribution of each (discrete and continuous) column to agree on a common encoding, which risks privacy guarantees. In this paper, we propose GDTS, a distributed framework for GAN-based tabular synthesizers. GDTS provides different system architectures to match the two training paradigms, termed GDTS_FL and GDTS_MD. Key to enabling learning on distributed data is the proposed novel privacy-preserving multi-source feature encoding, which captures the global data properties. In addition, GDTS encompasses a weighting strategy based on table similarity to counter the detrimental effects of non-IID data, and a validation pipeline to easily assess and compare the performance of different paradigms and hyperparameters. We evaluate the effectiveness of GDTS in terms of synthetic data quality and overall training scalability. Experiments show that GDTS_FL achieves better statistical similarity and machine learning utility between generated and original data than GDTS_MD.
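
The abstract mentions a weighting strategy based on table similarity but does not say how similarity is measured or how the weights enter the aggregation. The following is a minimal sketch of one plausible reading, not the authors' method: each client is assumed to carry a non-negative similarity score, and the server forms a weighted mean of the client model parameters. The function names, the parameter name `generator.w`, and the scores themselves are hypothetical.

```python
from typing import Dict, List


def normalize_weights(similarities: List[float]) -> List[float]:
    """Turn non-negative similarity scores into weights that sum to 1."""
    if not similarities:
        raise ValueError("need at least one client")
    if any(s < 0 for s in similarities):
        raise ValueError("similarity scores must be non-negative")
    total = sum(similarities)
    if total == 0:
        # All scores are zero: fall back to a uniform average.
        return [1.0 / len(similarities)] * len(similarities)
    return [s / total for s in similarities]


def weighted_average(client_params: List[Dict[str, List[float]]],
                     weights: List[float]) -> Dict[str, List[float]]:
    """Aggregate per-client parameters into one model with a weighted mean."""
    if len(client_params) != len(weights):
        raise ValueError("one weight per client is required")
    merged: Dict[str, List[float]] = {}
    for name in client_params[0]:
        size = len(client_params[0][name])
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, client_params))
            for i in range(size)
        ]
    return merged


# Hypothetical example: three clients, the third holding heavily skewed
# (non-IID) data and therefore given a similarity score of zero.
clients = [
    {"generator.w": [1.0, 2.0]},
    {"generator.w": [3.0, 4.0]},
    {"generator.w": [9.0, 9.0]},
]
similarities = [0.5, 0.5, 0.0]  # assumed scores, not taken from the paper

weights = normalize_weights(similarities)
global_params = weighted_average(clients, weights)
```

In this sketch a client whose table diverges from the others contributes less to the aggregated model, which is the general effect the abstract attributes to the weighting strategy; the actual similarity measure and aggregation rule are described in the paper itself.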

Files

GDTS_GAN_Based_Distributed_Tab... (pdf)

(pdf | 1.21 Mb)

- Embargo expired on 25-03-2024

Unknown license