Over the past decade, data-driven approaches have been at the core of many business and research models. In critical domains such as healthcare and banking, data privacy requirements are very stringent. Synthetic tabular data is an emerging solution to privacy concerns, and Generative Adversarial Networks (GANs) are one of the emerging approaches for synthesizing such data. However, to capture all relevant relationships between columns, tabular data must be numerically encoded. Because columns may be of different types, this is a challenging task, as recent approaches have shown. In this paper, we focus on the dimensionality explosion problem, in which encoding produces high-dimensional datasets and, with them, computational overhead and longer training times. We introduce a novel synthesis pipeline, LCT-GAN, which improves on CTAB-GAN, the current state of the art in tabular data synthesis. Our approach addresses the dimensionality explosion problem by introducing a low-dimensional embedding step via an autoencoder prior to training. This step is combined with a novel conditional GAN architecture that operates in latent space. After thorough evaluation, we observe that our solution achieves more than 30\% improvement in certain statistical metrics in comparison to CTAB-GAN, accompanied by a 5-fold decrease in size and a 150-fold speedup in training time for a single epoch. We show that it is possible to embed tabular data using autoencoders and that GANs are able to learn complex relationships in latent space.
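To make the dimensionality explosion problem concrete, the following minimal sketch counts how wide a small table becomes after encoding. It assumes plain one-hot encoding of categorical columns, which may differ from the encoders actually used by CTAB-GAN or LCT-GAN; the toy table, its column names, and the helper `one_hot_width` are hypothetical and serve only as an illustration.

```python
def one_hot_width(rows, categorical_columns):
    """Return the number of columns after one-hot encoding the given columns.

    Assumption: every non-categorical column keeps a width of 1, and every
    categorical column expands to one indicator per distinct value.
    """
    width = len(rows[0]) - len(categorical_columns)
    for col in categorical_columns:
        width += len({row[col] for row in rows})
    return width


# Hypothetical toy table: one numeric column and two categorical columns
# with 500 and 40 distinct values respectively.
rows = [
    {"age": 18 + i % 70, "zip_code": i % 500, "occupation": i % 40}
    for i in range(10_000)
]

raw_width = len(rows[0])
encoded_width = one_hot_width(rows, ["zip_code", "occupation"])
print(f"raw columns: {raw_width}, encoded columns: {encoded_width}")
```

Under these assumptions, three raw columns expand to several hundred encoded columns, which illustrates the kind of growth that a low-dimensional embedding step is meant to counteract.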