Permutation-Invariant Tabular Data Synthesis
Abstract
Tabular data synthesis is a promising approach to circumvent strict regulations on data privacy. Although the state-of-the-art tabular data synthesizers, e.g., table-GAN, CTGAN, TVAE, and CTAB-GAN, are effective at generating synthetic tabular data, they are sensitive to column permutations of input data. In this work, we conduct an impact and root-cause analysis of this sensitivity to column permutations through extensive experiments. Specifically, we show that changing the input column order increases the statistical difference between real and synthetic data by up to 39%, due to the encoding of tabular data. To address this challenge, we first attempt to find an optimal column order to improve tabular data synthesis. Next, we propose AE-GAN, an effective tabular data synthesizer that leverages the latent representation of tabular data to regulate its sensitivity to column permutations while incurring low training overhead. AE-GAN is composed of an Autoencoder (AE) to efficiently represent tabular data as latent vectors, a Generative Adversarial Network (GAN) to generate realistic synthetic data, and a classifier to improve the semantic integrity of the generated records. It combines the flexibility of unsupervised training with the control offered by supervised training, thereby ensuring the statistical similarity between real and synthetic data. The evaluation of AE-GAN on five datasets shows that it is not only more permutation-invariant than the prior state-of-the-art, but also results in better downstream analysis based on the generated data.
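As a rough illustration of what sensitivity to column permutations means, the sketch below permutes the columns of a tiny table and compares two kinds of quantities: a per-column statistic, which simply moves with its column, and a toy neighbour-based feature, which changes because different columns end up next to each other. This is an assumption-laden toy and not the encoding used by any of the synthesizers named above or by AE-GAN; the function names `permute_columns`, `column_means`, and `adjacent_pair_sums` are hypothetical.

```python
def permute_columns(rows, order):
    """Return a copy of rows with columns rearranged according to order."""
    return [[row[j] for j in order] for row in rows]

def column_means(rows):
    """Per-column mean: depends only on the column itself, not its neighbours."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def adjacent_pair_sums(row):
    """Toy order-sensitive feature (hypothetical): sum each pair of
    neighbouring columns, like a width-2 convolution with kernel [1, 1]."""
    return [row[i] + row[i + 1] for i in range(len(row) - 1)]

# A tiny table with four numeric columns.
rows = [[1.0, 10.0, 100.0, 1000.0],
        [2.0, 20.0, 200.0, 2000.0]]
order = [2, 0, 3, 1]  # new column i is taken from old column order[i]
permuted = permute_columns(rows, order)

# Column means are unchanged up to the same reordering.
means_original = column_means(rows)
means_permuted = column_means(permuted)
means_follow_columns = means_permuted == [means_original[j] for j in order]

# Neighbour-based features differ even as an unordered collection,
# because the permutation changes which columns are adjacent.
features_original = sorted(adjacent_pair_sums(rows[0]))
features_permuted = sorted(adjacent_pair_sums(permuted[0]))
features_differ = features_original != features_permuted

print(means_follow_columns, features_differ)
```

Any model component that mixes neighbouring positions behaves like the second case, which is one plausible way a column reordering could change what a synthesizer learns.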