Synthetic tabular data generated by tabular generative models represent an effective means of augmenting and sharing data. It is of paramount importance to trace and audit such synthetic data in order to avoid the harms and risks associated with inappropriate usage. While watermarking techniques are increasingly used for synthetic images, little is known about how to watermark synthetic tables so that the watermarks are imperceptible to humans, detectable by algorithms, and robust against post-editing. In this paper, we present the first watermarking algorithm for tabular diffusion models, which inserts novel ripple watermarks into the latent space of tables. For every synthetic table, the watermark starts from a central ring within the Fourier-transformed latent of the table and extends gradually across a large portion of the space. The watermark can be detected by calculating the distance between the Fourier-transformed tabular latent and the ground-truth watermark patch. We also develop post-editing attacks, including row/column/value deletion and distortion, to evaluate the robustness of the watermark. Our evaluation on four datasets demonstrates that our watermarking scheme effectively preserves the quality of synthetic tables in terms of resemblance, discriminability, and downstream utility. The average quality difference is less than 0.6% compared to non-watermarked data, while detectability remains high, with average statistical p-values over 25× lower than 0.02. Finally, our robustness analysis shows that the watermark is resilient against various post-editing actions, with 85% of the p-values remaining below 0.05 across all 18 attack settings on four datasets.
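The embedding and distance-based detection described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes a random Gaussian matrix as a stand-in for the diffusion latent, omits the diffusion model and its inversion entirely, and uses a real-valued patch that depends only on the radius so that the watermarked latent stays real. All function names, the ring radii, and the patch design are hypothetical choices made for illustration.

```python
import numpy as np


def radial_distance(shape):
    """Distance of every frequency bin from the centre of the shifted spectrum."""
    rows, cols = shape
    y, x = np.ogrid[:rows, :cols]
    return np.sqrt((y - rows // 2) ** 2 + (x - cols // 2) ** 2)


def ring_mask(shape, r_inner, r_outer):
    """Boolean mask covering the rings between r_inner and r_outer (inclusive)."""
    dist = radial_distance(shape)
    return (dist >= r_inner) & (dist <= r_outer)


def ripple_patch(shape, rng, scale=10.0):
    """Hypothetical patch: one real value per integer radius (concentric ripples)."""
    radii = np.round(radial_distance(shape)).astype(int)
    ring_values = rng.normal(0.0, scale, size=radii.max() + 1)
    return ring_values[radii]


def embed_watermark(latent, patch, mask):
    """Overwrite the masked Fourier coefficients of the latent with the patch."""
    spectrum = np.fft.fftshift(np.fft.fft2(latent))
    spectrum[mask] = patch[mask]
    # The patch is real and point-symmetric, so the inverse transform is real
    # up to floating-point error; .real only drops that residue.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real


def watermark_distance(latent, patch, mask):
    """Mean absolute distance between the latent spectrum and the patch in the mask."""
    spectrum = np.fft.fftshift(np.fft.fft2(latent))
    return float(np.mean(np.abs(spectrum[mask] - patch[mask])))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(32, 32))        # stand-in for a tabular latent
    mask = ring_mask(latent.shape, 2, 10)     # radii chosen arbitrarily
    patch = ripple_patch(latent.shape, rng)
    marked = embed_watermark(latent, patch, mask)
    print("distance, watermarked:", watermark_distance(marked, patch, mask))
    print("distance, plain      :", watermark_distance(latent, patch, mask))
```

In this toy setting a small distance indicates that the watermark is present. Turning the distance into a p-value requires a null distribution for non-watermarked latents, which the sketch does not model.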