T-REST: A watermark for autoregressive tabular large language models


Abstract

Tabular data is one of the most common forms of data in industry and science. Recent research on synthetic data generation employs autoregressive generative large language models (LLMs) to create highly realistic tabular data samples. With the increasing use of LLMs, there is a need to govern the data generated by these models, for instance by watermarking the model output. While the state-of-the-art Soft Red List watermarking framework has shown impressive results on standard language models, it cannot be seamlessly applied to models fine-tuned for generating tabular data, due to i) column permutation and ii) the inherently low entropy of the sequences the task generates. We propose Tabular Red GrEen LiST (T-REST), an adaptation of the Soft Red List watermarking algorithm to tabular LLMs that is agnostic to column permutation and improves detection efficiency by employing a weighted count method that favors columns with higher entropy. Our experiments on 4 real-world datasets demonstrate that T-REST introduces an insignificant drop of 3% in synthetic data quality compared to the non-watermarked data, as measured by resemblance and downstream machine learning efficiency metrics, while achieving high detection accuracy with an AUROC above 0.98. T-REST is invariant to column and row permutations and is robust against post-editing attacks on categorical columns, maintaining a True Positive Rate (TPR) above 0.85 when 50% of categorical values are modified.
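For readers unfamiliar with the underlying framework, the sketch below illustrates the detection side of a Soft Red List watermark in the style of Kirchenbauer et al.: the vocabulary is pseudo-randomly split into green and red lists seeded by the preceding token, and a one-proportion z-test on the green-token count flags watermarked text. This is a minimal illustration under assumed conventions, not the paper's implementation; the names `GAMMA`, `green_list`, and `z_score` are illustrative, and T-REST's entropy-weighted counting over columns is only hinted at in the comments.

```python
import math
import random

# Minimal sketch of Soft Red List detection, the framework T-REST adapts.
# All names and parameter choices here are illustrative assumptions.

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list


def green_list(prev_token: int, vocab_size: int) -> set[int]:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * vocab_size)])


def z_score(tokens: list[int], vocab_size: int) -> float:
    """One-proportion z-test on green-token hits; a large z suggests a watermark.

    T-REST (per the abstract) would additionally weight each hit by the
    entropy of the column the token belongs to, so that high-entropy columns
    dominate the statistic; that weighting is omitted here.
    """
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size)
        for i in range(1, len(tokens))
    )
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

In this style of scheme, generation biases sampling toward the green list (e.g., by adding a small logit bonus), so watermarked sequences accumulate green tokens well above the `GAMMA` baseline and yield large z-scores at detection time.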