T-REST: A watermark for autoregressive tabular large language models


Abstract

Tabular data is one of the most common forms of data in industry and science. Recent research on synthetic data generation employs autoregressive generative large language models (LLMs) to create highly realistic tabular data samples. With the increasing use of LLMs, there is a need to govern the data generated by these models, for instance by watermarking the model output. While the state-of-the-art Soft Red List watermarking framework has shown impressive results on standard language models, it cannot be seamlessly applied to models fine-tuned for generating tabular data, due to i) column permutation and ii) the inherently low entropy of the sequences the task generates. We propose Tabular Red GrEen LiST (T-REST), an adaptation of the Soft Red List watermarking algorithm to tabular LLMs that is agnostic to column permutation and improves detection efficiency by employing a weighted count method that favors columns with higher entropy. Our experiments on 4 real-world datasets demonstrate that T-REST introduces an insignificant drop of 3% in synthetic data quality compared to the non-watermarked data, as measured by resemblance and downstream machine learning efficiency metrics, while achieving high detection accuracy with an AUROC above 0.98. T-REST is invariant to column and row permutations and is robust against post-editing attacks on categorical columns, maintaining a True Positive Rate (TPR) above 0.85 when 50% of categorical values are modified.
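For readers unfamiliar with the underlying framework, the sketch below illustrates the detection side of a Soft Red List watermark in the style of Kirchenbauer et al.: the vocabulary is pseudo-randomly split into green and red lists seeded by the preceding token, and a one-proportion z-test on the green-token count flags watermarked text. This is a minimal illustration under assumed conventions, not the paper's implementation; the names `GAMMA`, `green_list`, and `z_score` are illustrative, and T-REST's entropy-weighted counting over columns is only hinted at in the comments.

```python
import math
import random

# Minimal sketch of Soft Red List detection, the framework T-REST adapts.
# All names and parameter choices here are illustrative assumptions.

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list


def green_list(prev_token: int, vocab_size: int) -> set[int]:
    """Pseudo-randomly partition the vocabulary, seeded by the previous token."""
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(GAMMA * vocab_size)])


def z_score(tokens: list[int], vocab_size: int) -> float:
    """One-proportion z-test on green-token hits; a large z suggests a watermark.

    T-REST (per the abstract) would additionally weight each hit by the
    entropy of the column the token belongs to, so that high-entropy columns
    dominate the statistic; that weighting is omitted here.
    """
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size)
        for i in range(1, len(tokens))
    )
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

In this style of scheme, generation biases sampling toward the green list (e.g., by adding a small logit bonus), so watermarked sequences accumulate green tokens well above the `GAMMA` baseline and yield large z-scores at detection time.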