Watermarking of numerical datasets used for ML

A DWT approach for watermarking numerical datasets

Abstract

Artificial intelligence and machine learning have attracted great interest in recent years, with applications in many domains. Training these models into useful tools requires large amounts of data. Such data is expensive to collect, which has made it one of the most valuable commodities of this century. As the value of data increases, protecting it as intellectual property becomes increasingly relevant. Watermarking is widely used to protect media data, but its application to non-media data has not been researched as thoroughly. In this paper, an adaptation of a common watermarking technique, discrete wavelet transform (DWT) watermarking, is applied to two datasets used for machine learning. The technique is invisible and robust in signal watermarking, but its performance on numerical datasets has not previously been studied. A previously devised algorithm was used, adjusted to better fit dataset watermarking. To assess the quality of the watermark, the marked data was subjected to create, remove, update and zero-out attacks. In addition, multiple machine-learning models were trained on the marked data. Initial results show that the proposed technique performs well in terms of invisibility, with models trained on the marked data obtaining similar or better accuracies than models trained on the original data, but it is quite sensitive to attacks. Even small modifications, affecting less than 1% of the data, can break the signature.
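
As a rough illustration of the general idea only, the sketch below embeds watermark bits into the detail coefficients of a single-level Haar DWT of one numerical column. It is not the algorithm used in the paper, which the abstract does not specify. The Haar wavelet, the parity-based quantization rule, the `step` parameter and the function names `embed` and `extract` are all assumptions made for this example.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(values):
    """Single-level Haar DWT of an even-length sequence of numbers."""
    approx = [(values[i] + values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the original sequence from both bands."""
    values = []
    for a, d in zip(approx, detail):
        values.append((a + d) / SQRT2)
        values.append((a - d) / SQRT2)
    return values

def embed(column, bits, step=0.01):
    """Hypothetical embedding rule (an assumption, not the paper's method):
    snap the i-th detail coefficient to a multiple of `step` whose parity
    encodes bits[i]. Requires len(bits) <= len(column) // 2."""
    approx, detail = haar_dwt(column)
    marked = list(detail)
    for i, bit in enumerate(bits):
        q = round(detail[i] / step)
        if q % 2 != bit:
            q += 1  # move to the neighbouring multiple with the right parity
        marked[i] = q * step
    return haar_idwt(approx, marked)

def extract(column, n_bits, step=0.01):
    """Read the bits back from the parity of the quantized detail coefficients."""
    _, detail = haar_dwt(column)
    return [round(d / step) % 2 for d in detail[:n_bits]]

column = [5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2]
bits = [1, 0, 1, 1]
marked = embed(column, bits)
recovered = extract(marked, len(bits))
```

In this sketch each value moves by roughly `step` at most, which is the sense in which such a mark can stay invisible. Because every bit depends on one fixed pair of neighbouring values, inserting or deleting a single row shifts the pairing for all later values, which gives some intuition for why a scheme of this kind can be fragile under create and remove attacks.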

Files

Research_paper.pdf
(pdf | 0.431 Mb)
License info not available