M. Slokom | TU Delft Repository

Towards Purpose-aware Privacy-Preserving Techniques for Predictive Applications

Doctoral thesis (2024) - M. Slokom (author), MA Larson (promotor), Alan Hanjalic (promotor)

In the field of machine learning (ML), the goal is to leverage algorithmic models to generate predictions, transforming raw input data into valuable insights. However, the ML pipeline, consisting of input data, models, and output data, is susceptible to various vulnerabilities and attacks. These attacks include re-identification, attribute inference, membership inference, and model inversion attacks, all posing threats to individual privacy. This thesis specifically targets attribute inference attacks, wherein adversaries seek to infer sensitive information about target individuals.

The literature on privacy-preserving techniques explores various perturbative approaches, including obfuscation, randomization, and differential privacy, to mitigate privacy attacks. While these methods have shown effectiveness, conventional perturbation-based techniques often offer generic protection, lacking the nuance needed to preserve specific utility and accuracy. These conventional techniques are typically purpose-unaware, meaning they modify data to protect privacy while maintaining general data usefulness. Recently, there has been a growing interest in purpose-aware techniques.
The thesis introduces purpose-aware privacy preservation in the form of a conceptual framework. This approach involves tailoring data modifications to serve specific purposes and implementing changes orthogonal to relevant features. We aim to protect user privacy without compromising utility. We focus on two key applications within the ML spectrum: recommender systems and machine learning classifiers. The objective is to protect these applications against potential privacy attacks, addressing vulnerabilities in both input data and output data (i.e., predictions).

We structure the thesis into two parts, each addressing distinct challenges in the ML pipeline.
Part 1 tackles attacks on input data, exploring methods to protect sensitive information while maintaining the accuracy of ML models, specifically in recommender systems. Firstly, we explore an attack scenario in which an adversary can acquire the user-item matrix and aims to infer privacy-sensitive information. We assume that the adversary has a gender classifier that is pre-trained on unprotected data. The objective of the adversary is to infer the gender of target individuals. We propose personalized blurring (PerBlur), a personalization-based approach to gender obfuscation that aims to protect user privacy while maintaining the recommendation quality. We demonstrate that recommender system algorithms trained on obfuscated data perform comparably to those trained on the original user-item matrix.
Furthermore, our approach not only prevents classifiers from predicting users' gender based on the obfuscated data but also achieves diversity through the recommendation of (non-stereotypical) diverse items. Secondly, we investigate an attack scenario in which an adversary has access to a user-item matrix and aims to exploit the user preference values that it contains. The objective of the adversary is to infer the preferences of individual users. We propose Shuffle-NNN, a data masking-based approach that aims to hide the preferences of users for individual items while maintaining the relative performance of recommendation algorithms. We demonstrate that Shuffle-NNN provides evidence of what information should be retained and what can be removed from the user-item matrix. Shuffle-NNN has great potential for data release, such as in data science challenges.
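
The gender-inference scenario from Part 1 can be illustrated with a minimal Python sketch. It is not PerBlur itself: it shows only the generic idea of an adversary whose attribute classifier is trained on the unprotected user-item matrix, followed by a simple, non-personalized obfuscation that adds items indicative of the other group. The toy profiles, the two group labels, the scoring rule, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Toy user-item data (hypothetical): user id -> set of item ids interacted with.
profiles = {
    "u1": {0, 1}, "u2": {1, 2}, "u3": {0, 2, 3},
    "u4": {3, 4}, "u5": {4, 5}, "u6": {0, 3, 5},
}
# Sensitive attribute the adversary wants to infer (two toy groups).
groups = {"u1": "a", "u2": "a", "u3": "a", "u4": "b", "u5": "b", "u6": "b"}


def item_scores(profiles, groups):
    """Score each item by P(item | group a) - P(item | group b)."""
    counts = {"a": defaultdict(int), "b": defaultdict(int)}
    sizes = {"a": 0, "b": 0}
    for user, items in profiles.items():
        g = groups[user]
        sizes[g] += 1
        for item in items:
            counts[g][item] += 1
    all_items = set().union(*profiles.values())
    return {i: counts["a"][i] / sizes["a"] - counts["b"][i] / sizes["b"]
            for i in all_items}


def infer(items, scores):
    """Adversary's guess: group a if the profile's summed score is positive."""
    return "a" if sum(scores.get(i, 0.0) for i in items) > 0 else "b"


def accuracy(profiles, groups, scores):
    """Fraction of users whose group the adversary guesses correctly."""
    hits = sum(infer(items, scores) == groups[u] for u, items in profiles.items())
    return hits / len(profiles)


def obfuscate(profiles, groups, scores, k=2):
    """Add to each profile up to k unseen items most indicative of the other group."""
    protected = {}
    for user, items in profiles.items():
        sign = 1 if groups[user] == "a" else -1
        # Items whose score points away from the user's own group, strongest first.
        candidates = sorted(
            (i for i in scores if i not in items and sign * scores[i] < 0),
            key=lambda i: (sign * scores[i], i),
        )
        protected[user] = items | set(candidates[:k])
    return protected


scores = item_scores(profiles, groups)  # adversary is trained on unprotected data
protected = obfuscate(profiles, groups, scores)
acc_before = accuracy(profiles, groups, scores)
acc_after = accuracy(protected, groups, scores)
print(f"inference accuracy: original={acc_before:.2f}, obfuscated={acc_after:.2f}")
```

On this toy data the adversary's accuracy drops once the added items enter the profiles. The sketch does not measure recommendation quality, and it leaves out the personalization step that would select added items matching each user's own preferences.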

Part 2 investigates attacks on output data, focusing on model inversion attacks aimed at predictions from machine learning classifiers and examining potential privacy risks associated with recommender system outputs. Firstly, we explore a scenario where an adversary attempts to infer individuals' sensitive information by querying a machine learning model and receiving output predictions. We investigate various attack models and identify a potential risk of sensitive information leakage when the target model is trained on original data. To mitigate this risk, we propose to replace the original training data with protected data, obtained by combining synthetic training data with privacy-preserving techniques. We show that the target model trained on protected data achieves performance comparable to the target model trained on original data. We demonstrate that applying privacy-preserving techniques to synthetic training data yields a small reduction in the success of certain model inversion attacks, measured over a group of target individuals. Secondly, we explore an attack scenario in which the adversary seeks to infer users' sensitive information by intercepting recommendations provided by a recommender system to a set of users. Our goal is to gain insight into possible unintended consequences of using user attributes as side information in context-aware recommender systems. We study the extent to which personal attributes of a user can be inferred from a list of recommendations to that user. We find that both standard recommenders and context-aware recommenders leak personal user information into the recommendation lists.
We demonstrate that using user attributes in context-aware recommendations yields a small gain in accuracy. However, the benefit of this gain is distributed unevenly among users and it sacrifices coverage and diversity. This leads us to question the actual value of side information and the need to ensure that there are no hidden 'side effects'.

The final chapter of the thesis summarizes our findings. It provides recommendations for future research directions which we think are promising for further exploring and promoting the use of purpose-aware privacy-preserving data for ML predictions.

When Machine Learning Models Leak

An Exploration of Synthetic Training Data

Conference paper (2022) - M. Slokom (author), Peter Paul de Wolf (author), M.A. Larson (author)

We investigate an attack on a machine learning classifier that predicts the propensity of a person or household to move (i.e., relocate) in the next two years. The attack assumes that the classifier has been made publicly available and that the attacker has access to informatio ...

Machine Learning Meets Data Modification

The Potential of Pre-processing for Privacy Enhancement

Book chapter (2022) - Giuseppe Garofalo (author), M. Slokom (author), Davy Preuveneers (author), Wouter Joosen (author), M. Larson (author)

We explore how data modification can enhance privacy by examining the connection between data modification and machine learning. Specifically, machine learning “meets” data modification in two ways. First, data modification can protect the data that is used to train machine learn ...

Towards user-oriented privacy for recommender system data

A personalization-based approach to gender obfuscation for user profiles

Journal article (2021) - M. Slokom (author), Alan Hanjalic (author), MA Larson (author)

In this paper, we propose a new privacy solution for the data used to train a recommender system, i.e., the user–item matrix. The user–item matrix contains implicit information, which can be inferred using a classifier, leading to potential privacy violations. Our solution, calle ...

SimuRec

Workshop on synthetic data and simulation methods for recommender systems research

Conference paper (2021) - Michael D. Ekstrand (author), Allison Chaney (author), Pablo Castells (author), Robin Burke (author), David Rohde (author), M. Slokom (author)

There is significant interest lately in using synthetic data and simulation infrastructures for various types of recommender systems research. However, there are not currently any clear best practices around how best to apply these methods. We proposed a workshop to bring togethe ...

Privacy and Audiovisual Content

Protecting Users as Big Multimedia Data Grows Bigger

Book chapter (2019) - M.A. Larson (author), J. Choi (author), M. Slokom (author), Zekeriya Erkin (author), Gerald Friedland (author), AP de Vries (author)

This chapter discusses the relationship between privacy and algorithms that make use of large amounts of multimedia data. As users continue to post their audiovisual content online, and as companies continue to collect user profiles and interaction data, concerns about privacy ar ...

Up close, but not too personal

Hypotargeting for recommender systems

Conference paper (2019) - M.A. Larson (author), M. Slokom (author)

Hypotargeting for recommender systems (hyporec) is the idea of controlling the number of unique lists of items that a recommender system can recommend to users during a given time period. The main advantage of hyporec is oversight. If a recommender system offers only a finite num ...

BlUrM(or)e

Revisiting gender obfuscation in the user-item matrix

Conference paper (2019) - Christopher Strucks (author), M. Slokom (author), M. Larson (author)

Past research has demonstrated that removing implicit gender information from the user-item matrix does not result in substantial performance losses. Such results point towards promising solutions for protecting users’ privacy without compromising prediction performance, which ar ...

Data masking for recommender systems

Prediction performance and rating hiding

Conference paper (2019) - M. Slokom (author), MA Larson (author), Alan Hanjalic (author)

Data science challenges allow companies, and other data holders, to collaborate with the wider research community. In the area of recommender systems, the potential of such challenges to move forward the state of the art is limited due to concerns about releasing user interaction ...

Comparing recommender systems using synthetic data

Conference paper (2018) - M. Slokom (author)

In this work, we propose SynRec, a data protection framework that uses data synthesis. The goal is to protect sensitive information in the user-item matrix by replacing the original values with synthetic values or, alternatively, completely synthesizing new users. The synthetic d ...