Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data

Yang, J.; Smirnova, Alisa; Yang, Dingqi; Demartini, Gianluca; Lu, Yuan; Cudré-Mauroux, Philippe

Book chapter (2019)

Authors

J. Yang University of Fribourg

Alisa Smirnova University of Fribourg

Dingqi Yang University of Fribourg

Gianluca Demartini University of Queensland

Yuan Lu ING Bank

Philippe Cudré-Mauroux University of Fribourg

Affiliation

External organisation

To reference this document use:

http://resolver.tudelft.nl/uuid:6fb74821-6207-4289-b09f-ee232f74f417

Published Date

2019

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper presents Scalpel-CD, a first-of-its-kind system that leverages both human and machine intelligence to debug noisy labels in the training data of machine learning systems. Our system identifies potentially wrong labels using a deep probabilistic model, which infers the latent class of a high-dimensional data instance by exploiting data distributions in the underlying latent feature space. To minimize crowd effort, it employs a data sampler that selects the data instances that would benefit the most from being inspected by the crowd. The manually verified labels are then propagated to similar data instances in the original training data by exploiting the underlying data structure, thus scaling out the contribution from the crowd. Scalpel-CD is designed with a set of algorithmic solutions that automatically search for the optimal configuration for different types of training data, in terms of the underlying data structure, noise ratio, and noise type (random vs. structural). In a real deployment on multiple machine learning tasks, we demonstrate that Scalpel-CD is able to improve label quality by 12.9% with only 2.8% of instances inspected by the crowd.
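
The propagation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in a plain nearest-neighbour rule with a fixed distance threshold, and the function name `propagate_labels`, the `radius` parameter, and the toy data are all hypothetical, chosen only to show how a few crowd-verified labels might be spread to similar instances.

```python
import math


def propagate_labels(features, labels, verified, radius):
    """Copy crowd-verified labels to nearby unverified instances.

    features: list of feature vectors (lists of floats)
    labels:   list of current, possibly noisy, labels
    verified: dict mapping instance index -> crowd-verified label
    radius:   hypothetical distance threshold defining "similar"
    """
    new_labels = list(labels)
    for i, x in enumerate(features):
        if i in verified:
            # A verified instance always takes its crowd label.
            new_labels[i] = verified[i]
            continue
        # Find the closest verified instance in feature space.
        nearest, best = None, math.inf
        for j in verified:
            d = math.dist(x, features[j])
            if d < best:
                nearest, best = j, d
        # Overwrite the label only if that neighbour is close enough.
        if nearest is not None and best <= radius:
            new_labels[i] = verified[nearest]
    return new_labels


# Toy data (assumed): two tight clusters plus one isolated point.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]]
labels = ["cat", "dog", "dog", "cat", "dog"]
verified = {0: "cat", 2: "dog"}  # only two instances inspected by the crowd

cleaned = propagate_labels(features, labels, verified, radius=0.5)
print(cleaned)
```

In this toy run, two verified labels correct two noisy neighbours, while the isolated instance keeps its original label because no verified instance lies within the threshold. The system described in the abstract operates in a learned latent feature space rather than on raw features, which this sketch does not attempt to model.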

Metadata only record. There are no files for this book chapter.