Distance Based Source Domain Selection for Automated Sentiment Classification

Master thesis (2018)

Authors

L.E. Razoux Schultz (Mechanical Engineering)

Contributors

P. Mohajerin Esfahani (mentor)

M. Loog (mentor)

T. Keviczky (mentor)

Faculty

Mechanical Engineering

To reference this document use:

http://resolver.tudelft.nl/uuid:bc430a45-3377-40de-9408-428b39b4f196

Published Date

04-04-2018

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated Sentiment Classification (SC) of short text fragments is an emerging field of research. Various machine learning techniques and word representation models have proven successful in classifying the sentiment of opinion expressions in various domains, i.e. different topics or source media. However, when the source domain used for training differs from the target domain of interest, a large domain shift arises, resulting in poor cross-domain classification performance.

In this report, we first introduce the key principles of SC, starting with the SC pipeline and the domain shift encountered in it. We then present a novel method for selecting a source domain using four unsupervised distance measures: chi-squared distance, Maximum Mean Discrepancy (MMD), Earth Mover’s Distance (EMD) and Kullback-Leibler Divergence (KLD). We evaluate how effectively these unsupervised measures, individually and in a linear combination, identify one or more suitable source domains for an SC task across various target domains. We propose this linear combination as the CMEK model, an acronym of the four measures it uses.
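
The idea of scoring candidate source domains by a weighted sum of the four distances can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the domains are represented as small term-count matrices, the domain names `books` and `kitchen` are hypothetical, MMD uses a linear kernel, EMD treats the histogram bins as points on a line with unit spacing, KLD is computed from source to target with a small smoothing constant, and the weights are equal placeholders rather than fitted values.

```python
import numpy as np

def _normalize(h, eps=1e-9):
    # Smooth and normalize a count histogram (eps is an assumed smoothing constant).
    h = np.asarray(h, dtype=float) + eps
    return h / h.sum()

def chi_squared(p, q):
    # Chi-squared distance between two normalized histograms.
    p, q = _normalize(p), _normalize(q)
    return float(0.5 * np.sum((p - q) ** 2 / (p + q)))

def mmd_linear(X, Y):
    # Squared MMD with a linear kernel: squared distance between feature means.
    d = X.mean(axis=0) - Y.mean(axis=0)
    return float(d @ d)

def emd_1d(p, q):
    # EMD on a shared, ordered 1D support with unit bin spacing (assumed ground distance).
    p, q = _normalize(p), _normalize(q)
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

def kld(p, q):
    # Kullback-Leibler divergence KL(p || q); the direction is an assumption.
    p, q = _normalize(p), _normalize(q)
    return float(np.sum(p * np.log(p / q)))

def domain_distances(X_src, X_tgt):
    # Order follows the CMEK acronym: Chi-squared, MMD, EMD, KLD.
    p, q = X_src.sum(axis=0), X_tgt.sum(axis=0)
    return np.array([chi_squared(p, q), mmd_linear(X_src, X_tgt),
                     emd_1d(p, q), kld(p, q)])

def select_source(sources, X_tgt, weights):
    # Score each candidate by a linear combination of distances; lowest score wins.
    scores = {name: float(weights @ domain_distances(X, X_tgt))
              for name, X in sources.items()}
    return min(scores, key=scores.get), scores

# Toy term-count matrices (rows: documents, columns: vocabulary terms).
target = np.array([[3, 1, 0, 0], [2, 2, 0, 0]])
sources = {
    "books": np.array([[3, 1, 0, 0], [2, 1, 1, 0]]),
    "kitchen": np.array([[0, 0, 3, 1], [0, 1, 2, 2]]),
}
weights = np.ones(4) / 4  # placeholder weights, not fitted
best, scores = select_source(sources, target, weights)
print(best, scores)
```

In this toy setup the candidate whose word usage most resembles the target receives the lowest combined score. Selecting multiple source domains would amount to taking the several lowest-scoring candidates instead of only the minimum.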

Results show that selecting the source domain with our proposed CMEK model reduces the adaptation loss by 7 percentage points compared to training on a randomly selected source domain. When selecting multiple domains, our proposed selection method is competitive with training on all data.

In light of its overall performance, we recommend the CMEK model for source domain selection in an SC task. The CMEK model shows strong performance and stable behavior when selecting multiple source domains, and solid performance when selecting the single best domain.

Files

20180319LRS_Final_report.pdf

(pdf | 1.68 Mb)

- Embargo expired on 30-04-2018