Distance Based Source Domain Selection for Automated Sentiment Classification

Automated Sentiment Classification (SC) on short text fragments is an emerging field of research. Different machine learning techniques and word representation models have proven successful in classifying the sentiment of opinion expressions in various domains, i.e. different topics or source media. However, when a classifier is trained on a source domain that differs from the target domain of interest, we encounter a large domain shift, which results in poor cross-domain classification performance.

In this report, we first introduce the key principles of SC, starting with the SC pipeline and the domain shift it encounters. Then, we present a novel method for selecting a source domain using four unsupervised distance measures: Chi-squared distance, Maximum Mean Discrepancy (MMD), Earth Mover’s Distance (EMD) and Kullback-Leibler Divergence (KLD). We evaluate the effectiveness of these unsupervised measures, both individually and in a linear combination, at identifying one or more suitable source domains for an SC task across various target domains. We propose this linear combination as the CMEK model, an acronym of the four measures it uses.
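To make the idea of a linear combination of distance measures concrete, the following is a minimal sketch, not the implementation used in this report. It assumes each domain is a list of non-negative bag-of-words count vectors over a shared vocabulary. The weights, the linear-kernel form of MMD, the per-feature 1-D approximation of EMD, the direction of the KLD, and the domain names are all illustrative assumptions; in practice the weights would be fitted and the measures may need rescaling.

```python
import math

def mean_vector(docs):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def to_distribution(vec, eps=1e-9):
    """Normalise a non-negative vector to sum to 1, with additive smoothing."""
    total = sum(vec) + eps * len(vec)
    return [(x + eps) / total for x in vec]

def chi_squared(p, q):
    """Symmetric chi-squared distance between two discrete distributions."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q))

def kld(p, q):
    """Kullback-Leibler divergence KL(p || q); direction is an assumption."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def mmd_linear(src, tgt):
    """Squared MMD with a linear kernel: squared distance between mean vectors."""
    return sum((a - b) ** 2 for a, b in zip(mean_vector(src), mean_vector(tgt)))

def wasserstein_1d(u, v):
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(u) and u[i] <= left:
            i += 1
        while j < len(v) and v[j] <= left:
            j += 1
        total += abs(i / len(u) - j / len(v)) * (right - left)
    return total

def emd_marginal(src, tgt):
    """Assumed simplification: average the 1-D EMD over each feature dimension."""
    dists = [wasserstein_1d(us, vs) for us, vs in zip(zip(*src), zip(*tgt))]
    return sum(dists) / len(dists)

def combined_distance(src, tgt, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four measures (placeholder equal weights)."""
    p = to_distribution(mean_vector(src))
    q = to_distribution(mean_vector(tgt))
    measures = (chi_squared(p, q), mmd_linear(src, tgt),
                emd_marginal(src, tgt), kld(p, q))
    return sum(w * m for w, m in zip(weights, measures))

def rank_sources(sources, tgt, weights=(0.25, 0.25, 0.25, 0.25)):
    """Return candidate source names, closest to the target first."""
    return sorted(sources, key=lambda name: combined_distance(sources[name], tgt, weights))

# Toy data with hypothetical domain names.
target = [[3, 0, 1], [2, 1, 0]]
sources = {
    "kitchen": [[0, 4, 2], [1, 3, 3]],
    "books": [[3, 1, 1], [2, 0, 0]],
}
print(rank_sources(sources, target))
```

Selecting a single source domain corresponds to taking the first entry of the ranking; selecting multiple domains corresponds to taking the first few.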

Results show that our proposed CMEK model for source domain selection reduces adaptation loss by 7 percentage points compared to training on a randomly selected source domain. When selecting multiple domains, our proposed selection method is competitive with training on all data.

Given its overall performance, we recommend the CMEK model for source domain selection for an SC task. The CMEK model shows strong performance and stable behavior when selecting multiple source domains, and solid performance when selecting the single best domain.


- Embargo expired on 30-04-2018