Using Publisher Partisanship for Partisan News Detection

A Comparison of Performance between Annotation Levels

Abstract

News is the main source of information about events in our neighborhoods and around the globe. In an era of digital news, where sources vary in quality and news spreads fast, it is critical to understand what the public consumes. Partisan news is news that favors certain political parties or ideologies. It is undesirable because news organizations should aim for objective and balanced reporting. An automatic system that can classify news as partisan or non-partisan (a partisan news detector) is therefore desirable. Such a system requires a sufficient amount of labeled data to learn the patterns of partisanship in news articles. However, these labels are expensive to collect, since each article must be inspected manually.

Inferring partisanship labels from the partisanship of publishers is an alternative approach that has been used in previous research. By treating all articles from partisan publishers as partisan news and all articles from non-partisan publishers as non-partisan, a large number of labeled articles can be collected easily. This way of assigning labels is noisy, however, which makes it harder for a detector to learn.
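
As a concrete illustration, the following minimal Python sketch shows how publisher-level labels can be propagated to individual articles. The publisher set and the data layout are hypothetical assumptions for illustration, not the thesis's actual data pipeline.

    # Hypothetical sketch of publisher-level label inference: every article
    # inherits its publisher's partisanship. The publisher set and data
    # layout are illustrative assumptions, not the thesis's actual pipeline.
    from typing import Iterable, List, Tuple

    PARTISAN_PUBLISHERS = {"example-partisan-outlet.com"}  # hypothetical

    def infer_article_labels(
        articles: Iterable[Tuple[str, str]],  # (publisher, article text)
    ) -> List[Tuple[str, int]]:
        """Return (text, label) pairs; label 1 = partisan, 0 = non-partisan."""
        return [
            (text, int(publisher in PARTISAN_PUBLISHERS))
            for publisher, text in articles
        ]

Because every article from a partisan publisher is labeled partisan regardless of its actual content, some non-partisan articles inevitably receive partisan labels, which is the source of the label noise discussed above.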

This thesis compared the performance of publisher-level labels and article-level labels for partisan news detection. The detector was designed as a binary classifier. We compared the performance across several feature sets to ensure that the observed difference was due to the annotation level rather than the choice of a specific classifier. The experiments were performed on two datasets with different properties to ensure the generalizability of the results. We found that classifiers trained with publisher-level labels have higher recall but lower F1-scores than classifiers trained with article-level labels. We also observed that the classifiers overfit to publishers, but reducing this overfitting with feature selection did not improve performance. Comparing the performance difference between the two datasets, we concluded that an important factor in the performance achievable with publisher-level labels is the quality of the publishers included in the dataset, a finding that is valuable for future dataset collection.
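
A minimal sketch of this comparison, assuming scikit-learn, is shown below: the same classifier is trained once with the noisy publisher-level labels and once with article-level labels, then both are scored on an article-level test set. TF-IDF features and logistic regression are stand-ins for the thesis's feature sets and classifiers, and all variable names are hypothetical.

    # Sketch of the annotation-level comparison: identical pipeline, two
    # training label sources, evaluated on the same article-labeled test set.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score, recall_score

    def train_and_score(train_texts, train_labels, test_texts, test_labels):
        vectorizer = TfidfVectorizer(max_features=50_000)
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)
        clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
        preds = clf.predict(X_test)
        return recall_score(test_labels, preds), f1_score(test_labels, preds)

    # Usage (hypothetical variables); per the findings above, one would
    # expect higher recall but a lower F1-score for the publisher-level run:
    # recall_pub, f1_pub = train_and_score(pub_texts, pub_labels, test_texts, test_labels)
    # recall_art, f1_art = train_and_score(art_texts, art_labels, test_texts, test_labels)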

Our work provides benchmark performance for publisher-level labels, which can serve as a baseline for future research that investigates other methodologies for utilizing the noisy labels. Comparing the performance between the two annotation levels, we concluded that partisan news detectors trained with article-level labels are more practical for fully automated systems, since their F1-scores are on average 10% higher than those of detectors trained with publisher-level labels. However, the high recall of the latter makes them applicable in use cases where high recall is desired.
