Leveraging LLMs for subjective value detection in argument statements
Abstract
This paper investigates the use of Large Language Models (LLMs) for the automatic detection of subjective values in argument statements in public discourse. Understanding the values underlying argument statements could enhance public discussions and potentially lead to better outcomes. The LLM utilization methods tested were zero-shot and few-shot prompting, as well as chain-of-thought prompting. To evaluate the LLM's predictions, a set of ground-truth labels was required as a baseline. For these labels, either single majority labels or multi-value labels were considered, both derived from aggregated human annotations. Results indicated that LLM performance was suboptimal, achieving a maximum weighted F1 score of 0.594 for single-value chain-of-thought predictions. Additionally, current metrics were found inadequate for assessing LLM performance on a highly subjective task such as value detection, as evidenced by poor scores on multi-value predictions despite subjective evaluation suggesting otherwise. Finally, a last experiment aimed to capture a specific annotator's subjectivity. This yielded inconsistent results, with F1 scores peaking around 0.4, indicating that LLMs are not well suited to emulating individual human subjectivity.
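For reference, the weighted F1 score reported above is taken here, by common convention (the exact averaging scheme is assumed rather than stated in the abstract), to be the support-weighted average of per-value F1 scores:

\[
\text{F1}_{\text{weighted}} = \sum_{c} \frac{n_c}{N}\,\text{F1}_c, \qquad \text{F1}_c = \frac{2\,P_c R_c}{P_c + R_c},
\]

where \(n_c\) is the number of gold instances of value \(c\), \(N = \sum_c n_c\), and \(P_c\), \(R_c\) are the precision and recall for value \(c\).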