Disagreeing Rationales: Rethinking Classification and Explainability Evaluation in Hate Speech Detection

2026-05-29 · Computation and Language

AI summary

The authors study how people disagree not just on labels in text tasks, but also on the explanations (called rationales) they give for those labels. They look at different ways to collect and evaluate these rationales, especially in tricky tasks like detecting hate speech where opinions vary a lot. By testing many models and evaluation methods together, they find that using softer, less strict ways to represent labels and rationales works better for capturing these differences. Their work suggests we should rethink how we measure explanations in language tasks with subjective content.

human disagreement, rationales, label aggregation, subjective NLP, hate speech detection, explainability metrics, predictive metrics, distributional metrics, plausibility, faithfulness
Authors
Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank, Fosca Giannotti
Abstract
Human disagreement is ubiquitous and well known in labeling. However, variation in explanations, captured through token-level human rationales, remains far less explored. At the same time, it is unclear how best to evaluate human labels and rationales, or even how best to aggregate rationales beyond majority vote, in light of this variation. Yet rationales may provide additional insights into the richness of human reasoning, which may differ in style, values, and interpretations, especially in subjective NLP tasks like hate speech detection. In this work, we unify diverse models, training strategies, loss functions, and existing evaluation metrics under a single protocol by systematically re-implementing them across different label and rationale representation spaces. Classification metrics are organized around two key properties, predictive and distributional, while explainability metrics are organized along three complementary dimensions: plausibility, faithfulness, and complexity. In this unified supervision framework, we evaluate model behavior across classification and explainability metrics, as well as metric sensitivity to the choice of label representation (hard and soft) and rationale representation space (hard, intermediate, and soft). Results show that both hard and soft metrics favor softer representations, highlighting their effectiveness in capturing variation and the need to rethink evaluation in subjective NLP.
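
To make the hard versus soft distinction concrete, the following is a minimal sketch of how per-annotator labels and token-level rationale masks might be aggregated. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 0.5 threshold, the tie-breaking rule, and the toy annotations are all hypothetical, and the "intermediate" rationale space is not defined in the abstract, so it is not attempted here.

```python
from collections import Counter


def hard_label(votes):
    """Majority vote over annotator labels.

    Ties go to the label seen first (an arbitrary, assumed choice).
    """
    return Counter(votes).most_common(1)[0][0]


def soft_label(votes, classes):
    """Empirical label distribution: fraction of annotators per class."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def soft_rationale(masks):
    """Per-token fraction of annotators who marked the token as rationale.

    `masks` holds one binary list per annotator, all of equal length.
    """
    n = len(masks)
    return [sum(column) / n for column in zip(*masks)]


def hard_rationale(masks, threshold=0.5):
    """Binarize the soft rationale; 0.5 approximates a majority vote."""
    return [1 if p >= threshold else 0 for p in soft_rationale(masks)]


if __name__ == "__main__":
    # Toy data (invented for illustration): three annotators, four tokens.
    votes = ["hate", "hate", "normal"]
    masks = [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
    ]
    print(hard_label(votes))
    print(soft_label(votes, ["hate", "normal"]))
    print(soft_rationale(masks))
    print(hard_rationale(masks))
```

In this sketch the hard representations discard the minority annotator entirely, while the soft ones keep the disagreement as a distribution over classes or a per-token score, which is the kind of variation the abstract argues evaluation should account for.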