Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit
2026-06-02 • Computation and Language
Computation and Language, Computers and Society
AI summary
The authors tested whether large language models alone can accurately classify online discussion of misinformation, using Reddit comments about environment, health, and immigration claims. They found that smaller, fine-tuned models like RoBERTa outperform large zero-shot models at identifying whether a comment endorses or corrects a misinformation claim. Bigger models did not necessarily do better, and some refused or struggled with sensitive content because of safety restrictions. The authors conclude that training models specifically for the task is still more reliable than scaling up general language models for misinformation response classification.
large language models, misinformation classification, fine-tuning, zero-shot learning, RoBERTa, macro-F1 score, PolitiFact, Reddit comments, belief detection, safety-alignment
Authors
JooYoung Lee, Lin Tian, Angela Brillantes, Adriana-Simona Mihăiţă, Marian-Andrei Rizoiu
Abstract
As large language models (LLMs) become default tools for online information verification, an implicit assumption follows them: that scale and general capability are sufficient for nuanced classification of misinformation discourse. We test this assumption directly on 900 Reddit comments spanning three PolitiFact-verified misinformation claims (environment, health, immigration), labelled as belief (propagates the claim), fact-check (corrects it), or other. We compare nine models across three paradigms -- BART-MNLI, three Llama variants, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, Claude Sonnet 4.6), and fine-tuned DistilBERT and RoBERTa -- under universal and topic-specific label schemas. The assumption does not hold. Fine-tuned RoBERTa reaches 0.62 macro-$F_1$ against a best zero-shot result of 0.50 (Claude Haiku 4.5), at a fraction of the per-query cost; the supervised advantage is concentrated on the belief class, the implicit, affective category every zero-shot model under-detects. Scaling does not help: Llama-3-8B matches Llama-3-70B, and Claude Sonnet 4.6 underperforms the smaller Haiku under generic labels, collapsing belief detection to 0.17 and refusing outright on a subset of comments flagged as sensitive. This is a safety-alignment artefact, not a capacity limit. Label schema and topic jointly shape zero-shot performance, with the same model varying by more than 0.13 macro-$F_1$ across topics under matched labels. In a verification context, where missing belief is the costlier error, task-specific fine-tuning remains the more reliable choice despite the proliferation of large generative models.
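The headline numbers above are macro-$F_1$ scores over the three labels (belief, fact-check, other). As a minimal sketch of why that metric is sensitive to a weak belief class, the snippet below computes per-class and macro-averaged $F_1$ in plain Python. The function names and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence

# Label names follow the abstract; everything else here is hypothetical.
LABELS = ("belief", "fact-check", "other")


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the label is never true and never predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data (made up for illustration): every "belief" comment is
# misread as "other", while the remaining comments are classified correctly.
y_true = ["belief"] * 2 + ["fact-check"] * 2 + ["other"] * 6
y_pred = ["other"] * 2 + ["fact-check"] * 2 + ["other"] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.2f}")
print(f"macro-F1 = {score:.2f}")
```

In this toy case accuracy stays high because most comments are "other", while macro-$F_1$ drops sharply because the belief class contributes an $F_1$ of zero. That is the sense in which under-detecting belief dominates the comparison reported in the abstract.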