PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
2026-05-06 • Computation and Language
Computation and Language • Artificial Intelligence • Machine Learning
AI summary
The authors created a system to detect polarization in text across 22 languages by fine-tuning a specialized language model for each language. They augmented the training data with synthetic examples generated by GPT-4o-mini, filtered carefully to keep quality high. By tuning decision thresholds per language and combining predictions from different-sized models, they improved performance without extra training. Their approach achieved strong results, placing second overall and first in three languages, while other model types they tested did not generalize well to the test data.
SemEval • Polarization Detection • Binary Classification • Low-Rank Adaptation (LoRA) • Synthetic Data Generation • GPT-4o-mini • Threshold Tuning • Ensemble Models • Macro-F1 Score • Generalization
Authors
Srikar Kashyap Pulipaka
Abstract
We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline that includes embedding-based deduplication. We find that per-language threshold tuning on the development set yields F1 improvements of 2 to 4% without retraining. We also use weighted ensembles of the 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall among participating teams, with 1st-place finishes in 3 languages and top-3 finishes in 8. We also find that alternative architectures (XLM-RoBERTa, Qwen3), despite strong development-set performance, suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.
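As a rough illustration of the embedding-based deduplication step mentioned in the abstract, the sketch below greedily drops synthetic examples whose embedding is too similar to one already kept. The encoder model name and the 0.9 cosine-similarity cutoff are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def dedup_synthetic(texts: list[str], sim_cutoff: float = 0.9) -> list[str]:
    """Greedy near-duplicate filter over sentence embeddings.

    Assumptions: a multilingual sentence encoder (the model name is a
    placeholder) and a 0.9 cosine-similarity cutoff; the paper's
    multi-stage filtering pipeline may differ.
    """
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    # Normalized embeddings make the dot product equal cosine similarity.
    emb = encoder.encode(texts, normalize_embeddings=True)
    kept_texts: list[str] = []
    kept_emb: list[np.ndarray] = []
    for text, e in zip(texts, emb):
        if all(float(np.dot(e, k)) < sim_cutoff for k in kept_emb):
            kept_texts.append(text)
            kept_emb.append(e)
    return kept_texts
```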
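The per-language threshold tuning could look like the grid search below: for each language, pick the decision threshold that maximizes macro-F1 on that language's development split. The grid range and step are illustrative choices, not the authors' stated settings.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(dev_probs: np.ndarray, dev_labels: np.ndarray) -> float:
    """Grid-search the decision threshold that maximizes macro-F1.

    `dev_probs` are positive-class probabilities from one language's
    fine-tuned model; the 0.05..0.95 grid is an assumed choice.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 91):
        preds = (dev_probs >= t).astype(int)
        f1 = f1_score(dev_labels, preds, average="macro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```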
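Similarly, the weighted 12B/27B ensemble can be sketched as a convex combination of the two models' probabilities, thresholded with the tuned per-language cutoff. The default weight is a placeholder; per the abstract's "per-language strategy selection", both the weight and whether to ensemble at all would be chosen per language on the development set.

```python
import numpy as np

def ensemble_predict(probs_12b: np.ndarray, probs_27b: np.ndarray,
                     weight_27b: float = 0.6,
                     threshold: float = 0.5) -> np.ndarray:
    """Weighted average of the 12B and 27B model probabilities.

    Both the weight and the threshold would be selected per language on
    the development set; the defaults here are illustrative only.
    """
    combined = (1.0 - weight_27b) * probs_12b + weight_27b * probs_27b
    return (combined >= threshold).astype(int)
```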