NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

2026-03-05 · Computation and Language

Keywords: Reading comprehension, Low-resource languages, Bangla, Question answering, Transformer models, BERT, Fine-tuning, F1 score, Adversarial examples, Dataset
Authors
Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim
Abstract
Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions: they tend to produce unreliable responses when the correct answer is absent from the context. To address this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions, and it includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning: BERT achieves a 313% relative improvement in F1 score (0.150 to 0.620), and semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering and demonstrate that domain-specific fine-tuning is critical for robust performance in low-resource settings.
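The relative-improvement figure quoted above can be checked directly. A minimal sketch (the function name is illustrative, not from the paper):

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, expressed as a percentage."""
    return (after - before) / before * 100

# BERT F1 reported in the abstract: 0.150 before fine-tuning, 0.620 after.
print(round(relative_improvement(0.150, 0.620)))  # -> 313
```

That is, (0.620 − 0.150) / 0.150 ≈ 3.13, i.e. roughly a 313% relative gain.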