L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

2026-06-23 • Computation and Language

Computation and Language, Machine Learning

AI summary

The authors created a new dataset called L3Cube-MahaPOS to help computers understand parts of speech in Marathi, a language spoken by millions but lacking enough resources for language technology. They manually labeled over 32,000 sentences from news articles using a set of 16 grammar tags, making sure the data was clean and consistent. They tested multiple types of language models on this dataset, with the best reaching about 89% accuracy. The dataset, guidelines, and model results have been shared publicly to support more research on Marathi language processing.

Part-of-Speech tagging, Marathi language, Universal Dependencies, Tokenization, Morphology, Named Entity Recognition, Conditional Random Fields (CRF), BiLSTM, Transformer models, Dataset annotation

Authors

Hariom Ingle, Ronit Ghode, Ishwari Gondkar, Jidnyasa Harad, Raviraj Joshi

Abstract

Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.
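
The two reported metrics, token-level accuracy and macro-F1, can be illustrated with a minimal Python sketch. This is not the authors' evaluation script: the function names `token_accuracy` and `macro_f1`, the optional `labels` argument (a hypothetical way to restrict scoring to a subset of tags, such as the 15 evaluated classes), and the toy tag sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def _paired_tags(gold, pred):
    """Yield (gold_tag, predicted_tag) pairs, checking sentence lengths match."""
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted sentences differ in length")
        yield from zip(g_sent, p_sent)


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = list(_paired_tags(gold, pred))
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)


def macro_f1(gold, pred, labels=None):
    """Unweighted mean of per-tag F1 scores.

    `labels` (an assumption of this sketch) selects which tags are averaged;
    by default every tag that occurs in the gold data is used.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    gold_tags = set()
    for g, p in _paired_tags(gold, pred):
        gold_tags.add(g)
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it did not belong
            fn[g] += 1  # missed an occurrence of gold tag g
    if labels is None:
        labels = sorted(gold_tags)
    scores = []
    for tag in labels:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up tag sequences (not taken from the dataset).
gold = [["NOUN", "VERB", "PUNCT"], ["PRON", "NOUN", "VERB"]]
pred = [["NOUN", "VERB", "PUNCT"], ["PRON", "VERB", "VERB"]]

print(f"accuracy = {token_accuracy(gold, pred):.4f}")
print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

Because macro-F1 weights every tag equally, rare tags with low F1 pull it below token-level accuracy, which is dominated by frequent tags. This is one plausible reading of the gap between the two reported figures.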
