GlossAssist -- A Tool to Simplify Corpus Creation and Study the Effect of NLP Models in Low-Resource Documentation Settings

2026-06-03

Computation and Language, Human-Computer Interaction
AI summary

The authors created GlossAssist, a tool that helps linguists produce interlinear glossed text (IGT), which is slow and costly to make by hand. The system is built on CWoMP (Contrastive Word-Morpheme Pre-training), a retrieval-based architecture that grounds its predictions in an editable lexicon of learned morpheme representations. When a linguist corrects a mistake, the correction expands the lexicon and improves future predictions without retraining the model. The authors argue that this kind of feedback loop should be a design requirement for NLP tools aimed at documentary linguists.

Interlinear Glossed Text, Language Documentation, Automated Glossing, CWoMP, Morpheme, Active Learning, Lexicon, Natural Language Processing, Field Linguistics, Annotation Tools
Authors
Bhargav Shandilya, Matt Buchholz, Alexis Palmer
Abstract
Interlinear glossed text (IGT) is the standard format for linguistic annotation in language documentation. Producing it manually, however, is often slow and costly. Automated glossing systems have improved substantially in recent years, but adoption among field linguists remains limited. Existing tools are designed to be evaluated rather than used, offering no interpretable path for correction or the incorporation of linguistic expertise back into model behavior. We present GlossAssist, a glossing tool built around the retrieval-based architecture of CWoMP (Contrastive Word-Morpheme Pre-training), which grounds predictions in a mutable lexicon of learned morpheme representations. In conjunction with CWoMP, our system treats each correction by an annotator as part of an active learning setting, which expands the lexicon and improves future predictions without having to retrain the model. In this paper, we present our interface and argue that this feedback loop should be treated as a design requirement for NLP tools aimed at documentary linguists.
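
The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the CWoMP model or the GlossAssist code: the class `MutableLexicon`, its methods, and the `embed` function are hypothetical names introduced here, the character-bigram vectors are a toy stand-in for learned morpheme representations, and the morphemes and glosses are invented rather than taken from any language. It only shows the general idea that a retrieval-based predictor can change its output after a correction by editing its lexicon, with no model parameters updated.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: a bag of character bigrams."""
    padded = f"#{text}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MutableLexicon:
    """Hypothetical lexicon: glosses are retrieved, never generated."""

    def __init__(self):
        self.entries = []  # list of (surface form, gloss, vector)

    def add(self, surface: str, gloss: str) -> None:
        self.entries.append((surface, gloss, embed(surface)))

    def predict(self, surface: str):
        """Return the gloss of the most similar stored entry, if any."""
        if not self.entries:
            return None
        query = embed(surface)
        best = max(self.entries, key=lambda entry: cosine(query, entry[2]))
        return best[1]

    def correct(self, surface: str, gloss: str) -> None:
        """Record an annotator correction by editing the lexicon only."""
        self.entries = [e for e in self.entries if e[0] != surface]
        self.add(surface, gloss)


# Invented toy data, for illustration only.
lexicon = MutableLexicon()
lexicon.add("kata", "house")
lexicon.add("-ni", "LOC")

before = lexicon.predict("-na")  # nearest stored entry is "-ni"
lexicon.correct("-na", "PL")     # the annotator supplies the right gloss
after = lexicon.predict("-na")   # the new entry is now the nearest match

print(before, after)
```

In this sketch the prediction for the unseen form changes as soon as the correction is stored, because prediction is a lookup over the lexicon rather than a function of fixed model weights. A real system would differ in how entries are encoded and how candidates are ranked.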