Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

2026-05-22
Computation and Language

AI summary

The authors address the problem of teaching language models to work well in languages with little training data by transferring knowledge from a language with abundant data, such as English. They introduce a simple method called LINK, in which some English words in the training data are swapped with their translations using bilingual word lists. This word swapping helps the model carry knowledge learned from English over to the target language without needing extra complex resources. Their tests on eight languages and five model sizes show improved performance on tasks in the target language and up to a 2x training speedup to reach equivalent performance.

cross-lingual knowledge transfer, multilingual language models, pretraining, lexical substitution, bilingual vocabularies, low-resource languages, model training speed, downstream tasks, word-level translation, data augmentation
Authors
Anastasiia Sedova, Natalie Schluter, Skyler Seto, Maartje ter Hoeve
Abstract
Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target-language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages, resources that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations. The method requires no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
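The substitution step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`substitute_words`, `apply_substitutions`), the `doc_fraction` parameter, the regex-based word splitting, and the toy English-German vocabulary are all hypothetical, and the sketch assumes the replacement ratio is applied to words that have an entry in the bilingual vocabulary, which the abstract does not specify.

```python
import random
import re


def substitute_words(text, bilingual_vocab, replacement_ratio, rng):
    """Swap a random subset of dictionary-covered words with a translation.

    Assumption: the ratio is taken over words present in the vocabulary.
    """
    # Split into alternating word / non-word tokens so that spacing and
    # punctuation are reproduced exactly when the tokens are joined again.
    tokens = re.findall(r"\w+|\W+", text)
    # Positions of tokens that have at least one translation available.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in bilingual_vocab]
    n_swap = round(replacement_ratio * len(candidates))
    for i in rng.sample(candidates, n_swap):
        # If a word has several translations, pick one at random.
        tokens[i] = rng.choice(bilingual_vocab[tokens[i].lower()])
    return "".join(tokens)


def apply_substitutions(documents, bilingual_vocab, doc_fraction, replacement_ratio, seed=0):
    """Apply word swapping to roughly `doc_fraction` of the documents."""
    rng = random.Random(seed)
    result = []
    for doc in documents:
        if rng.random() < doc_fraction:
            result.append(substitute_words(doc, bilingual_vocab, replacement_ratio, rng))
        else:
            result.append(doc)  # left as plain English
    return result


# Toy vocabulary for illustration only (English -> German).
toy_vocab = {"cat": ["Katze"], "dog": ["Hund"], "water": ["Wasser"]}
corpus = ["The cat saw the dog.", "Plants need water to grow."]
mixed = apply_substitutions(corpus, toy_vocab, doc_fraction=0.5, replacement_ratio=0.5)
```

Because the sketch only rewrites the training text, the resulting corpus could in principle be fed to an unchanged pretraining pipeline, which matches the abstract's point that no auxiliary model or extra training stage is needed.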