F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
2026-03-19 • Computation and Language • Artificial Intelligence
AI summary
The authors introduce F2LLM-v2, a family of multilingual embedding models in eight sizes that produce text embeddings for more than 200 languages, with particular attention to languages that usually have fewer resources. The models are trained on a large mix of high-quality public data, using matryoshka learning, model pruning, and knowledge distillation to keep them efficient without sacrificing accuracy. The largest model ranks first on several language-understanding benchmarks, and the smaller variants perform well when computing power is limited. All data, code, and models are released openly to support other researchers.
multilingual embedding models, language models (LLMs), model pruning, knowledge distillation, matryoshka learning, mid-resource languages, low-resource languages, MTEB benchmarks, model efficiency, open-source
Authors
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a newly curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, we obtain models that are far more efficient than previous LLM-based embedding models while retaining competitive performance. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
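The abstract mentions matryoshka learning, which trains an embedding so that truncated prefixes of the vector remain useful at lower dimensions. As an illustration only (the paper's exact recipe, prefix sizes, and temperature are not given here; the dimensions and `temp` below are hypothetical), a minimal NumPy sketch of a matryoshka-style objective averages an in-batch contrastive loss over nested embedding prefixes:

```python
import numpy as np

def infonce_at_dim(q, d, k, temp=0.05):
    """In-batch InfoNCE loss on the first k embedding dimensions.

    q, d: (batch, full_dim) query/document embeddings; row i of d is the
    positive for row i of q, all other rows serve as negatives.
    """
    # truncate to the first k dimensions, then re-normalize
    qk = q[:, :k] / np.linalg.norm(q[:, :k], axis=1, keepdims=True)
    dk = d[:, :k] / np.linalg.norm(d[:, :k], axis=1, keepdims=True)
    logits = qk @ dk.T / temp                      # similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # cross-entropy with the matching row as the positive class
    return -np.log(np.diag(probs)).mean()

def matryoshka_loss(q, d, dims=(64, 256, 1024)):
    # average the contrastive loss over nested prefix sizes, so short
    # prefixes of the embedding are supervised alongside the full vector
    return sum(infonce_at_dim(q, d, k) for k in dims) / len(dims)
```

Because every prefix size shares the same leading coordinates, minimizing this average pushes the most discriminative information toward the front of the vector, which is what lets a deployed model truncate embeddings for cheaper storage and retrieval.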