Optimizing Korean-Centric LLMs via Token Pruning

2026-04-17
Computation and Language
AI summary

The authors benchmarked several multilingual large language models, focusing mainly on Korean tasks. They applied a method called token pruning, which removes the tokens and embedding parameters belonging to languages the target application does not need. Their results show that removing unused language tokens makes the models more stable and often better at Korean translation. How well the models follow instructions, however, depends on the specific architecture. Overall, the authors found token pruning useful for saving memory in focused applications, even though it only slightly speeds up inference.

multilingual large language models, token pruning, Korean natural language processing, embedding parameters, vocabulary configurations, machine translation, cross-lingual representations, instruction following, memory optimization, model compression
Authors
Hoyeol Kim, Hyeonwoo Kim
Abstract
This paper presents a systematic benchmark of state-of-the-art multilingual large language models (LLMs) adapted via token pruning, a compression technique that eliminates tokens and embedding parameters corresponding to languages irrelevant to the target application. Focusing on Korean-centric natural language processing (NLP) tasks, we evaluate architectures including Qwen3, Gemma-3, Llama-3, and Aya across three vocabulary configurations: Original, English-Korean (EnKo), and English-Korean-Chinese (EnKoZh). Performance is assessed using established benchmarks for general aptitude, cultural literacy, instruction following, and machine translation. Our findings indicate that token pruning significantly improves generation stability by eliminating language confusion and, in the case of machine translation, frequently enhances performance on Korean-specific tasks. While instruction-following capabilities display architecture-dependent variance linked to latent cross-lingual representations, the substantial reduction in vocabulary size validates token pruning as a highly effective optimization strategy for memory-constrained, domain-specific deployments, even though gains in inference latency are modest.
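The abstract does not give the authors' exact pruning procedure, but the core idea of vocabulary-level token pruning can be sketched in a few lines: filter the vocabulary by Unicode script (here, an assumed EnKo configuration keeping ASCII and Hangul tokens) and keep only the matching embedding rows, recording an old-to-new ID map for remapping the tokenizer. The ranges and function names below are illustrative assumptions, not the paper's implementation.

```python
def keep_token(token: str) -> bool:
    """Illustrative EnKo filter: keep tokens made only of ASCII or Hangul characters.
    (Assumed script ranges; the paper's actual selection criteria are not given.)"""
    return all(
        ch.isascii()
        or 0xAC00 <= ord(ch) <= 0xD7A3   # Hangul syllables
        or 0x1100 <= ord(ch) <= 0x11FF   # Hangul Jamo
        for ch in token
    )

def prune_vocab(vocab, embeddings):
    """vocab: list of token strings; embeddings: list of rows aligned with vocab.
    Returns the pruned vocab, the matching embedding rows, and an old->new ID map."""
    new_vocab, new_emb, id_map = [], [], {}
    for old_id, tok in enumerate(vocab):
        if keep_token(tok):
            id_map[old_id] = len(new_vocab)  # remap surviving token IDs densely
            new_vocab.append(tok)
            new_emb.append(embeddings[old_id])
    return new_vocab, new_emb, id_map

# Toy usage: the Chinese token is dropped, its embedding row removed,
# and the remaining IDs are renumbered contiguously.
vocab = ["hello", "안녕", "你好", "world"]
emb = [[0.1], [0.2], [0.3], [0.4]]
pruned_vocab, pruned_emb, id_map = prune_vocab(vocab, emb)
```

Because the embedding (and tied output) matrix scales with vocabulary size, dropping rows this way directly reduces memory, which matches the paper's finding that the benefit is chiefly memory savings rather than latency.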