Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

2026-04-22

Computation and Language
AI summary

The authors studied how well a language model trained for coding in one programming language can work in another language without extra training (zero-shot transfer). They found that simply using reinforcement learning (RL) on one language doesn't help and can even hurt performance on other languages. To fix this, they introduced Parallel-SFT, a training method that uses sets of equivalent code written in different languages together, which helps the model learn shared concepts. This approach made the model better at generalizing to new programming languages after RL training. Their analysis showed that Parallel-SFT makes the model's understanding focus more on the code's function rather than its specific language.

Language Model, Programming Languages, Reinforcement Learning (RL), Fine-tuning (SFT), Zero-shot Transfer, Cross-language Transfer, Code Representation, Parallel Programs, Latent Space, Model Generalization
Authors
Zhaofeng Wu, Shiqi Wang, Boya Peng, Anuj Goyal, Melanie Kambadur, Sebastian Ruder, Yoon Kim, Chloe Bi
Abstract
Modern language models demonstrate impressive coding capabilities in common programming languages (PLs), such as C++ and Python, but their performance in lower-resource PLs is often limited by training data availability. In principle, however, most programming skills are universal across PLs, so a capability acquired in one PL should transfer to others. In this work, we propose the task of zero-shot cross-programming-language transfer for code RL. We find that, for Llama-3.1, RL training for code generation in a source PL fails to improve, and sometimes even degrades, performance on other target PLs. To address this, we hypothesize that effective RL transfer requires a generalizable SFT initialization before RL. We thus propose **Parallel-SFT**, an SFT strategy that incorporates "parallel programs" -- functionally equivalent code implemented in multiple PLs -- into the data mixture. We demonstrate that this improves transferability: when we subsequently perform RL on our Parallel-SFT model, we observe better generalization to unseen PLs. Analysis of the model's internal representations reveals that Parallel-SFT leads to a more functionality-centric latent space, where equivalent programs across PLs are more tightly clustered, which we hypothesize contributes to the improved transferability.
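To make the "parallel programs" idea concrete, the sketch below shows one plausible way such data could be organized for SFT: one record bundling functionally equivalent implementations of a task across PLs, expanded into per-language training examples. The paper does not specify its data format, so all field names and helper functions here are hypothetical illustrations, not the authors' actual pipeline.

```python
# Hypothetical sketch of a "parallel program" SFT record. Field names
# (task_id, prompt, implementations) are illustrative assumptions; the
# paper's actual data schema is not described at this level of detail.

def make_parallel_record(task_id, prompt, implementations):
    """Bundle functionally equivalent solutions in multiple PLs into one record."""
    return {
        "task_id": task_id,
        "prompt": prompt,
        # Mapping from PL name to source code implementing the same functionality.
        "implementations": dict(implementations),
    }

def to_sft_examples(record):
    """Expand one parallel record into per-language (prompt, completion) pairs,
    so equivalent programs across PLs appear together in the SFT mixture."""
    return [
        {"prompt": f"{record['prompt']} (in {pl})", "completion": code}
        for pl, code in record["implementations"].items()
    ]

record = make_parallel_record(
    "sum_list",
    "Write a function that sums a list of integers.",
    {
        "python": "def total(xs):\n    return sum(xs)",
        "lua": (
            "function total(xs)\n"
            "  local s = 0\n"
            "  for _, x in ipairs(xs) do s = s + x end\n"
            "  return s\n"
            "end"
        ),
    },
)
examples = to_sft_examples(record)
print(len(examples))  # one SFT example per language in the record
```

Grouping the implementations under a shared task prompt is what distinguishes this from ordinary multi-PL SFT data: the functional equivalence across languages is explicit in the record, which is the property the paper credits with producing a more functionality-centric latent space.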