On Data Engineering for Scaling LLM Terminal Capabilities

2026-02-24

Computation and Language
AI summary

The authors studied how training data is created and used for large language models that interact with computer terminals. They made a tool called Terminal-Task-Gen to create synthetic training tasks and produced a large open-source dataset named Terminal-Corpus. Using this data, they trained models called Nemotron-Terminal that performed much better on terminal-related tasks, matching bigger models' accuracy. They also shared their models and most of their data publicly to help other researchers.

large language models, synthetic data generation, terminal agents, curriculum learning, long context training, scaling behavior, dataset, model checkpoints, open source, Qwen3
Authors
Renjie Pi, Grace Lam, Mohammad Shoeybi, Pooya Jannaty, Bryan Catanzaro, Wei Ping
Abstract
Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.