ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget

2026-04-01 • Computation and Language

Computation and Language, Artificial Intelligence, Information Retrieval

AI summary

The authors created ORBIT, a large dataset of 20,000 complex questions that each need multiple steps of reasoning and checking to answer correctly. They built it without paid services, generating and verifying the questions and answers in four stages. ORBIT covers many topics, and verifying its answers requires searching the whole web. They then used the dataset to train a language model, Qwen3-4B, which performed well on Wikipedia question answering tasks. The authors also released their dataset and code publicly to help others.

language models, web search, dataset generation, multi-step reasoning, question answering, verification, synthetic datasets, Qwen3-4B, GRPO, open source

Authors

Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin

Abstract

Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, which involve multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset of 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework that does not rely on paid API services. The modular framework has four stages: seed creation, question-answer pair generation, and two stages of verification (self and external). ORBIT spans 15 domains, and each training pair requires 4-5 reasoning steps, with external verification requiring search over the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experimental results show that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, demonstrating the utility of synthetic datasets. Our framework, code, and datasets are open-sourced and publicly available.
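To illustrate why short verifiable answers pair naturally with GRPO, the sketch below scores a group of sampled answers to one query and converts the rewards into group-relative advantages (reward minus the group mean, divided by the group standard deviation). It is a minimal illustration under stated assumptions, not the ORBIT training code: the normalized exact-match reward, the function names, the use of the population standard deviation, and the example answers are all hypothetical choices made here.

```python
import re
import statistics
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold: str) -> float:
    """Assumed reward: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if normalize_answer(prediction) == normalize_answer(gold) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group.

    Population standard deviation is an assumption; eps avoids division
    by zero when every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled final answers for a single query.
rollouts = ["The Eiffel Tower", "eiffel tower.", "Louvre", "Arc de Triomphe"]
rewards = [exact_match_reward(r, "Eiffel Tower") for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason query difficulty matters for this style of training.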
