Will Scaling Improve Social Simulation with LLMs?
2026-07-02 • Computation and Language
AI summary
The authors studied how making language models bigger affects their ability to simulate social behaviors. They found that larger models generally get better at predicting opinions and behaviors, especially for populations that are well-represented in English web text. However, predicting long-term social trends and modeling underrepresented opinions improve more slowly with size. Matching human cognitive biases such as risk aversion, and human learning heuristics, did not improve with bigger models, and fine-tuned models showed no noticeable gains on these tasks either. Overall, the authors suggest that scaling helps most social simulations but not all, and that gains are less reliable in low-resource domains.
Large Language Models, scaling laws, social simulation, opinion modeling, behavioral simulation, longitudinal forecasting, model calibration, human cognitive biases, transformer architectures, fine-tuning
Authors
Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto, Diyi Yang
Abstract
Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.
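To make the idea of a compute scaling law concrete, here is a minimal sketch of fitting a power law between training compute and loss, then extrapolating to a larger budget. The abstract does not state the functional form the authors use, so the pure power law `loss ≈ a * compute**(-b)`, the helper names `fit_power_law` and `predict_loss`, and all numbers below are assumptions chosen for illustration, not results from the paper.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space.

    Assumed functional form (hypothetical): a pure power law with no
    irreducible-loss term. Returns the tuple (a, b).
    """
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
    return 10.0 ** intercept, -slope

def predict_loss(compute, a, b):
    """Evaluate the fitted power law at one or more compute budgets."""
    return a * np.asarray(compute, dtype=float) ** (-b)

# Synthetic, noise-free points spanning 1e18 to 1e20 FLOPs.
# These values are made up for illustration; they are NOT from the paper.
compute = np.logspace(18, 20, 5)
true_a, true_b = 50.0, 0.05
loss = true_a * compute ** (-true_b)

a, b = fit_power_law(compute, loss)

# Extrapolate one order of magnitude beyond the fitted range.
extrapolated = predict_loss(1e21, a, b)
print(f"a={a:.3f}, b={b:.4f}, predicted loss at 1e21 FLOPs={extrapolated:.4f}")
```

A second stage that maps loss to downstream task accuracy, as the abstract describes for the larger open-weight models, would need its own fitted link function, which is left out here because the abstract does not specify it.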