Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

2026-03-12

Machine Learning · Artificial Intelligence
AI summary

The authors explain that instead of seeing pretrained model parameters as a single best starting point, we can think of them as a whole set of possible solutions, some of which are experts for specific tasks. In small models, these expert solutions are very rare and hard to find, but in large, well-trained models, there are many experts nearby. They test a simple method where they randomly tweak parameters, pick the best few, and combine their outputs, finding it works about as well as more complex tuning methods. This suggests large models have many useful specialized solutions close by that can be leveraged easily.

Pretraining · Parameter vector · Gradient descent · Task-specific experts · Model ensemble · Parameter perturbation · PPO · GRPO · Evolution Strategies · Large-scale models
Authors
Yulu Gan, Phillip Isola
Abstract
Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and Evolution Strategies (ES) for contemporary large-scale models.
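The sample-select-ensemble procedure in the abstract can be sketched in a few lines. The toy linear "model", the perturbation scale `sigma`, the synthetic task, and all helper names below are illustrative assumptions, not the authors' implementation; in the toy setup the same data is reused for selection and evaluation for simplicity, where a real run would score candidates on held-out data.

```python
# Sketch: sample N random perturbations of pretrained weights, keep the
# top K by task score, and ensemble their predictions by majority vote.
import random
from collections import Counter

random.seed(0)

def predict(weights, x):
    # Toy linear classifier: sign of the dot product -> class 0 or 1.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def accuracy(weights, data):
    return sum(predict(weights, x) == y for x, y in data) / len(data)

def perturb(weights, sigma=0.5):
    # Isotropic Gaussian perturbation around the pretrained weights.
    return [w + random.gauss(0.0, sigma) for w in weights]

def sample_select_ensemble(pretrained, data, n=200, k=8):
    # 1) Sample N perturbations (embarrassingly parallel in principle).
    candidates = [perturb(pretrained) for _ in range(n)]
    # 2) Select the top K candidates by task score.
    top_k = sorted(candidates, key=lambda w: accuracy(w, data),
                   reverse=True)[:k]
    # 3) Ensemble the K experts' predictions via majority vote.
    def vote(x):
        return Counter(predict(w, x) for w in top_k).most_common(1)[0][0]
    return vote

# Synthetic task: label is 1 iff the second feature exceeds the first.
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]
data = [((x1, x2), int(x2 > x1)) for x1, x2 in points]
pretrained = [0.1, 0.1]  # a weak starting point for this particular task

vote = sample_select_ensemble(pretrained, data)
ensemble_acc = sum(vote(x) == y for x, y in data) / len(data)
print(f"pretrained acc: {accuracy(pretrained, data):.2f}, "
      f"ensemble acc: {ensemble_acc:.2f}")
```

The point of the sketch is the structure of the method, not the numbers: the sampling step needs no gradients and no communication between candidates, which is what makes the approach fully parallel.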