Data Attribution in Large Language Models via Bidirectional Gradient Optimization
2026-06-03 • Machine Learning
Machine Learning, Computation and Language
AI summary
The authors study how to identify which parts of the training data most influenced the text generated by a large language model. They do this by slightly updating the model on the generated output, in both the gradient ascent and gradient descent directions, and then measuring how the loss on the original training samples changes. Their approach works at arbitrary data granularity, which lets it attribute both the facts and the style a model has learned. They evaluated the method on pretrained models whose training datasets are known and found that it outperforms previous approaches on influence metrics. This helps make AI systems easier to interpret and hold accountable.
Large Language Models, Training Data Attribution, Auto-Regressive Models, Gradient Optimization, Model Interpretability, Influence Metrics, Data Provenance, Gradient Ascent, Gradient Descent, Accountable AI
Authors
Frédéric Berdoz, Luca A. Lanzendörfer, Kaan Bayraktar, Roger Wattenhofer
Abstract
Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
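The inverse formulation in the abstract can be illustrated with a minimal sketch on a toy linear model instead of an LLM. Everything here is an assumption made for illustration: the squared-error loss, the single gradient step in each direction, the learning rate, the function names, and in particular the scoring rule (loss after ascent minus loss after descent), which the abstract does not specify.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy setup: a sample is (features, target) and the "model"
# is a weight vector. A real LLM would use token sequences and
# next-token cross-entropy instead.
Sample = Tuple[Sequence[float], float]


def predict(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w: Sequence[float], sample: Sample) -> float:
    x, y = sample
    return 0.5 * (predict(w, x) - y) ** 2


def grad(w: Sequence[float], sample: Sample) -> List[float]:
    # Gradient of the squared-error loss with respect to the weights.
    x, y = sample
    err = predict(w, x) - y
    return [err * xi for xi in x]


def sgd_step(w: Sequence[float], sample: Sample, lr: float) -> List[float]:
    # lr > 0 is a descent step, lr < 0 is an ascent step.
    g = grad(w, sample)
    return [wi - lr * gi for wi, gi in zip(w, g)]


def bidirectional_scores(
    w: Sequence[float],
    generated: Sample,
    train: Sequence[Sample],
    lr: float = 0.1,
) -> List[float]:
    # Perturb the base weights in both directions on the generated sample.
    w_descent = sgd_step(w, generated, lr)   # as if the output had been trained on
    w_ascent = sgd_step(w, generated, -lr)   # pushed away from the output
    # Assumed score: a training sample is more influential when descending
    # on the output lowers its loss and ascending raises it.
    return [loss(w_ascent, s) - loss(w_descent, s) for s in train]


base_w = [0.0, 0.0]
generated_sample = ([1.0, 0.0], 1.0)
train_samples = [
    ([1.0, 0.0], 1.0),   # agrees with the generated sample
    ([0.0, 1.0], 1.0),   # shares no features with it
    ([1.0, 0.0], -1.0),  # contradicts it
]
scores = bidirectional_scores(base_w, generated_sample, train_samples)
ranking = sorted(range(len(train_samples)), key=lambda i: scores[i], reverse=True)
```

In this toy case the training sample that agrees with the generated one receives a positive score, the unrelated one scores zero, and the contradicting one scores negative. Because the score is computed per sample from a loss, the same pattern applies to whatever unit the loss is evaluated on, which is one way to read the abstract's claim of attribution at arbitrary granularity.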