AI summary
The authors investigated whether the weight changes captured by LoRA adapters reveal which fine-tuning objective was applied to a language model, and whether that signal relates to harmful behavior. They built LoRA adapters with different training objectives and analyzed patterns in the resulting weight changes. Within a single training method, they could identify the objective and rank the severity of the drift. However, classifiers trained on one method did not transfer to another. They also found that geometric features of the weight changes correlate with how often the model gives harmful responses. Together, these results suggest that the geometry of weight changes can help detect fine-tuning objectives and potential harms, but that detectors need to be calibrated separately for each training method.
LoRA (Low-Rank Adaptation), Fine-tuning, Language Model, Spectral Features, Singular-Value Decomposition, Behavioral Harm, DPO (Direct Preference Optimization), Activation Steering, Logistic Regression, Principal Component Analysis
Abstract
We study whether low-rank spectral summaries of LoRA weight deltas can identify which fine-tuning objective was applied to a language model, and whether that geometric signal predicts downstream behavioral harm. In a pre-registered experiment on \texttt{Llama-3.2-3B-Instruct}, we manufacture 38 LoRA adapters across four categories (healthy SFT baselines, DPO on inverted harmlessness preferences, DPO on inverted helpfulness preferences, and activation-steering-derived adapters) and extract per-layer spectral features: norms, stable rank, singular-value entropy, effective rank, and singular-vector cosine alignment to a healthy centroid. Within a single training method (DPO), a logistic regression classifier achieves AUC~1.00 on binary drift detection and on all six pairwise objective comparisons, and recovers ordinal severity ranking almost perfectly ($\rho \geq 0.956$). Principal component analysis on flattened weight deltas shows that training objective lies along PC1 (AUC~1.00 for objective separation), orthogonal to training duration, which lies along PC2. Query-projection weights detect that drift occurred; value-projection weights identify which objective produced it. Cross-method generalization fails completely: a DPO-trained classifier assigns every steering adapter a lower drift score than every DPO adapter (AUC~0.00). In a behavioral evaluation phase, DPO-inverted-harmlessness adapters show elevated harmful compliance on HEx-PHI prompts (mean ASR 0.266 vs.\ healthy 0.112, $\Delta = +0.154$), with a near-perfect dose--response relationship ($\rho = 0.986$). The geometry-to-behavior rank correlation is $\rho = 0.72$ across 24 non-steered adapters. These results establish that, within a controlled manufacturing regime, LoRA weight-space geometry carries objective identity, intensity ordering, and a coarse link to harmful compliance, and that cross-method monitoring requires per-method calibration.
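To make the per-layer spectral features concrete, the following is a minimal NumPy sketch of how several of them could be computed for one LoRA layer. It is an illustration under stated assumptions, not the authors' pipeline: the function name `spectral_features`, the factor shapes, and the tolerance are hypothetical, and the abstract does not give exact definitions. The sketch assumes the common conventions that stable rank is $\|\Delta W\|_F^2 / \|\Delta W\|_2^2$, that singular-value entropy is the Shannon entropy of the singular values normalized to sum to one, and that effective rank is the exponential of that entropy. Cosine alignment to a healthy centroid is omitted because it depends on a reference set of adapters.

```python
import numpy as np


def spectral_features(A, B, rel_tol=1e-10):
    """Spectral summary of one LoRA layer's weight delta (illustrative sketch).

    Assumed shapes: A is (r, d_in), B is (d_out, r), so delta = B @ A.
    The feature definitions are common conventions, not confirmed by the paper.
    """
    delta = B @ A
    # Singular values are returned in descending order.
    s = np.linalg.svd(delta, compute_uv=False)

    # An all-zero delta has no meaningful spectrum; return zeros.
    if s.size == 0 or s[0] == 0.0:
        return {
            "fro_norm": 0.0,
            "spectral_norm": 0.0,
            "stable_rank": 0.0,
            "sv_entropy": 0.0,
            "effective_rank": 0.0,
        }

    # Drop numerically zero singular values so log(p) stays finite.
    s = s[s > rel_tol * s[0]]

    fro_sq = float(np.sum(s ** 2))
    p = s / s.sum()  # normalized singular-value distribution
    sv_entropy = float(-np.sum(p * np.log(p)))

    return {
        "fro_norm": float(np.sqrt(fro_sq)),
        "spectral_norm": float(s[0]),
        "stable_rank": fro_sq / float(s[0] ** 2),
        "sv_entropy": sv_entropy,
        "effective_rank": float(np.exp(sv_entropy)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical rank-8 adapter for a 64 x 64 projection.
    A = rng.normal(size=(8, 64))
    B = rng.normal(size=(64, 8))
    for name, value in spectral_features(A, B).items():
        print(f"{name}: {value:.4f}")
```

Under these definitions, both stable rank and effective rank are bounded above by the adapter rank $r$, so they describe how evenly the update spreads across its available directions. Concatenating such features across layers and projection types would give the kind of feature vector a logistic regression classifier could consume.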