Information-Theoretic Requirements for Gradient-Based Task Affinity Estimation in Multi-Task Learning

2026-04-09 · Machine Learning

AI summary

The authors examine why multi-task learning (MTL), which trains a model on several tasks at once, produces inconsistent results. They argue that gradient-based estimates of task relationships are meaningful only when the tasks share training examples; on disjoint inputs, the gradient signal conflates task relationships with distributional shift. Below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Popular benchmarks fall far below this range (MoleculeNet under 5%, TDC at 8-14%), which the authors offer as an explanation for the inconsistent MTL results reported in past research.

Keywords: multi-task learning, gradient alignment, training instance overlap, distributional shift, biological pathway, benchmark datasets, MoleculeNet, TDC, gradient-task correlation, phase transition
Authors
Jasper Zhang, Bryan Cheng
Abstract
Multi-task learning shows strikingly inconsistent results -- sometimes joint training helps substantially, sometimes it actively harms performance -- yet the field lacks a principled framework for predicting these outcomes. We identify a fundamental but unstated assumption underlying gradient-based task analysis: tasks must share training instances for gradient conflicts to reveal genuine relationships. When tasks are measured on the same inputs, gradient alignment reflects shared mechanistic structure; when measured on disjoint inputs, any apparent signal conflates task relationships with distributional shift. We discover this sample overlap requirement exhibits a sharp phase transition: below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Comprehensive validation across multiple datasets achieves strong correlations and recovers biological pathway organization. Standard benchmarks systematically violate this requirement -- MoleculeNet operates at <5% overlap, TDC at 8-14% -- far below the threshold where gradient analysis becomes meaningful. This provides the first principled explanation for seven years of inconsistent MTL results.
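To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes overlap is measured as the Jaccard index of the two tasks' training-instance ID sets (the abstract does not define the measure), and it uses a plain linear model with a squared-error loss as a stand-in for whatever network the paper trains. The function names `sample_overlap`, `overlap_regime`, `gradient_cosine`, and `mse_gradient` are hypothetical. The 30% and 40% thresholds come from the abstract; the regime between them is not characterized there and is labeled accordingly.

```python
import numpy as np


def sample_overlap(ids_a, ids_b):
    """Jaccard index of two tasks' training-instance IDs (assumed overlap measure)."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_regime(overlap, low=0.30, high=0.40):
    """Map an overlap fraction to the regimes named in the abstract."""
    if overlap < low:
        return "noise"            # correlations indistinguishable from noise
    if overlap > high:
        return "reliable"         # gradient analysis reported as meaningful
    return "uncharacterized"      # the abstract makes no claim for 30-40%


def gradient_cosine(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors."""
    g_a = np.asarray(g_a, dtype=float).ravel()
    g_b = np.asarray(g_b, dtype=float).ravel()
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / denom) if denom > 0 else 0.0


def mse_gradient(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w (illustrative stand-in)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))      # hypothetical featurized instances
    ids_a = np.arange(0, 70)           # task A is labeled on instances 0..69
    ids_b = np.arange(40, 100)         # task B is labeled on instances 40..99
    y_a = X @ rng.normal(size=8)       # synthetic labels for task A
    y_b = X @ rng.normal(size=8)       # synthetic labels for task B

    overlap = sample_overlap(ids_a, ids_b)
    shared = np.intersect1d(ids_a, ids_b)

    # Compare gradients only on instances both tasks share, so that any
    # disagreement cannot be attributed to differing input distributions.
    w = np.zeros(8)
    g_a = mse_gradient(w, X[shared], y_a[shared])
    g_b = mse_gradient(w, X[shared], y_b[shared])
    print(overlap, overlap_regime(overlap), gradient_cosine(g_a, g_b))
```

Restricting both gradients to the shared instances is the design choice that mirrors the abstract's argument: when each task's gradient is computed on its own disjoint inputs, a low cosine similarity could reflect the inputs differing rather than the tasks conflicting.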