T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors address medical image segmentation, which usually requires expensive, detailed labels from experts. They start from a Vision Language Model (VLM) that segments each 2D slice independently and add context from neighboring slices of the 3D scan, which makes the results more accurate and anatomically plausible. Their adapter combines information across slices with information within each slice and learns how much of each to use. Tested on several datasets, it is more accurate than the baseline VLM and generalizes better to new datasets. On an imaging modality unseen during training (MRI), it also outperforms a fully supervised 3D baseline trained only on CT. This suggests the approach helps models transfer information more effectively across medical images.
Medical Image Segmentation, Vision Language Models (VLMs), 3D Medical Imaging, Temporal Transformer, Dice Score, Cross-Domain Generalization, Cross-Modality Evaluation, CLIP Model, Convolutional Neural Networks, Adaptive Gating
Authors
Pranjal Khadka
Abstract
Medical image segmentation traditionally relies on fully supervised 3D architectures that demand large amounts of dense, voxel-level annotation from clinical experts, a prohibitively expensive process. Vision Language Models (VLMs) offer a powerful alternative by leveraging broad visual semantic representations learned from billions of images. However, when applied independently to the 2D slices of a 3D scan, these models often produce noisy, anatomically implausible segmentations that violate the inherent continuity of anatomical structures. We propose a temporal adapter that addresses this by injecting adjacent-slice context directly into the model's visual token representations. The adapter comprises a temporal transformer that attends across a fixed context window at the token level, a spatial context block that refines within-slice representations, and an adaptive gate that balances temporal and single-slice features. Trained on 30 labeled volumes from the FLARE22 dataset, our method achieves a mean Dice of 0.704 across 13 abdominal organs, a gain of +0.206 over the baseline VLM trained with no temporal context. Zero-shot evaluation on the BTCV and AMOS22 datasets yields consistent improvements of +0.210 and +0.230, with the average cross-domain performance drop falling from 38.0% to 24.9%. Furthermore, in a cross-modality evaluation on AMOS22 MRI, with neither model receiving any MRI supervision, our method achieves a mean Dice of 0.366, outperforming a fully supervised 3D baseline (DynUNet, 0.224) trained exclusively on CT. This suggests that CLIP's visual semantic representations generalize more gracefully across imaging modalities than convolutional features.
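To make the adapter's data flow concrete, the following is a minimal NumPy sketch of two of the three components named in the abstract: token-level attention across a fixed window of adjacent slices, and a sigmoid gate that blends temporal and single-slice features. It is an illustration under stated assumptions, not the paper's implementation. The function names, tensor shapes, the projection-free single-head attention, and the gate parameterization are all hypothetical, and the spatial context block is omitted because the abstract does not specify its form.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(tokens, center):
    """Attend across slices at each token position.

    tokens: (K, N, D) visual tokens for K adjacent slices, N tokens per slice.
    Assumption: single-head attention with no learned projections, where each
    token of the center slice attends to the same-position token in all K slices.
    """
    _, _, dim = tokens.shape
    query = tokens[center]                                         # (N, D)
    scores = np.einsum("nd,knd->nk", query, tokens) / np.sqrt(dim)  # (N, K)
    weights = softmax(scores, axis=-1)                             # sums to 1 over slices
    return np.einsum("nk,knd->nd", weights, tokens)                # (N, D)


def gated_fusion(single, temporal, W, b):
    """Blend single-slice and temporal features with a learned sigmoid gate.

    Assumption: the gate is a linear map of the concatenated features,
    g = sigmoid([single; temporal] @ W + b), applied per token and channel.
    """
    logits = np.concatenate([single, temporal], axis=-1) @ W + b   # (N, D)
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * temporal + (1.0 - gate) * single, gate


# Toy example: a 5-slice context window, 16 tokens per slice, 8 channels.
rng = np.random.default_rng(0)
K, N, D = 5, 16, 8
tokens = rng.normal(size=(K, N, D))
center = K // 2

single = tokens[center]
temporal = temporal_attention(tokens, center)
W = rng.normal(scale=0.1, size=(2 * D, D))  # hypothetical gate weights
b = np.zeros(D)                             # hypothetical gate bias
fused, gate = gated_fusion(single, temporal, W, b)
print(fused.shape, float(gate.min()), float(gate.max()))
```

In this sketch, a strongly negative gate bias closes the gate and recovers the single-slice features unchanged. That is one plausible way such an adapter could fall back to the base VLM's behavior where neighboring slices are uninformative.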