Representation geometry shapes task performance in vision-language modeling for CT enterography
2026-04-14 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors study how best to analyze CT enterography scans, used to assess inflammatory bowel disease, with AI models that link images and language. They find that averaging features across slices works better for disease classification, while attention-based pooling works better for matching text to images. They also find that emphasizing detailed tissue contrast, by encoding several CT value windows as color channels, beats adding more scan orientations. For report generation, retrieving related images as context helps more than training on the reports alone. The work establishes first benchmarks and practical guidance for combining image and text data in this kind of 3D medical imaging.
CT enterography • inflammatory bowel disease • vision-language transfer learning • mean pooling • attention pooling • Hounsfield Units • multi-window RGB encoding • retrieval-augmented generation • pseudolabeling • ordinal mean absolute error
Authors
Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham
Abstract
Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
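The contrast between the two aggregators can be made concrete with a minimal sketch. The function names and the fixed query vector below are illustrative stand-ins (in the paper's setting the attention parameters would be learned, and embeddings would come from a vision encoder); the sketch only shows how mean pooling weighs every slice equally while attention pooling upweights slices that score highly against a query.

```python
import math

def mean_pool(slice_embs):
    # slice_embs: list of per-slice embedding vectors (lists of floats);
    # every slice contributes equally to the volume-level embedding
    n, d = len(slice_embs), len(slice_embs[0])
    return [sum(e[j] for e in slice_embs) / n for j in range(d)]

def attention_pool(slice_embs, query):
    # score each slice by its dot product with a query vector (here a
    # fixed stand-in for learned attention parameters), softmax the
    # scores, and return the attention-weighted sum of slice embeddings
    scores = [sum(q * x for q, x in zip(query, e)) for e in slice_embs]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    d = len(slice_embs[0])
    return [sum(w * e[j] for w, e in zip(weights, slice_embs))
            for j in range(d)]
```

With three toy slice embeddings `[[1, 0], [0, 1], [1, 1]]`, mean pooling returns the uniform average, whereas attention pooling with query `[1, 0]` shifts mass toward the first and third slices, which align with the query.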
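Multi-window RGB encoding can likewise be sketched in a few lines: each Hounsfield Unit window is clipped and rescaled to [0, 1] and assigned to one channel, so a single slice carries three complementary contrast views. The specific window levels and widths below are illustrative defaults, not necessarily the paper's choices.

```python
def window(hu, level, width):
    # clip HU values to [level - width/2, level + width/2],
    # then rescale linearly to [0, 1]
    lo, hi = level - width / 2, level + width / 2
    return [min(max((v - lo) / (hi - lo), 0.0), 1.0) for v in hu]

def multi_window_rgb(hu_slice, windows=((50, 400), (60, 120), (40, 1500))):
    # one HU slice -> three channels, one per (level, width) window;
    # the (level, width) pairs here are assumptions for illustration
    return [window(hu_slice, level, width) for (level, width) in windows]
```

For example, with the first (illustrative) window at level 50 and width 400, an HU value of 50 maps to exactly 0.5 in that channel, while air at -1000 HU clips to 0.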
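The two report-generation metrics quoted above treat severity as an ordinal scale: within-1 accuracy counts predictions at most one severity grade off, and ordinal MAE averages the absolute grade distance. A minimal sketch (function names are ours, not the paper's):

```python
def ordinal_mae(pred, true):
    # mean absolute distance between predicted and true severity grades
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def within_one_accuracy(pred, true):
    # fraction of predictions within one severity grade of the truth
    return sum(abs(p - t) <= 1 for p, t in zip(pred, true)) / len(pred)
```

Unlike plain accuracy, both metrics reward near-misses, which is why a model can sit at the prevalence-matched chance level on within-1 accuracy while RAG still measurably lowers the ordinal MAE.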