Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers

2026-02-17

Computer Vision and Pattern Recognition
AI summary

The authors worked on improving how computers can tell healthy cells from cancerous ones in very large images of skin cancer tissue. They used Graph Transformer models, which treat cells and their neighbors as connected points in a graph instead of looking only at small patches of the image. This gives the models the tissue context around each cell, making them better at distinguishing healthy cells from similar-looking tumor cells. The approach outperformed traditional image-based methods both on a single large image and on multiple images from different patients.
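As a rough illustration of the graph-based representation described above, the sketch below builds a cell graph in which each detected cell becomes a node placed at its centroid and edges connect spatially neighboring cells. The paper does not prescribe this exact construction; a k-nearest-neighbor graph built with scikit-learn is assumed here purely for illustration.

```python
# Hypothetical sketch: cells as nodes, edges to their k nearest spatial neighbors.
# The actual graph construction used in the paper may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_cell_graph(centroids: np.ndarray, k: int = 5) -> np.ndarray:
    """Return an edge list of shape (2, n_cells * k) connecting each cell to its
    k nearest neighbours. `centroids` has shape (n_cells, 2) with (x, y) positions."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(centroids)
    _, indices = nn.kneighbors(centroids)        # first neighbour is the cell itself
    sources = np.repeat(np.arange(len(centroids)), k)
    targets = indices[:, 1:].reshape(-1)         # drop the self-match
    return np.stack([sources, targets])


# Toy usage: 1000 randomly placed "cells" in a 10,000 x 10,000 pixel region
edges = build_cell_graph(np.random.rand(1000, 2) * 10_000)
print(edges.shape)  # (2, 5000)
```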

Whole-slide images (WSI), Cutaneous squamous cell carcinoma (cSCC), Graph Transformer, Cell classification, Morphological features, Texture features, Patch-based representation, Balanced accuracy, Convolutional neural networks (CNN), Vision Transformers
Authors
Lucas Sancéré, Noémie Moreau, Katarzyna Bozek
Abstract
Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult for image-based approaches to differentiate. We first compared image-based and graph-based methods on a single WSI. The Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features with the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to training on multiple WSIs from multiple patients. To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.
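To make the reported evaluation protocol concrete, the sketch below computes balanced accuracy with its mean and standard error over 3-fold cross-validation, matching how the results above are stated. The actual models are the SGFormer and DIFFormer Graph Transformers; a logistic regression over per-cell node features is used here only as a hypothetical stand-in so the example stays self-contained.

```python
# Hedged sketch of the evaluation protocol: balanced accuracy, 3-fold CV,
# reported as mean +/- standard error across folds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold


def cv_balanced_accuracy(features: np.ndarray, labels: np.ndarray, n_splits: int = 3):
    scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(features, labels):
        # Placeholder classifier; the paper's models operate on the full cell graph.
        clf = LogisticRegression(max_iter=1000).fit(features[train_idx], labels[train_idx])
        preds = clf.predict(features[test_idx])
        scores.append(balanced_accuracy_score(labels[test_idx], preds))
    scores = np.asarray(scores)
    # Standard error of the mean across folds, matching the "+/-" in the abstract.
    return scores.mean(), scores.std(ddof=1) / np.sqrt(n_splits)


# Toy usage with random node features (morphology + texture would be used in practice)
X = np.random.rand(500, 16)
y = np.random.randint(0, 2, size=500)  # 0 = healthy epithelial, 1 = tumor epithelial
mean_bacc, sem = cv_balanced_accuracy(X, y)
print(f"balanced accuracy: {mean_bacc:.3f} +/- {sem:.3f}")
```

Balanced accuracy is the average of per-class recall, which keeps the metric meaningful when healthy and tumor epithelial cells are not equally represented in a slide.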