Parameter-free representations outperform single-cell foundation models on downstream benchmarks

2026-02-18

Machine Learning
AI summary

The authors looked at single-cell RNA sequencing data, which shows patterns in gene activity. While complex deep learning models like TranscriptFormer have been used to study this data, the authors tested simpler methods based on straightforward math and careful data processing. They found that these simpler approaches can perform just as well, or even better, on tasks like identifying cell types and predicting disease states, especially when dealing with new cell types or species not seen before. This suggests that basic, understandable techniques can capture important biological information without needing complicated models.

single-cell RNA sequencing, gene expression, transformer models, TranscriptFormer, linear methods, normalization, cell-type classification, out-of-distribution, foundation models, latent vector space
Authors
Huan Souza, Pankaj Mehta
Abstract
Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without computationally intensive deep-learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, and we outperform foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data.
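To make "careful normalization and linear methods" concrete, here is a minimal sketch of what a parameter-free pipeline of this kind might look like. It is not the authors' pipeline, which the abstract does not specify. Everything in it is an assumption chosen for illustration: the library-size scaling with a `target_sum` of 10,000 followed by `log1p`, the PCA embedding, the nearest-centroid classifier, all function names, and the synthetic Poisson counts used as stand-in data.

```python
import numpy as np


def normalize_counts(counts, target_sum=1e4):
    """Scale every cell to the same total count, then apply log(1 + x)."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def pca_embed(x, n_components):
    """Project mean-centered data onto its top principal components (via SVD)."""
    mean = x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    components = vt[:n_components]
    return (x - mean) @ components.T, mean, components


def fit_centroids(z, labels):
    """Compute one centroid per class in the embedding space."""
    classes = np.unique(labels)
    centroids = np.stack([z[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(z, classes, centroids):
    """Assign each cell to the class with the closest centroid."""
    sq_dist = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[sq_dist.argmin(axis=1)]


# Synthetic stand-in data (hypothetical): two "cell types" that differ
# in the expression rate of the first 10 of 50 genes.
rng = np.random.default_rng(0)
n_genes = 50
rates_a = rng.gamma(2.0, 1.0, n_genes)
rates_b = rates_a.copy()
rates_b[:10] *= 5.0
counts = np.vstack([
    rng.poisson(rates_a, (100, n_genes)),
    rng.poisson(rates_b, (100, n_genes)),
]).astype(float)
labels = np.array([0] * 100 + [1] * 100)

embedding, mean, components = pca_embed(normalize_counts(counts), n_components=5)
classes, centroids = fit_centroids(embedding, labels)
predicted = predict_nearest_centroid(embedding, classes, centroids)
accuracy = (predicted == labels).mean()
```

Every step here is either a fixed transformation or a closed-form linear operation, so nothing is trained by gradient descent. To classify new cells, one would apply `normalize_counts`, subtract the stored `mean`, project onto `components`, and call `predict_nearest_centroid`.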