BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

2026-06-03 • Machine Learning


AI summary

The authors explain that big biological datasets need smart ways to simplify and understand the data. They focus on Autoencoders (AEs), a type of AI tool that helps shrink data but is tricky to set up right. Because testing all settings takes a lot of work, researchers often use default ones that might not work well. To fix this, the authors created BBOmix, a large open resource for testing different AE setups on real biological data, helping improve and compare methods. They also check how well common measures predict actual usefulness and test several optimization approaches to set a standard for future studies.

high-throughput sequencing, omics data, Autoencoder, dimensionality reduction, unsupervised learning, hyperparameter optimization, reconstruction loss, multi-omics, benchmark, representation learning

Authors

Luca Thale-Bombien, Jan Ewald, Ralf König, Aaron Klein

Abstract

The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce $\textbf{BBOmix}$, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.
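As a rough illustration of two ideas in the abstract, the sketch below shows how a tabular benchmark turns HPO into a table lookup, and how a rank correlation between reconstruction loss and a downstream score could be computed. Everything in it is an assumption made for illustration: the `TABLE` entries, the two hyperparameters (latent dimension and learning rate), and the helpers `rank`, `spearman`, and `random_search` are hypothetical and do not reflect the actual BBOmix data, search space, or API.

```python
import random

# Hypothetical toy table (NOT real BBOmix data): each configuration
# (latent_dim, learning_rate) maps to precomputed
# (reconstruction_loss, downstream_score).
TABLE = {
    (8, 0.001): (0.52, 0.61),
    (16, 0.001): (0.41, 0.70),
    (32, 0.001): (0.33, 0.74),
    (64, 0.001): (0.29, 0.72),
    (32, 0.01): (0.38, 0.69),
    (64, 0.01): (0.31, 0.77),
}


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def random_search(table, budget, seed=0):
    """Sample `budget` configurations and select by reconstruction loss.

    Each "evaluation" is a dictionary lookup, so no model is trained.
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(table), budget)
    best = min(sampled, key=lambda cfg: table[cfg][0])
    return best, table[best]


losses = [loss for loss, _ in TABLE.values()]
scores = [score for _, score in TABLE.values()]
rho = spearman(losses, scores)

best_cfg, (best_loss, best_score) = random_search(TABLE, budget=len(TABLE))
print(f"Spearman rho (loss vs. downstream score): {rho:.3f}")
print(f"Selected by loss: {best_cfg}, downstream score {best_score}")
```

In this toy table the configuration with the lowest reconstruction loss is not the one with the highest downstream score, which is the kind of proxy mismatch the abstract raises. The numbers are invented, so they say nothing about how strong that mismatch is in the real benchmark.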
