An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
2026-04-14 • Distributed, Parallel, and Cluster Computing
AI summary
The authors describe how they built and trained Apertus, a large open-source AI language model, using Europe's Alps supercomputer. They explain the technical challenges they faced, such as fixing slow data storage and making the network connections reliable. Their work shows how a supercomputer can be adapted into a powerful AI training system for public research. They also discuss how their system can keep improving and support future AI model updates beyond this single training run.
Large Language Models • supercomputer • Apertus • multilingual AI • NVIDIA GH200 • machine learning platform • high-performance computing • model pre-training • data storage bottlenecks • fine-tuning
Authors
Jonathan Coles, Stefano Schuppli, Lukas Drescher, Fawzi Roberto Mohamed, Elia Palme, Henrique Mendonça, Miguel Gila, Mark Klein, Maxime Martinasso, Joost VandeVondele, Torsten Hoefler, Thomas Schulthess, Josh Romero, Igor Gorodetsky, Ryan Hankins, Isa Wazirzada, Martin Jaggi, Antoine Bosselut, Imanol Schlag, Antoni-Joan Solergibert i Llaquet, Alejandro Hernández Cano, Theofilos Ioannis Manitaras, Nicholas John Browning
Abstract
Large Language Models (LLMs) have surged as a transformative technology for science and society, prompting governments worldwide to pursue sovereign AI capabilities that ensure data compliance and cultural representation. However, the capital costs and engineering complexity required to train these models have largely restricted such capabilities to the private sector, leaving a significant gap for public institutions. This paper details the engineering journey behind training Apertus, a fully open multilingual foundation model, on the Alps supercomputer. In a first-of-its-kind achievement for academia at the 70B-parameter scale, we successfully deployed a massive pre-training campaign on one of Europe's largest systems for open science, powered by NVIDIA GH200 Grace Hopper Superchips. We detail the challenges encountered in readying HPC infrastructure for training AI models, from overcoming storage bottlenecks to stabilizing large-scale interconnects, and the lessons learned in transforming a supercomputer into a resilient, software-defined machine learning platform. Finally, we discuss the post-training requirements and the evolution of our machine learning platform, outlining how this initial release lays the groundwork for a sustained, iterative operational capability, in particular for fine-tuning foundation models, that extends well beyond a single model training run.