COMPASS: A Unified Decision-Intelligence System for Navigating Performance Trade-off in HPC

2026-04-24 · Performance

AI summary

The authors present COMPASS, a tool that uses operational traces from past runs to recommend configuration settings for high-performance computing (HPC) systems. Unlike existing autotuners, COMPASS can suggest the minimal changes a near-miss configuration needs to meet a performance objective, and it accounts for domain-specific constraints. It formulates tuning questions as machine learning tasks, quantifies the uncertainty of its recommendations, and, when confidence is low, indicates which configurations to run next. Integrated with an open-source HPC scheduling simulator, COMPASS cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state of the art, trains up to 100x faster and runs inference up to 80x faster than state-of-the-art generative methods, and scales to traces with 1.3B samples and 126GB of data.

High-performance computing (HPC), Configuration tuning, Autotuners, Machine learning, Operational traces, Job scheduling, Performance optimization, Uncertainty quantification, Generative methods
Authors
Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Tanzima Z. Islam, Mohammad Zaeed
Abstract
HPC systems expose many configuration parameters that jointly drive competing objectives. Existing tools such as autotuners recommend good configurations but do not identify minimal changes for a near-miss configuration to meet a performance objective, and they often ignore domain-specific constraints. To address this gap, we introduce COMPASS -- a modular, programmable engine that uses operational traces to generate HPC configuration recommendations and guide tuning decisions. This paper: (1) formalizes configuration questions into query patterns; (2) develops an interactive decision-making engine that formulates these queries as Machine Learning (ML) tasks; (3) quantifies the trustworthiness of its recommendations by providing evidence and quantifying uncertainty, and -- when confidence is low -- provides guidance on which configurations to run next. We validate COMPASS using analytical ground truth, reconstruction accuracy, reproduction of published findings, and, when possible, runs on real hardware. When integrated with an open-source HPC scheduling simulator, COMPASS cuts average job turnaround time by 65.93% and node usage by 80.93% relative to the state of the art. Moreover, COMPASS achieves up to 100x faster training and 80x faster inference than state-of-the-art generative methods, and scales to traces with 1.3B samples and 126GB of data.
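
To make the "minimal changes for a near-miss configuration" query concrete, the sketch below shows one simple way such a question could be answered from traces. It is not COMPASS's method: the abstract says COMPASS formulates queries as ML tasks, whereas this example swaps in a plain exhaustive lookup over observed runs. The function names, parameter names (`nodes`, `threads`, `block_size`), and trace values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Config = Dict[str, int]


def changed_params(a: Config, b: Config) -> List[str]:
    """Names of parameters whose values differ between two configurations."""
    return sorted(k for k in a if a[k] != b[k])


def minimal_change(
    near_miss: Config,
    traces: List[Tuple[Config, float]],
    max_runtime: float,
) -> Optional[Tuple[Config, float, List[str]]]:
    """Return the observed configuration that meets the runtime target while
    changing the fewest parameters of `near_miss`; ties go to lower runtime."""
    feasible = [
        (cfg, runtime, changed_params(near_miss, cfg))
        for cfg, runtime in traces
        if runtime <= max_runtime
    ]
    if not feasible:
        # No supporting evidence in the traces for this objective.
        return None
    return min(feasible, key=lambda item: (len(item[2]), item[1]))


# Hypothetical trace records: (configuration, measured runtime in seconds).
TRACES = [
    ({"nodes": 4, "threads": 8, "block_size": 64}, 130.0),
    ({"nodes": 4, "threads": 16, "block_size": 64}, 95.0),
    ({"nodes": 8, "threads": 16, "block_size": 128}, 60.0),
    ({"nodes": 8, "threads": 8, "block_size": 64}, 98.0),
]

NEAR_MISS = {"nodes": 4, "threads": 8, "block_size": 64}  # misses a 100 s target
result = minimal_change(NEAR_MISS, TRACES, max_runtime=100.0)
print(result)
```

With these made-up traces, the lookup prefers the run that changes only `threads` over the faster run that changes all three parameters, which is the distinction the abstract draws between recommending a good configuration and recommending a minimal change. When no observed run meets the target the function returns `None`, the case in which the abstract says COMPASS instead gives guidance on which configurations to run next.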