Exploring Novel Data Storage Approaches for Large-Scale Numerical Weather Prediction

2026-02-19

Distributed, Parallel, and Cluster Computing; Databases
AI summary

This work examines new ways to store and access large volumes of data for weather prediction and other high-performance computing (HPC) tasks. Two object storage systems, DAOS and Ceph, were tested by developing software adapters for them and comparing their performance with that of a traditional file system, Lustre, on the same hardware. The experiments showed that both object stores performed well, but DAOS scaled better and handled very large data volumes more effectively. The work also discusses the challenges of porting to these newer systems and the benefits they offer. The results are useful for HPC users considering object storage, but they do not imply that POSIX-style I/O should be abandoned.

High-Performance Computing (HPC), Numerical Weather Prediction (NWP), POSIX file system, NVMe SSD, Object storage, DAOS, Ceph, Lustre file system, I/O benchmarking, Scalability
Authors
Nicolau Manubens Gil
Abstract
Driven by scientific and industry ambition, HPC and AI applications such as operational Numerical Weather Prediction (NWP) require processing and storing ever-increasing data volumes as fast as possible. Whilst POSIX distributed file systems and NVMe SSDs are currently a common HPC storage configuration providing I/O to applications, new storage solutions have proliferated or gained traction over the last decade with the potential to address the performance limitations that POSIX file systems manifest at scale for certain I/O workloads. This work has primarily aimed to assess the suitability and performance of two object storage systems, namely DAOS and Ceph, for the ECMWF's operational NWP as well as for HPC and AI applications in general. New software-level adapters have been developed which enable the ECMWF's NWP to leverage these systems, and extensive I/O benchmarking has been conducted on a few computer systems, comparing the performance delivered by the evaluated object stores to that of equivalent Lustre file system deployments on the same hardware. Challenges of porting to object storage and its benefits with respect to the traditional POSIX I/O approach have been discussed and, where possible, domain-agnostic performance analysis has been conducted, yielding insights that are also relevant to I/O practitioners and the broader HPC community. DAOS and Ceph have both demonstrated excellent performance, but DAOS stood out relative to Ceph and Lustre, providing superior scalability and flexibility for applications to perform I/O at scale as desired. This sets a promising outlook for DAOS and object storage, which might see greater adoption at HPC centres in the years to come, although this does not necessarily imply a shift away from POSIX-like I/O.