On the Role of DAG Topology in Energy-Aware Cloud Scheduling: A GNN-Based Deep Reinforcement Learning Approach

2026-04-10

Machine Learning, Artificial Intelligence
AI summary

The authors studied how cloud services assign heterogeneous computing resources to tasks organized as workflows. They focused on a scheduler that combines graph neural networks (GNNs) with deep reinforcement learning to minimize both workflow completion time and energy use. They found that this approach can fail when the characteristics of the tasks or the environment differ substantially from what the model saw during training. The failures occur because differences in workflow structure interfere with how the GNN propagates information, which makes the scheduler less effective. Their work points out important limits of current GNN-based schedulers and suggests the need for designs that handle such changes reliably.

Cloud scheduling, Workflow DAG, Graph Neural Networks, Deep Reinforcement Learning, Out-of-distribution, Message passing, Policy generalization, Energy consumption, Completion time, Distribution shift
Authors
Anas Hattay, Fred Ngole Mboula, Eric Gascard, Zakaria Yahoun
Abstract
Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.
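To make the message-passing claim concrete, here is a minimal sketch of how the same node features can produce very different embeddings once the DAG topology changes. It is not the authors' architecture. The function name `message_passing`, the sum aggregation, the two toy graphs, and the all-ones features are illustrative assumptions. The abstract does not specify the GNN layers, the features, or the reward.

```python
def message_passing(num_nodes, edges, features, rounds=2):
    """Toy sum-aggregation message passing on a DAG (assumed scheme, not the paper's).

    Each round updates every node as h_v <- h_v + sum(h_u for predecessors u of v).
    """
    preds = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        preds[v].append(u)
    h = list(features)
    for _ in range(rounds):
        # Build the new list from the previous round's values (synchronous update).
        h = [h[v] + sum(h[u] for u in preds[v]) for v in range(num_nodes)]
    return h


# Two hypothetical 4-task workflows with identical per-task features.
features = [1.0, 1.0, 1.0, 1.0]
chain_edges = [(0, 1), (1, 2), (2, 3)]    # 0 -> 1 -> 2 -> 3
fan_in_edges = [(0, 3), (1, 3), (2, 3)]   # 0, 1, 2 all feed task 3

chain_embeddings = message_passing(4, chain_edges, features)
fan_in_embeddings = message_passing(4, fan_in_edges, features)

print(chain_embeddings)   # [1.0, 3.0, 4.0, 4.0]
print(fan_in_embeddings)  # [1.0, 1.0, 1.0, 7.0]
```

In this toy setting the sink task's embedding is 4.0 in the chain and 7.0 in the fan-in graph, although the raw features are identical. A policy head trained only on chain-like workflows would therefore see input values it never encountered during training, which is one simple way a structural mismatch could translate into degraded decisions.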