Trace Replay Simulation of MIT SuperCloud for Studying Optimal Sustainability Policies
Wesley Brewer, Matthias Maiterth, Damien Fay
Published: 2025/9/20
Abstract
The rapid growth of AI supercomputing is creating unprecedented power demands, with next-generation GPU datacenters requiring hundreds of megawatts and producing fast, large swings in consumption. To address the resulting challenges for utilities and system operators, we extend ExaDigiT, an open-source digital twin framework for modeling power, cooling, and scheduling of supercomputers. Originally developed for replaying traces from leadership-class HPC systems, ExaDigiT now incorporates heterogeneity, multi-tenancy, and cloud-scale workloads. In this work, we focus on trace replay and rescheduling of jobs on the MIT SuperCloud TX-GAIA system to enable reinforcement learning (RL)-based experimentation with sustainability policies. The RAPS module provides a simulation environment with detailed power and performance statistics, supporting the study of scheduling strategies, incentive structures, and hardware/software prototyping. Preliminary RL experiments using Proximal Policy Optimization demonstrate the feasibility of learning energy-aware scheduling decisions, highlighting ExaDigiT's potential as a platform for exploring optimal policies to improve throughput, efficiency, and sustainability.