Making congestion control robust to per-packet load balancing in datacenters
Barak Gerstein, Mark Silberstein, Isaac Keslassy
Published: 2025/9/9
Abstract
Per-packet load-balancing approaches are increasingly deployed in datacenter networks. However, their combination with existing congestion control algorithms (CCAs) may lead to poor performance, and even state-of-the-art CCAs can collapse due to duplicate ACKs. A typical approach to handling this collapse is to make CCAs resilient to duplicate ACKs. In this paper, we first model the throughput collapse of a wide array of CCAs when some of the paths are congested. We show that addressing duplicate ACKs is insufficient. Instead, we explain that since CCAs are typically designed for single-path routing, their estimation function focuses on the latest feedback and mishandles feedback that reflects multiple paths. We propose using median feedback, which is more robust to the varying signals that come with multiple paths. We introduce MSwift, which applies this principle to make Google's Swift robust to multi-path routing while keeping its incast tolerance and single-path performance. Finally, we demonstrate that MSwift improves the 99th-percentile flow completion time (FCT) by up to 25% under both random packet spraying and adaptive routing.
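
To illustrate why median feedback may be more robust than latest-sample feedback under packet spraying, the following minimal sketch compares the two on a hypothetical delay trace. It is not MSwift's actual algorithm: the class name `MedianDelayEstimator`, the window size of 5, and the RTT values are all assumptions chosen for illustration, with most packets taking uncongested paths (about 20 µs) and a few taking a congested path (about 90 µs).

```python
from collections import deque
from statistics import median


class MedianDelayEstimator:
    """Hypothetical sliding-window median over recent RTT samples."""

    def __init__(self, window=5):
        # Keep only the most recent `window` samples (window size is an assumption).
        self.samples = deque(maxlen=window)

    def update(self, rtt_us):
        # Record the new sample and return the median of the current window.
        self.samples.append(rtt_us)
        return median(self.samples)


# Hypothetical per-ACK RTT samples in microseconds: packets sprayed over
# several paths, one of which is congested (the 95 and 90 samples).
rtts = [20, 21, 95, 20, 22, 90, 21, 20]

estimator = MedianDelayEstimator(window=5)
latest_signal = []   # what a latest-feedback estimator would react to
median_signal = []   # what a median-feedback estimator would react to
for rtt in rtts:
    latest_signal.append(rtt)
    median_signal.append(estimator.update(rtt))

print("latest:", latest_signal)
print("median:", median_signal)
```

In this trace the latest-sample signal jumps to 95 µs whenever an ACK returns from the congested path, whereas the windowed median stays at or below 22 µs, so a sender driven by the median would not cut its rate in response to a minority of slow paths.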