MaaSO: SLO-aware Orchestration of Heterogeneous Model Instances for MaaS

Mo Xuan, Zhang yue, Wu Weigang

公開日: 2025/9/8

Abstract

Model-as-a-Service (MaaS) platforms face diverse Service Level Objective (SLO) requirements stemming from various large language model (LLM) applications, manifested in contextual complexity, first-token latency, and between-token latency. On the other hand, an LLM instance, when configured with different parallelism strategies and inference batch sizes, exhibits distinct performance characteristics and can thus be used to serve different SLO requirements. However, current LLM inference systems typically deploy instances of the same model with identical configurations, lacking mechanisms to leverage such heterogeneity. To fill this research gap, we propose MaaSO, the first MaaS Orchestrator, which comprises three modules: (1) a profiler characterizing instance performance under diverse parallelism strategies and inference batch sizes; (2) a placer optimizing heterogeneous instance configurations; (3) a distributor enabling SLO-aware request distribution and preventing cascaded timeouts in continuous batching. Experiments show that MaaSO improves the SLO satisfaction ratio by 15 to 30% and reduces response latency by 40 to 60% compared to existing approaches, and significantly lowers overall orchestration overhead.

全文を読む (arXiv.org)