Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark

Siu Hang Ho, Prasad Ganesan, Nguyen Duong, Daniel Schlabig

Published: 2025/9/22

Abstract

Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. While increased complexity often improves accuracy, it also raises compute cost, latency, and memory requirements. This work investigates techniques such as pruning, quantization, knowledge distillation, and simplified attention mechanisms to reduce computational overhead without degrading model quality. The study also explores the Mixture of Experts (MoE) approach as a further avenue for efficiency. Together, these experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.
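As a minimal sketch of one technique the abstract surveys, the example below applies post-training dynamic quantization in PyTorch, which stores weights in int8 and quantizes activations on the fly at inference time. The FeedForward block and all sizes here are illustrative stand-ins, not the paper's fast-DiT model.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """A transformer-style position-wise feed-forward block (toy stand-in)."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = FeedForward().eval()

# Replace the Linear layers with dynamically quantized versions:
# weights are stored as int8, activations are quantized at run time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compare full-precision and quantized outputs on a dummy batch.
x = torch.randn(2, 16, 256)  # (batch, sequence length, d_model)
with torch.no_grad():
    err = (model(x) - quantized(x)).abs().max().item()
print(f"max abs deviation after int8 quantization: {err:.4f}")

Dynamic quantization is the lightest-weight of the surveyed techniques to apply, since it needs no calibration data or retraining; pruning and knowledge distillation trade more setup effort for potentially larger savings.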
