Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark
Siu Hang Ho, Prasad Ganesan, Nguyen Duong, Daniel Schlabig
Published: 2025/9/22
Abstract
Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. While increased complexity often improves accuracy, it also raises compute cost, latency, and memory requirements. This work investigates pruning, quantization, knowledge distillation, and simplified attention as techniques for reducing computational overhead without significantly degrading performance, and additionally explores a Mixture of Experts (MoE) approach to further improve efficiency. The experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.
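The abstract does not include implementation details, so as one concrete illustration of the techniques it names, the sketch below applies PyTorch's post-training dynamic quantization to a toy transformer block. The block, its dimensions, and the quantization settings are assumptions chosen for illustration; they are not the paper's actual fast-DiT configuration.

```python
import torch
import torch.nn as nn

# A small stand-in transformer block. The real fast-DiT architecture is not
# described in the abstract, so this module is purely illustrative.
class ToyBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(x + attn_out)

model = ToyBlock().eval()

# Post-training dynamic quantization: weights of nn.Linear layers are stored
# in int8 and dequantized on the fly, reducing memory footprint and often
# CPU latency, with no retraining required.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Quick smoke test: the quantized block accepts the same inputs.
x = torch.randn(1, 16, 256)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 16, 256])
```

Dynamic quantization is the lowest-effort variant of the quantization family the abstract mentions; static quantization and quantization-aware training trade more calibration or training effort for better accuracy retention.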