Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu

公開日: 2025/1/21

Abstract

This study presents a comprehensive multi-level analysis of the NVIDIA Hopper GPU architecture, focusing on its performance characteristics and novel features. We benchmark Hopper's memory subsystem, highlighting improvements in the L2 partitioned cache and global memory access compared to Ampere and Ada Lovelace. The evaluation of Hopper's fourth-generation tensor cores reveals the benefits of FP8 precision and asynchronous wgmma instructions for matrix operations. Additionally, we investigate the performance of DPX instructions for dynamic programming, distributed shared memory (DSM) for inter-SM communication, and the Tensor Memory Accelerator (TMA) for asynchronous data movement. Through multi-level evaluation, we discover that the Hopper architecture demonstrates significant acceleration potential in real-world applications. For instance, the asynchronous programming model supported by TMA achieves a 1.5x speedup in matrix multiplication, FP8 delivers nearly double the performance of FP16, and DPX instructions accelerate a computational biology algorithm by at least 4.75x. Our findings provide actionable insights for optimizing compute-intensive workloads, from AI training to bioinformatics, on Hopper GPUs.

全文を読む (arXiv.org)