DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme
Daichi Mukunoki
Published: 2025/8/1
Abstract
As the demand for AI computation grows rapidly, more hardware is being developed to efficiently perform the low-precision matrix multiplications that such workloads require. However, these operations are generally not directly applicable to scientific computing because of its accuracy requirements. The Ozaki scheme, an accurate matrix multiplication method proposed by Ozaki et al. in 2012, enables FP64 matrix multiplication (DGEMM) using low-precision matrix multiplication units such as FP16 Tensor Cores. The approach has since been extended to integer arithmetic, which offers a lower computational cost than floating-point-based implementations; indeed, it has achieved higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores designed for AI workloads. Recent trends in AI-oriented processors, however, have shifted toward improving the performance of low-precision floating-point operations, such as FP8, rather than integer operations. Motivated by this shift, this study revisits the use of low-precision floating-point operations in the Ozaki scheme. Specifically, we explore the use of FP8 Tensor Cores. In addition, for processors with very slow or no hardware FP64 support, we consider emulating FP64 arithmetic with integer arithmetic, which eliminates hardware FP64 instructions entirely. Furthermore, we explore blocking in the inner-product dimension to accelerate FP16-based implementations. We demonstrate the effectiveness of these methods through a performance evaluation on an NVIDIA GPU based on the RTX Blackwell architecture.
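For context, the core idea of the Ozaki scheme can be sketched as follows; this is a simplified illustration, with the slice count s assumed for exposition rather than taken from the paper. Each FP64 input matrix is split into a sum of slices whose entries carry only a limited number of significand bits:

A = \sum_{p=1}^{s} A^{(p)}, \qquad B = \sum_{q=1}^{s} B^{(q)}, \qquad AB = \sum_{p=1}^{s} \sum_{q=1}^{s} A^{(p)} B^{(q)}.

Because each slice holds few enough significand bits, every pairwise product A^{(p)} B^{(q)} can be computed without rounding error on a low-precision matrix multiplication unit, and the partial products are then accumulated at the target precision (here, FP64, or FP64 emulated with integer arithmetic). In practice, only the slice pairs that contribute to the desired accuracy need be computed.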