A Study of Efficiency in GPU Scalability for Artificial Intelligence Training

David Cortés, Carlos Juiz, Belén Bermejo

Published: 2025/9/3

Abstract

Training large-scale deep learning models has become a key challenge for both the scientific community and industry. While using GPUs at massive scale can significantly reduce training times, this approach degrades efficiency. In this article, we present a detailed analysis of the training times reported by MLPerf Training v4.1 for four workloads: BERT, Llama2 LoRA, RetinaNet, and Stable Diffusion, showing that certain configurations optimise the trade-off between performance, GPU usage, and efficiency. The results reveal a break-even point at which training time can be reduced while efficiency remains close to its maximum.
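As a concrete illustration of the metrics involved, the sketch below computes speedup and parallel efficiency from a set of (GPU count, training time) measurements, using the standard definitions S(n) = T(n0)/T(n) and E(n) = S(n)·n0/n. The numbers and the efficiency-floor break-even criterion are illustrative assumptions, not the paper's data or method.

```python
# Minimal sketch (not the authors' code): speedup and parallel efficiency
# computed from per-configuration training times, using the standard
# definitions S(n) = T(n0)/T(n) and E(n) = S(n) * n0/n.
# The (GPU count -> hours) pairs below are illustrative placeholders,
# NOT values taken from MLPerf Training v4.1.

measurements = {8: 10.0, 16: 5.6, 32: 3.2, 64: 2.1, 128: 1.6}

n0 = min(measurements)        # baseline: smallest GPU count measured
t0 = measurements[n0]         # baseline training time

for n, t in sorted(measurements.items()):
    speedup = t0 / t                  # gain over the baseline configuration
    efficiency = speedup * n0 / n     # fraction of ideal linear scaling
    print(f"{n:4d} GPUs: time={t:5.2f} h  "
          f"speedup={speedup:5.2f}  efficiency={efficiency:5.2f}")

# One simple break-even criterion (an assumption, not the paper's method):
# the largest configuration whose efficiency stays above a chosen floor,
# i.e. the point where adding GPUs still cuts time without wasting them.
threshold = 0.75
best = max(n for n, t in measurements.items() if (t0 / t) * n0 / n >= threshold)
print(f"Largest configuration with efficiency >= {threshold}: {best} GPUs")
```

Under these placeholder numbers, efficiency falls below the floor beyond 32 GPUs, so larger configurations would shorten wall-clock time only at a disproportionate resource cost.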