Modeling Spatially Correlated Failure-time Data Under Two Distance Functions with an Application to Titan GPU Data

Jared M. Clark, Jie Min, Yueyao Wang, Yili Hong, George Ostrouchov

Published: 2025/9/5

Abstract

One common approach to statistical analysis of spatially correlated data relies on defining a correlation structure based solely on unknown parameters and the physical distance between the locations of observed values. However, some data have a complex spatial structure that cannot be adequately described by physical distance alone. The spatial failure-time data considered in this work contain information on GPUs connected through a network fabric topology that differs from their physical layout and is expected to introduce additional correlations. The proposed lifetime regression model includes random effects capturing the dependency due to physical location as well as random effects explaining the dependency due to logical connections between GPUs. The analysis of this GPU dataset serves as an example of models with multiple spatial random effects, and the ideas presented can be extended to other applications with complex spatial structures. A Bayesian modeling scheme is recommended for this class of analyses. The examples in this work use the software package Stan to produce Markov chain Monte Carlo draws for parameter estimation. The modeling approach is validated through simulation, which demonstrates accuracy in statistical inference. We also apply the developed framework to the large-scale Titan GPU failure-time data.
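
To make the idea of two spatial random effects concrete, the following is a minimal simulation sketch and not the authors' implementation. It assumes an exponential covariance for both effects, a square grid as the physical layout, a ring with hop-count distance as the logical topology, and a lognormal lifetime with a single covariate. The function names and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def exp_cov(dist, sigma2, phi):
    """Exponential covariance (assumed form): sigma2 * exp(-dist / phi)."""
    return sigma2 * np.exp(-dist / phi)


def simulate_log_lifetimes(n_side=4, seed=0):
    """Simulate log-lifetimes with two additive spatial random effects."""
    rng = np.random.default_rng(seed)
    n = n_side * n_side

    # Physical layout (assumption): units placed on an n_side x n_side grid,
    # with Euclidean distance between grid positions.
    xy = np.array([(i, j) for i in range(n_side) for j in range(n_side)],
                  dtype=float)
    d_phys = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

    # Logical layout (assumption): units wired in a ring in shuffled order,
    # so logical neighbors need not be physical neighbors.
    # Logical distance is the hop count around the ring.
    pos = rng.permutation(n)
    gap = np.abs(pos[:, None] - pos[None, :])
    d_log = np.minimum(gap, n - gap).astype(float)

    # One covariance matrix per distance function (illustrative values).
    cov_u = exp_cov(d_phys, sigma2=0.5, phi=2.0)
    cov_v = exp_cov(d_log, sigma2=0.3, phi=3.0)

    # Draw each random effect via a Cholesky factor; the tiny jitter
    # guards against round-off in the factorization.
    jitter = 1e-10 * np.eye(n)
    u = np.linalg.cholesky(cov_u + jitter) @ rng.standard_normal(n)
    v = np.linalg.cholesky(cov_v + jitter) @ rng.standard_normal(n)

    # Log-lifetime = intercept + covariate effect + both random effects
    # + independent noise (lognormal lifetime, an assumption).
    x = rng.standard_normal(n)
    log_t = 1.0 + 0.5 * x + u + v + 0.2 * rng.standard_normal(n)
    return d_phys, d_log, cov_u, cov_v, log_t
```

In a Bayesian fit, the variances and range parameters of both covariance matrices would be given priors and estimated jointly with the regression coefficients; the sketch only shows how the two distance matrices enter the model side by side.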
