Optimizing the Variant Calling Pipeline Execution on Human Genomes Using GPU-Enabled Machines
Ajay Kumar, Praveen Rao, Peter Sanders
Published: 2025/9/10
Abstract
Variant calling is the first step in analyzing a human genome and aims to detect variants in an individual's genome compared to a reference genome. Due to the computationally-intensive nature of variant calling, genomic data are increasingly processed in cloud environments as large amounts of compute and storage resources can be acquired with the pay-as-you-go pricing model. In this paper, we address the problem of efficiently executing a variant calling pipeline for a workload of human genomes on graphics processing unit (GPU)-enabled machines. We propose a novel machine learning (ML)-based approach for optimizing the workload execution to minimize the total execution time. Our approach encompasses two key techniques: The first technique employs ML to predict the execution times of different stages in a variant calling pipeline based on the characteristics of a genome sequence. Using the predicted times, the second technique generates optimal execution plans for the machines by drawing inspiration from the flexible job shop scheduling problem. The plans are executed via careful synchronization across different machines. We evaluated our approach on a workload of publicly available genome sequences using a testbed with different types of GPU hardware. We observed that our approach was effective in predicting the execution times of variant calling pipeline stages using ML on features such as sequence size, read quality, percentage of duplicate reads, and average read length. In addition, our approach achieved 2X speedup (on an average) over a greedy approach that also used ML for predicting the execution times on the tested workload of sequences. Finally, our approach achieved 1.6X speedup (on an average) over a dynamic approach that executed the workload based on availability of resources without using any ML-based time predictions.