Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models

Akalanka Galappaththi, Francisco Ribeiro, Sarah Nadi

公開日: 2025/9/29

Abstract

Data science libraries, such as scikit-learn and pandas, specialize in processing and manipulating data. The data-centric nature of these libraries makes the detection of API misuse in them more challenging. This paper introduces DSCHECKER, an LLM-based approach designed for detecting and fixing API misuses of data science libraries. We identify two key pieces of information, API directives and data information, that may be beneficial for API misuse detection and fixing. Using three LLMs and misuses from five data science libraries, we experiment with various prompts. We find that incorporating API directives and data-specific details enhances Dschecker's ability to detect and fix API misuses, with the best-performing model achieving a detection F1-score of 61.18 percent and fixing 51.28 percent of the misuses. Building on these results, we implement Dschecker agent which includes an adaptive function calling mechanism to access information on demand, simulating a real-world setting where information about the misuse is unknown in advance. We find that Dschecker agent achieves 48.65 percent detection F1-score and fixes 39.47 percent of the misuses, demonstrating the promise of LLM-based API misuse detection and fixing in real-world scenarios.

Detecting and Fixing API Misuses of Data Science Libraries Using Large Language Models | SummarXiv | SummarXiv