Is Reuse All You Need? A Systematic Comparison of Regular Expression Composition Strategies
Berk Çakar, Charles M. Sale, Sophie Chen, Dongyoon Lee, James C. Davis
公開日: 2025/3/26
Abstract
Composing regexes is a common but challenging engineering activity. Software engineers struggle with regex complexity, leading to defects, performance issues, and security vulnerabilities. Researchers have proposed tools to synthesize regexes automatically, and recent advances in LLMs have also shown promise in generating regexes. Meanwhile, developers commonly reuse existing regexes from codebases and internet sources. No work to date has compared these various regex composition strategies, leaving software engineers unaware about which to use and researchers uncertain about open problems. We address this gap through a systematic evaluation of regex reuse, formal synthesis, and LLM-based generation strategies. We curate a novel dataset of 901,516 regexes mined from open-source software projects and internet sources (RegexReuseDB), accompanied by a set of 55,448 regex composition tasks defined by a target regex and its corresponding positive and negative string pairs (RegexCompBench). To address the absence of an automated regex reuse formulation, we design and implement reuse-by-example, the first programming by example approach that leverages RegexReuseDB. Our evaluation then benchmarks reuse-by-example, formal synthesizers, and LLMs on many aspects of interest to software engineers, including accuracy, maintainability, computational efficiency, and result diversity. Although all three approaches solve most composition tasks accurately, only reuse-by-example and LLMs excel over the range of metrics we applied, and reuse-by-example in particular offers engineers the variance in candidates that they say they find helpful. Ceteris paribus, prefer the cheaper solution--for regex composition, perhaps reuse is all you need. Our findings provide insights for developers selecting regex composition strategies and inform the design of tools to improve regex reliability in software systems.