CP-Bench: Evaluating Large Language Models for Constraint Modelling
Kostis Michailidis, Dimos Tsouros, Tias Guns
Published: 2025/6/6
Abstract
Constraint Programming (CP) is widely used to solve combinatorial problems, but its core process, namely constraint modelling, requires significant expertise and is considered a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored the use of Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, existing evaluation datasets for constraint modelling are often limited to small, homogeneous, or domain-specific instances, which do not capture the diversity of real-world scenarios. This work addresses that gap by introducing CP-Bench, a novel benchmark comprising a diverse set of well-known combinatorial problems sourced from the CP community, structured explicitly for evaluating LLM-driven CP modelling. Using this dataset, and given the variety of constraint modelling frameworks, we evaluate and compare the modelling capabilities of LLMs across three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 70% on this highly challenging benchmark.
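To give a sense of the modelling task the benchmark evaluates, below is a minimal illustrative sketch (not taken from the paper) of the kind of executable constraint model an LLM would be asked to produce from a natural-language description. It assumes CPMpy as an example of a high-level Python-based modelling library; the specific problem (three distinct digits summing to 15) is invented purely for illustration.

```python
# Illustrative sketch, not from CP-Bench itself: a small constraint model in
# CPMpy, a high-level Python-based constraint modelling library.
# Hypothetical task: choose three pairwise-distinct digits in 1..9 that sum to 15.
import cpmpy as cp

x = cp.intvar(1, 9, shape=3, name="x")   # three integer decision variables in 1..9

model = cp.Model(
    cp.AllDifferent(x),                  # global constraint: all values must differ
    cp.sum(x) == 15,                     # linear constraint on the total
)

if model.solve():                        # delegates to an underlying CP/SAT solver
    print(x.value())                     # one satisfying assignment, e.g. [1 5 9]
```

In an LLM-driven modelling pipeline, the model would generate code of this form from a textual problem description, after which it can be executed and its solutions checked against known ground truth.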