Code Generation with Small Language Models: A Codeforces-Based Study

Débora Souza, Rohit Gheyi, Lucas Albuquerque, Gustavo Soares, Márcio Ribeiro

Published: 2025/4/9

Abstract

Large Language Models (LLMs) demonstrate strong capabilities in code generation, potentially boosting developer productivity. However, their adoption remains limited by high computational costs, among other factors. Small Language Models (SLMs) offer a lightweight alternative. While LLMs have been evaluated on competitive programming tasks, prior work often emphasizes aggregate metrics such as Elo ratings or pass rates and neglects failure analysis; the potential of SLMs in this space remains underexplored. In this study, we benchmark three open SLMs - Llama-3.2-3B, Gemma-3-12B, and Phi-4-14B - on 280 Codeforces problems spanning Elo ratings from 800 to 2100 and covering 36 distinct topics. All models were tasked with generating Python solutions. Phi-4-14B achieved the best SLM performance with a pass@3 of 63.6%, approaching that of o3-mini-high (86.8%). Combining its Python and C++ outputs raised Phi-4-14B's pass@6 to 73.6%. A qualitative analysis revealed that some failures stemmed from minor implementation issues rather than flaws in reasoning.
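
For reference, pass@k figures like those above are commonly computed with the unbiased estimator of Chen et al. (2021): pass@k = E[1 - C(n-c, k) / C(n, k)], where n is the number of samples generated per problem and c the number that pass all tests. The abstract does not state the exact procedure, so the sketch below is an assumption that the study follows this standard definition; the (n, c) sample counts are illustrative, not the paper's data.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples drawn without
    # replacement from n generations (c of them correct) passes.
    # math.comb(m, k) returns 0 when m < k, so no special case is needed.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative (n, c) pairs: 3 Python + 3 C++ samples pooled per problem,
# so n = 6; pass@3 and pass@6 are averaged over problems.
samples = [(6, 1), (6, 0), (6, 4)]
for k in (3, 6):
    score = sum(pass_at_k(n, c, k) for n, c in samples) / len(samples)
    print(f"pass@{k} = {score:.3f}")

Note that when n equals k (e.g., three Python samples scored as pass@3), the estimator reduces to the fraction of problems with at least one passing sample, and pooling outputs from both languages (n = 6) can only raise the score, consistent with the pass@6 improvement reported above.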
