BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida

Published: 2025/8/29

Abstract

With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset that now includes the 2024-2025 exams and image captions generated automatically by state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than double the number in the original BLUEX. We evaluate commercial and open-source LLMs and their ability to leverage visual context through captions.
