Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?

Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt

Published: 2025/6/20

Abstract

Inference-time techniques such as decoding-time scaling and self-refinement have been shown to substantially improve reasoning in large language models (LLMs), driven by emergent self-correction and self-verification behaviors often elicited through reinforcement learning (RL). In this work, we investigate whether these inference-time scaling methods similarly benefit vision-language models (VLMs), especially those fine-tuned with RL. Through extensive evaluation, we find that while strategies like majority vote and best-of-N with self-verification enhance VLM performance, majority vote significantly outperforms verification-centric ones. Furthermore, inference time scaling behaviors commonly associated with RL-tuned models, such as the 'A-ha moment,' do not yield consistent performance gains. Our analysis identifies a key limitation: current RL-trained VLMs exhibit weak self-verification across both visual and textual modalities, limiting the effectiveness of inference-time scaling.

Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? | SummarXiv | SummarXiv