Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
Dasol Choi, Seunghyun Lee, Youngsook Song
Published: 2025/5/21
Abstract
Vision-Language Models (VLMs) have shown capabilities in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired with a visually similar but safe counterpart through human verification. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models. This "better-safe-than-sorry" bias stems from contextual overinterpretation (88-98% of errors). Both synthetic and real-world datasets confirm these systematic patterns, challenging VLM reliability in safety-critical applications. Addressing this requires enhanced contextual reasoning in ambiguous visual situations.