Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Federico Tavella, Amber Drinkwater, Angelo Cangelosi

公開日: 2025/6/24

Abstract

Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.

全文を読む (arXiv.org)