Seeing the Undefined: Chain-of-Action for Generative Semantic Labels

Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu

Published: 2024/11/26

Abstract

Recent advances in vision-language models (VLMs) have demonstrated remarkable capabilities in image classification by leveraging predefined sets of labels to construct text prompts for zero-shot reasoning. However, these approaches face significant limitations in undefined domains, where the label space is vocabulary-unknown and composite. We therefore introduce Generative Semantic Labels (GSLs), a novel task that aims to predict a comprehensive set of semantic labels for an image without being constrained by a predefined label set. Unlike traditional zero-shot classification, GSLs requires generating multiple semantic-level labels, encompassing objects, scenes, attributes, and relationships, thereby providing a richer and more accurate representation of image content. In this paper, we propose Chain-of-Action (CoA), an innovative method designed to tackle the GSLs task. CoA is motivated by the observation that enriched contextual information significantly improves generative performance during inference. Specifically, CoA decomposes the GSLs task into a sequence of detailed actions. Each action extracts and merges key information from the previous step, passing enriched context to the next, ultimately guiding the VLM to generate comprehensive and accurate semantic labels. We evaluate the effectiveness of CoA through extensive experiments on widely used benchmark datasets. The results demonstrate significant improvements across key performance metrics, validating the capability of CoA to generate accurate and contextually rich semantic labels. Our work not only advances the state of the art in generative semantic labels but also opens new avenues for applying VLMs in open-ended and dynamic real-world scenarios.
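The sketch below illustrates the sequential-action idea described in the abstract: each action queries a VLM, merges its output into a running context, and passes that enriched context to the next action. It is a minimal, hypothetical illustration, not the paper's implementation; the `vlm_generate` callable, the action prompts, and the comma-separated label parsing are all assumptions made for the example.

```python
# Minimal sketch of a chain-of-action style pipeline, assuming a user-supplied
# VLM wrapper. The specific actions and merge rule are illustrative only.

from typing import Callable, List


def chain_of_actions(
    image,
    vlm_generate: Callable[[object, str], str],
    action_prompts: List[str],
) -> List[str]:
    """Run a sequence of actions, each enriching the context for the next."""
    context = ""
    for prompt in action_prompts:
        # Each action sees the image plus the context accumulated so far.
        step_output = vlm_generate(image, f"{prompt}\nContext: {context}".strip())
        # Merge this step's key information into the running context.
        context = f"{context} {step_output}".strip()
    # Parse the final context into a set of semantic labels
    # (objects, scenes, attributes, relationships).
    return [label.strip() for label in context.split(",") if label.strip()]


# Example usage with hypothetical action prompts:
# labels = chain_of_actions(
#     image,
#     vlm_generate=my_vlm_call,  # wraps any VLM's generation API
#     action_prompts=[
#         "List the main objects in the image.",
#         "Describe the scene and notable attributes.",
#         "State relationships between the objects.",
#         "Summarize everything as a comma-separated list of semantic labels.",
#     ],
# )
```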