MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap

Published: 2025/9/16

Abstract

Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose a zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and build a structured keyword prompt, which guides a Large Language Model (LLM) in caption generation. Unlike plain greedy decoding, our method refines token selection with the audio CLIP model at each step, keeping the generated caption aligned with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. Performance is heavily influenced by the choice of audio-text matching model and keyword selection: the best results are achieved with a single-keyword prompt, and performance drops by 50% when no keyword list is used.
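To make the decoding idea concrete, below is a minimal, hypothetical sketch of MAGIC-style audio-guided decoding, not the authors' implementation. It assumes two callables that the real system would supply: `lm_step`, returning next-token log-probabilities from an LLM given the current text (prompt plus keywords), and `audio_sim`, an audio-text similarity score from an audio CLIP-style model such as WavCaps. At each step the language model's top-k candidates are re-ranked by adding a weighted audio-similarity term, so tokens that better match the audio are preferred over the plain greedy choice. The names, weights, and the toy demo scorers are illustrative assumptions.

```python
# Hypothetical sketch of MAGIC-style audio-guided decoding (not the authors' code).
import math
from typing import Callable, Dict


def magic_decode(
    prompt: str,
    lm_step: Callable[[str], Dict[str, float]],  # token -> log-prob given the prefix
    audio_sim: Callable[[str], float],           # audio-text similarity of a candidate caption
    beta: float = 2.0,                           # weight of the audio-guidance term (assumed)
    top_k: int = 5,
    max_len: int = 20,
    eos: str = "<eos>",
) -> str:
    """Greedy decoding re-ranked at each step by audio-text similarity."""
    text = prompt
    for _ in range(max_len):
        logprobs = lm_step(text)
        # Keep the k most probable tokens under the language model.
        candidates = sorted(logprobs, key=logprobs.get, reverse=True)[:top_k]
        # Re-rank: LM confidence plus audio grounding of the extended caption.
        best = max(candidates, key=lambda tok: logprobs[tok] + beta * audio_sim(text + " " + tok))
        if best == eos:
            break
        text += " " + best
    return text[len(prompt):].strip()


# Toy demo with dummy scorers; real usage would wrap an LLM and an audio CLIP model.
if __name__ == "__main__":
    vocab = ["dog", "barks", "cat", "loudly", "<eos>"]

    def dummy_lm(prefix: str) -> Dict[str, float]:
        # Uniform language model: every token equally likely.
        return {t: math.log(1.0 / len(vocab)) for t in vocab}

    def dummy_audio_sim(text: str) -> float:
        # Pretend the audio clip contains a barking dog; penalize off-topic and repeated words.
        words = text.split()
        keyword_score = sum(w in words for w in ("dog", "barks", "loudly"))
        repetition = len(words) - len(set(words))
        return keyword_score - 0.5 * ("cat" in words) - repetition

    print(magic_decode("Describe the sound:", dummy_lm, dummy_audio_sim))
    # -> "dog barks loudly"
```

In the paper's setting, the prompt passed to `magic_decode` would additionally carry the keyword(s) retrieved via the audio CLIP model (the abstract reports a single keyword works best), and `audio_sim` would be the same audio-text matching model used for keyword retrieval.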