Detecting and Interpreting NSFW Prompts in Text-to-Image Models through Uncovering Harmful Semantics
Yiming Wang, Jiahao Chen, Qingming Li, Tong Zhang, Rui Zeng, Xing Yang, Shouling Ji
Published: 2024/12/24
Abstract
As text-to-image (T2I) models advance and gain widespread adoption, their associated safety concerns are becoming increasingly critical. Malicious users exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, underscoring the need for effective safeguards to ensure the integrity and compliance of model outputs. However, existing detection methods often suffer from low accuracy and poor efficiency. In this paper, we propose HiddenGuard, an interpretable defense framework that leverages the hidden states of T2I models to detect NSFW prompts. HiddenGuard extracts NSFW features from the hidden states of the model's text encoder and exploits the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. HiddenGuard also offers real-time interpretation of its results and supports optimization through data augmentation techniques. Our extensive experiments show that HiddenGuard significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets while greatly improving computational efficiency.
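To make the core idea concrete, the following is a minimal sketch of a hidden-state probe in the spirit described above: extract hidden states from a T2I model's text encoder and fit a simple linear classifier on them. The encoder checkpoint (`openai/clip-vit-large-patch14`, the text encoder used by Stable Diffusion v1.x), the layer choice, the mean pooling, and the logistic-regression probe are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: probe a T2I text encoder's hidden states for
# NSFW-prompt detection. Details are assumptions, not HiddenGuard itself.
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from sklearn.linear_model import LogisticRegression

# Stable Diffusion v1.x uses this CLIP text encoder (assumed target here).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def hidden_features(prompts, layer=-1):
    """Mean-pooled hidden states of one encoder layer for each prompt."""
    tokens = tokenizer(prompts, padding=True, truncation=True,
                       return_tensors="pt")
    out = encoder(**tokens, output_hidden_states=True)
    # out.hidden_states is a tuple of (batch, seq_len, dim) tensors,
    # one per layer; pool over the sequence dimension.
    return out.hidden_states[layer].mean(dim=1).cpu().numpy()

# Toy labels (0 = benign, 1 = harmful placeholder); a real probe would be
# trained on a curated prompt dataset.
X = hidden_features(["a cat sleeping on a sofa", "<harmful prompt here>"])
y = [0, 1]

# If NSFW features are linearly separable in the hidden space, even a
# simple linear probe suffices, and inference cost is near zero.
probe = LogisticRegression().fit(X, y)
print(probe.predict(hidden_features(["a dog playing in a park"])))
```

Because the probe reuses hidden states the T2I pipeline already computes during text encoding, detection adds almost no inference overhead, which is consistent with the efficiency claim in the abstract.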