TruthLens: Visual Grounding for Universal DeepFake Reasoning

Rohit Kundu, Shan Jia, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Published: 2025/3/20

Abstract

Detecting DeepFakes has become a crucial research area as the widespread use of AI image generators enables the effortless creation of face-manipulated and fully synthetic content, while existing methods are often limited to binary classification (real vs. fake) and lack interpretability. To address these challenges, we propose TruthLens, a novel, unified, and highly generalizable framework that goes beyond traditional binary classification, providing detailed, textual reasoning for its predictions. Distinct from conventional methods, TruthLens performs MLLM grounding. TruthLens uses a task-driven representation integration strategy that unites global semantic context from a multimodal large language model (MLLM) with region-specific forensic cues through explicit cross-modal adaptation of a vision-only model. This enables nuanced, region-grounded reasoning for both face-manipulated and fully synthetic content, and supports fine-grained queries such as "Does the eyes/nose/mouth look real or fake?"- capabilities beyond pretrained MLLMs alone. Extensive experiments across diverse datasets demonstrate that TruthLens sets a new benchmark in both forensic interpretability and detection accuracy, generalizing to seen and unseen manipulations alike. By unifying high-level scene understanding with fine-grained region grounding, TruthLens delivers transparent DeepFake forensics, bridging a critical gap in the literature.

TruthLens: Visual Grounding for Universal DeepFake Reasoning | SummarXiv | SummarXiv