Collecting, Curating, and Annotating Good Quality Speech deepfake dataset for Famous Figures: Process and Challenges

Hashim Ali, Surya Subramani, Raksha Varahamurthy, Nithin Adupa, Lekha Bollinani, Hafiz Malik

Published: 2025/6/30

Abstract

Recent advances in speech synthesis have introduced unprecedented challenges in maintaining voice authenticity, particularly concerning public figures who are frequent targets of impersonation attacks. This paper presents a comprehensive methodology for collecting, curating, and generating synthetic speech data for political figures and a detailed analysis of challenges encountered. We introduce a systematic approach incorporating an automated pipeline for collecting high-quality bonafide speech samples, featuring transcription-based segmentation that significantly improves synthetic speech quality. We experimented with various synthesis approaches; from single-speaker to zero-shot synthesis, and documented the evolution of our methodology. The resulting dataset comprises bonafide and synthetic speech samples from ten public figures, demonstrating superior quality with a NISQA-TTS naturalness score of 3.69 and the highest human misclassification rate of 61.9\%.

Read Full Paper (arXiv.org)