Revisiting Speech-Lip Alignment: A Phoneme-Aware Speech Encoder for Robust Talking Head Synthesis

Yihuan Huang, Jiajun Liu, Yanzhen Ren, Wuyang Liu, Zongkun Sun

公開日: 2025/4/8

Abstract

Speech-driven talking head synthesis tasks commonly use general acoustic features as guided speech features. However, we discovered that these features suffer from phoneme-viseme alignment ambiguity, which refers to the uncertainty and imprecision in matching phonemes with visemes. To overcome this limitation, we propose a phoneme-aware speech encoder (PASE) that explicitly enforces accurate phoneme-viseme correspondence. PASE first captures fine-grained speech and visual features, then introduces a prediction-reconstruction task to improve robustness under noise and modality absence. Furthermore, a phoneme-level alignment module guided by phoneme embeddings and contrastive learning ensures discriminative audio and visual alignment. Experimental results show that PASE achieves state-of-the-art performance in both NeRF and 3DGS rendering models. Its lip sync accuracy improves by 13.7% and 14.2% compared to the acoustic feature, producing results close to the ground truth videos.

Revisiting Speech-Lip Alignment: A Phoneme-Aware Speech Encoder for Robust Talking Head Synthesis | SummarXiv | SummarXiv