LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis


Haojie Zhang1*,   Zhihao Liang1*,   Ruibo Fu2,   Zhengqi Wen3,   Xuefei Liu2, Chenxing Li4,   Jianhua Tao3,5,   Yaling Liang1
1South China University of Technology
2Institute of Automation, Chinese Academy of Sciences
3Beijing National Research Center for Information Science and Technology, Tsinghua University
4AI Lab, Tencent   5Department of Automation, Tsinghua University

TL;DR: We present LetsTalk, an innovative Diffusion Transformer with tailored fusion schemes for audio-driven portrait animation, achieving excellent portrait consistency and liveliness in the generated animations.

We propose LetsTalk, a diffusion-based transformer for audio-driven portrait image animation. Left: Given a single reference image and audio, LetsTalk produces a realistic and vivid video aligned with the input audio. Note that each column corresponds to the same audio. The results show that LetsTalk drives mouth motions that are consistent and plausible for the input audio. Right: Generation quality vs. inference time on the HDTF dataset; the circle area reflects each method's parameter count. Compared with current mainstream diffusion-based methods such as Hallo and AniPortrait, LetsTalk achieves the best quality while remaining highly efficient at inference. In addition, our base version, LetsTalk-B, matches Hallo's performance with 8× fewer parameters.

Abstract


Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenge of this multimodality-guided video generation task lies in fusing the various modalities while ensuring temporal and portrait consistency; beyond consistency, we also seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to fuse the multimodal inputs and enhance spatial-temporal consistency. To handle the multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion, and thoroughly explore their impact and applicability. We then choose a scheme for each condition according to how the modality differs from the video being generated: for the portrait, we use a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency; for the audio, we use a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.


Method


Overview of our method (a) and illustration of our transformer block (b). For clarity, the timestep encoder and layer normalization are omitted in (b). LetsTalk stacks transformer blocks equipped with both temporal and spatial attention modules, designed to capture intra-frame spatial details and establish temporal correspondence across frames. After obtaining the portrait and audio embeddings, Symbiotic Fusion is applied to the portrait embedding and Direct Fusion to the audio embedding. Notably, we repeat the portrait embedding along the frame axis so that it matches the shape of the noise embedding.
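The block in (b) can be summarized in code. Below is a minimal PyTorch sketch, not the authors' implementation: the token shapes, the pre-norm layout, and the exact placement of the audio cross-attention are assumptions made for illustration.

    import torch
    import torch.nn as nn


    class SpatialTemporalBlock(nn.Module):
        # One backbone block: spatial attention within each frame, temporal
        # attention across frames, and cross-attention to the audio condition
        # (Direct Fusion). The portrait embedding is assumed to have been fused
        # earlier via Symbiotic Fusion (concatenated with the noise tokens).
        def __init__(self, dim: int, heads: int = 8):
            super().__init__()
            self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))
            self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

        def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            # x: (B, F, N, D) video latent tokens; audio: (B, F, M, D) audio tokens
            B, F, N, D = x.shape
            # Spatial attention: tokens within each frame attend to each other.
            s = x.reshape(B * F, N, D)
            h = self.norms[0](s)
            s = s + self.spatial_attn(h, h, h, need_weights=False)[0]
            # Temporal attention: each spatial location attends across frames.
            t = s.reshape(B, F, N, D).permute(0, 2, 1, 3).reshape(B * N, F, D)
            h = self.norms[1](t)
            t = t + self.temporal_attn(h, h, h, need_weights=False)[0]
            # Direct Fusion: video tokens query the per-frame audio tokens.
            q = t.reshape(B, N, F, D).permute(0, 2, 1, 3).reshape(B * F, N, D)
            a = audio.reshape(B * F, -1, D)
            q = q + self.audio_attn(self.norms[2](q), a, a, need_weights=False)[0]
            q = q + self.mlp(self.norms[3](q))
            return q.reshape(B, F, N, D)


    block = SpatialTemporalBlock(dim=512)
    x = torch.randn(2, 16, 64, 512)      # 2 clips, 16 frames, 64 latent tokens each
    audio = torch.randn(2, 16, 8, 512)   # 8 audio tokens per frame (assumed shape)
    print(block(x, audio).shape)         # torch.Size([2, 16, 64, 512])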


Illustration of the three multimodal fusion schemes; the blocks on the left form our transformer backbone. A minimal code sketch contrasting the three schemes follows this list.
(a) Direct Fusion. The condition is fed directly into each block's cross-attention module;
(b) Siamese Fusion. A parallel transformer of similar structure processes the condition, and its features guide the corresponding features in the backbone;
(c) Symbiotic Fusion. The condition is concatenated with the input at the beginning and fed into the backbone, so fusion is achieved by the backbone's inherent self-attention.
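The following sketch contrasts the three schemes at the level of data flow, under the same assumed token shapes as the block sketch above; the module choices and the simple additive guidance in (b) are illustrative assumptions, not the exact architecture.

    import torch
    import torch.nn as nn

    dim, heads = 512, 8
    x = torch.randn(2, 64, dim)      # backbone tokens of one frame: (B, N, D)
    cond = torch.randn(2, 64, dim)   # condition tokens, here of the same length

    # (a) Direct Fusion: the condition enters every backbone block through that
    # block's cross-attention; backbone tokens are queries, the condition is key/value.
    cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    x_direct = x + cross_attn(x, cond, cond, need_weights=False)[0]

    # (b) Siamese Fusion: a parallel transformer of similar structure processes the
    # condition, and its per-layer features guide the backbone features (shown here
    # as a simple addition).
    siamese_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    x_siamese = x + siamese_layer(cond)

    # (c) Symbiotic Fusion: the condition is concatenated with the input once, before
    # the backbone; the backbone's own self-attention then performs the fusion.
    x_symbiotic = torch.cat([x, cond], dim=1)   # (B, N + Nc, D), fed to the backbone

    print(x_direct.shape, x_siamese.shape, x_symbiotic.shape)

In LetsTalk, the portrait embedding (repeated along the frame axis) takes the symbiotic path, while the audio embedding takes the direct path, matching the depth of each fusion to how strongly the condition must constrain the output.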


Visualization


English Speaking 1



English Speaking 2



English Speaking 3



Chinese Speaking 1



Chinese Speaking 2



Singing



AI-generated Portraits


BibTeX


@misc{zhang2024letstalklatentdiffusiontransformer,
    title={LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis}, 
    author={Haojie Zhang and Zhihao Liang and Ruibo Fu and Zhengqi Wen and Xuefei Liu and Chenxing Li and Jianhua Tao and Yaling Liang},
    year={2024},
    eprint={2411.16748},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.16748}, 
}