In this work, we present TalkCuts, a large-scale benchmark dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech video with diverse camera shots, including close-up, half-body, and full-body views. The dataset covers a wide range of identities and includes detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, enabling multimodal learning and evaluation. As a first step toward demonstrating the value of the dataset, we present Orator, an LLM-guided multimodal generation framework that serves as a simple baseline, in which the language model acts as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulation, and vocal modulation. This design enables the synthesis of coherent long-form videos through our integrated multimodal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning. The dataset, tools, and evaluation protocols will be publicly released to facilitate community progress.
We present samples from the TalkCuts dataset, which features a diverse collection of videos from talk shows, TED talks, stand-up comedy, and other speech scenarios.
We provide visualizations of the 2D keypoint and 3D SMPL-X annotations in the TalkCuts dataset.
We compare our generated human speech video with state-of-the-art audio-driven human video generation baselines.
We demonstrate a long, audio-driven human speech video generated with multiple camera shots.
We provide results from several state-of-the-art pose-driven human video generation baselines before and after fine-tuning on our TalkCuts dataset.
We demonstrate a long human speech video with multiple camera shots generated using the fine-tuned ControlNeXT model.