TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Abstract

In this work, we present TalkCuts, a large-scale benchmark dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech video with diverse camera shots, including close-up, half-body, and full-body views. The dataset covers a wide range of identities and includes detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, enabling multimodal learning and evaluation. As a first step toward showcasing the value of the dataset, we present Orator, an LLM-guided multimodal generation framework that serves as a simple baseline, in which the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulation, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multimodal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning. The dataset, tools, and evaluation protocols will be publicly released to facilitate community progress.

TalkCuts Dataset

Dataset Overview

We present samples from the TalkCuts dataset, which features a diverse collection of videos spanning talk shows, TED talks, stand-up comedy, and other speech scenarios.

Dataset Annotations

We visualize the 2D keypoint and 3D SMPL-X annotations of the TalkCuts dataset.
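For readers who want to inspect the annotations programmatically, the sketch below shows one plausible way to load a clip's 2D keypoints and SMPL-X parameters and overlay the keypoints on an image plane. The clip identifier, file names, and array layouts are illustrative assumptions, not the released schema; consult the dataset documentation once it is public.

```python
# Minimal sketch for inspecting per-clip annotations.
# NOTE: file names and field layouts below are assumptions for illustration.
import json
import numpy as np
import matplotlib.pyplot as plt

CLIP = "clip_000001"  # hypothetical clip identifier

# 2D keypoints: assumed shape (num_frames, num_joints, 3) = (x, y, confidence)
kpts = np.load(f"{CLIP}_keypoints2d.npy")  # hypothetical file name

# SMPL-X parameters: assumed per-frame pose arrays stored as JSON
with open(f"{CLIP}_smplx.json") as f:  # hypothetical file name
    smplx_params = json.load(f)
body_pose = np.asarray(smplx_params["body_pose"])  # e.g. (num_frames, 21, 3) axis-angle

# Plot frame-0 keypoints, keeping only confident detections.
frame0 = kpts[0]
visible = frame0[:, 2] > 0.5
plt.scatter(frame0[visible, 0], frame0[visible, 1], s=8, c="red")
plt.gca().invert_yaxis()  # image coordinates: y grows downward
plt.title(f"{CLIP}: frame 0, {int(visible.sum())} confident joints")
plt.show()
```

To recover full-body meshes from the pose parameters, the official smplx Python package can be used together with the corresponding SMPL-X model files.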


Audio-driven Video Generation Comparison

We compare our generated human speech videos with those produced by state-of-the-art audio-driven human video generation baselines.

Audio-driven Multi-Shot Long Video Generation Results

We demonstrate a long, audio-driven human speech video generated with multiple camera shots.

Pose-driven Video Generation Demonstrations

We present results from state-of-the-art pose-driven human video generation baselines before and after fine-tuning on our TalkCuts dataset.

Pose-driven Multi-Shot Long Video Generation Results

We demonstrate a long human speech video with multiple camera shots, generated using the ControlNeXT model fine-tuned on TalkCuts.