CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

2026-06-18
Computer Vision and Pattern Recognition

AI summary

The authors created CalTennis, a large collection of multi-camera tennis videos for testing how well algorithms estimate 3D body pose from a single camera. The dataset is about 10 times larger than earlier in-the-wild human motion video datasets and is the first large-scale benchmark with synchronized multi-view recordings of expert athletes. Their collection protocol needs no specialized equipment, and video calibration and synchronization are fully automated. When testing current methods, they found that joint angles are estimated accurately, but depth and foot contact remain hard problems. They also introduced two new metrics, footwork and stability, which show where pose estimation can improve.

3D pose estimation, monocular video, multi-view video, motion capture (MOCAP), video calibration, synchronization, joint angle recovery, foot contact detection, performance metrics, tennis biomechanics
Authors
Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona
Abstract
The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, and qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
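The abstract does not spell out how the multi-view recordings are turned into a label-free score, so the following is only an illustrative sketch of the general idea and not the authors' protocol. It assumes a monocular method has been run independently on two synchronized views of the same frame, each producing 3D joints in its own camera coordinate system. If both predictions were correct they would differ only by a rigid transform, so their inter-joint distances should match. The function names and the toy three-joint pose are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(joints):
    """All inter-joint distances for one pose given as a list of (x, y, z) tuples."""
    return [math.dist(a, b) for a, b in combinations(joints, 2)]


def cross_view_disagreement(pose_view_a, pose_view_b):
    """Mean absolute difference in inter-joint distances between two
    predictions of the same frame (hypothetical metric, same units as input).

    Inter-joint distances are unchanged by rotation and translation, so no
    camera calibration or alignment step is needed for this comparison.
    """
    if len(pose_view_a) != len(pose_view_b):
        raise ValueError("both poses must have the same number of joints")
    if len(pose_view_a) < 2:
        raise ValueError("need at least two joints")
    dist_a = pairwise_distances(pose_view_a)
    dist_b = pairwise_distances(pose_view_b)
    return sum(abs(x - y) for x, y in zip(dist_a, dist_b)) / len(dist_a)


# Toy pose as seen from view A (arbitrary units).
pose_a = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 1.0)]
# Same pose rotated 90 degrees about z and translated: a consistent view B.
pose_b_consistent = [(2.0, 3.0, 4.0), (1.0, 3.0, 4.0), (1.0, 3.0, 5.0)]
# View B with the last joint pushed away, mimicking a depth error.
pose_b_distorted = [(2.0, 3.0, 4.0), (1.0, 3.0, 4.0), (1.0, 3.0, 6.0)]

consistent_score = cross_view_disagreement(pose_a, pose_b_consistent)
distorted_score = cross_view_disagreement(pose_a, pose_b_distorted)
```

Here `consistent_score` is essentially zero while `distorted_score` is clearly positive, which is the kind of signal that needs no manual labels. A score like this cannot tell which of the two views is wrong, and it ignores global position, so it would complement rather than replace calibrated multi-view comparisons.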