VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
2026-05-04 • Computer Vision and Pattern Recognition • Machine Learning
AI summary
The authors introduce VideoNet, a new benchmark that focuses on recognizing domain-specific actions across many different areas to test how well video-language models (VLMs) understand videos. They show that current VLMs have a hard time identifying actions correctly, even when given helpful examples or clues. To help improve this, the authors created a large training dataset with nearly 500k video question-answer pairs and used it to fine-tune a model, which then performed better than other similar models. Their work highlights the challenges VLMs face with action recognition and suggests that training on specialized data can help.
VideoNet • Action Recognition • Vision-Language Models (VLMs) • Domain-Specific Actions • Few-Shot Learning • In-Context Examples • Fine-Tuning • Video Question-Answering • Benchmark Dataset
Authors
Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna
Abstract
Videos are unique in their ability to capture actions that transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a renewed focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B achieves a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter; Qwen improves $+7.0\%$, while Gemini declines $-4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.
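The evaluation protocol described in the abstract (multiple-choice accuracy, a binary relaxation, and $k\in\{1,2,3\}$ in-context examples) can be made concrete with a minimal sketch. The sketch below is illustrative only: `build_prompt`, `model.answer`, and the dataset field names are hypothetical placeholders, not the authors' actual evaluation code or any real library API.

```python
# Illustrative sketch of a VideoNet-style evaluation loop.
# All helper names and dataset fields are hypothetical placeholders.
import random


def build_prompt(question, choices, examples=()):
    """Assemble a prompt: k in-context examples (video reference + labeled
    action), followed by the query question and its answer options."""
    parts = []
    for ex in examples:  # k in {1, 2, 3} few-shot examples
        parts.append(f"Example video: {ex['video']}\nAction: {ex['action']}")
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    parts.append(f"{question}\n{options}\nAnswer with a single letter.")
    return "\n\n".join(parts)


def evaluate(model, dataset, k=0):
    """Accuracy over the benchmark. Random chance is 1/len(choices):
    e.g. 25% for 4-way multiple choice, 50% in the binary setting."""
    correct = 0
    for item in dataset:
        examples = random.sample(item["support_set"], k) if k else []
        prompt = build_prompt(item["question"], item["choices"], examples)
        # Hypothetical VLM call: takes the query video plus the text prompt.
        pred = model.answer(video=item["video"], prompt=prompt)
        correct += pred.strip().upper().startswith(item["answer"])
    return correct / len(dataset)
```

Under this reading, the zero-shot multiple-choice, binary, and few-shot numbers reported above would correspond to `evaluate(model, dataset, k=0)` with four or two answer choices, and `evaluate(model, dataset, k=1..3)` for the in-context setting.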