Hierarchical Action Learning for Weakly-Supervised Action Segmentation
2026-02-27 • Computer Vision and Pattern Recognition
AI summary
The authors start from the observation that humans perceive actions through a few key transitions, whereas machines that rely on frame-level visual features tend to over-segment videos. They observe that high-level action variables change more slowly than low-level visual variables, which makes the action variables easier to identify over time. Using this idea, they propose the HAL model, which separates slow-changing action latents from fast-changing visual latents and links them in a hierarchy where actions govern visual dynamics. The model uses a hierarchical pyramid transformer and a sparse transition constraint that enforces slow changes in the action variables, and it is trained with weak supervision. Experiments on several benchmarks show that HAL outperforms previous methods for weakly-supervised action segmentation.
hierarchical reasoning, weakly-supervised learning, action segmentation, latent variables, transformer networks, timescale separation, causal data generation, pyramid transformer, sparse transition constraint, identifiability
Authors
Junxian Huang, Ruichu Cai, Hao Zhu, Juntao Fang, Boyan Xu, Weilin Chen, Zijian Li, Shenghua Gao
Abstract
Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that low-level visual and high-level action latent variables evolve at different rates: low-level visual variables change rapidly, while high-level action variables evolve more slowly, which makes them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process in which high-level latent action variables govern the dynamics of low-level visual features. To model these differing timescales, we introduce deterministic processes that align the latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and applies a sparse transition constraint to enforce the slower dynamics of the high-level action variables, which improves their identification over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that HAL significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
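The abstract does not state the exact form of the sparse transition constraint. As a minimal sketch of the general idea, one common way to encourage slowly changing latents is to penalize the L1 norm of frame-to-frame differences. The function name `transition_sparsity_penalty`, the L1 choice, and the toy sequences below are assumptions for illustration, not the authors' formulation.

```python
def transition_sparsity_penalty(latents):
    """Mean L1 norm of consecutive-frame differences.

    latents: list of per-frame latent vectors (lists of floats),
             all of the same length.
    NOTE: an assumed, illustrative penalty; the paper's actual
    constraint may differ.
    """
    if len(latents) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(latents, latents[1:]):
        # L1 distance between two adjacent frames
        total += sum(abs(c - p) for p, c in zip(prev, curr))
    return total / (len(latents) - 1)


# Hypothetical toy sequences (8 frames, 2-dimensional latents).
# A "slow" action-like latent with a single transition mid-sequence:
slow_action = [[0.0, 1.0]] * 4 + [[1.0, 0.0]] * 4
# A "fast" visual-like latent that flips on every frame:
fast_visual = [[0.0, 1.0], [1.0, 0.0]] * 4

slow_penalty = transition_sparsity_penalty(slow_action)
fast_penalty = transition_sparsity_penalty(fast_visual)
```

In this toy setting the sequence with a single transition receives a much smaller penalty than the one that changes at every frame, so adding such a term to a training loss would favor latents with few transitions, in the spirit of the slower high-level dynamics described above.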