Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

2026-06-02 | Computer Vision and Pattern Recognition

AI summary

The authors created Demo2Tutorial, a system that turns screen recordings and interaction logs of people using software into clear, step-by-step tutorials. Their method extracts actions, goals, and intentions from the recorded interactions and organizes them into easy-to-follow instructions with images and text. In their tests, the generated tutorials helped humans complete tasks faster and improved how software agents plan and perform tasks on graphical user interfaces. This suggests that learning from real human interactions can improve both teaching and automated task planning.

screen recording, interaction logs, multimodal tutorials, GUI agent, task planning, action parsing, hierarchical task graphs, software tutorials, human learning, agent learning
Authors
Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou
Abstract
Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience, captured via screen recordings and interaction logs, into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses the raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts the parsed actions into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show that Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI-agent planning. These results demonstrate that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.
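The data flow described in the abstract (recorded events, then an Action Parser, a Step Planner, and a Tutorial Composer) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the Demo2Tutorial code: the `RawEvent`, `ParsedAction`, and `TaskNode` classes, the three function names, and the log schema are hypothetical, and a plain lookup table stands in for the multimodal model that infers intent.

```python
from dataclasses import dataclass, field


@dataclass
class RawEvent:
    """One logged interaction (hypothetical schema, not the paper's log format)."""
    t: float      # timestamp in seconds
    kind: str     # e.g. "click" or "type"
    target: str   # label of the UI element acted on
    frame: str    # path to the screenshot captured at time t


@dataclass
class ParsedAction:
    perception: str
    action: str
    intent: str
    frame: str


@dataclass
class TaskNode:
    goal: str
    steps: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_events(events, intent_of):
    """Toy stand-in for an action parser: a lookup table replaces the model."""
    parsed = []
    for e in sorted(events, key=lambda ev: ev.t):  # restore temporal order
        parsed.append(ParsedAction(
            perception=f"'{e.target}' is visible",
            action=f"{e.kind} '{e.target}'",
            intent=intent_of.get(e.target, "unknown"),
            frame=e.frame,
        ))
    return parsed


def plan_steps(task, actions):
    """Toy stand-in for a step planner: group consecutive actions by intent."""
    root = TaskNode(goal=task)
    for a in actions:
        if not root.children or root.children[-1].goal != a.intent:
            root.children.append(TaskNode(goal=a.intent))
        root.children[-1].steps.append(a)
    return root


def compose_tutorial(root):
    """Toy stand-in for a tutorial composer: render the graph as numbered text."""
    lines = [f"Tutorial: {root.goal}"]
    for i, sub in enumerate(root.children, 1):
        lines.append(f"{i}. {sub.goal}")
        for j, a in enumerate(sub.steps, 1):
            lines.append(f"   {i}.{j} {a.action} [image: {a.frame}]")
    return "\n".join(lines)


# Made-up example log, deliberately out of order.
events = [
    RawEvent(2.0, "click", "Export", "f2.png"),
    RawEvent(0.5, "click", "File", "f0.png"),
    RawEvent(1.0, "click", "Open", "f1.png"),
]
intent_of = {
    "File": "Open the document",
    "Open": "Open the document",
    "Export": "Export the result",
}
tutorial = compose_tutorial(
    plan_steps("Convert a file", parse_events(events, intent_of))
)
print(tutorial)
```

The sketch builds only a two-level graph (one task goal with sub-goals beneath it), whereas the abstract describes hierarchical task graphs in general, so the `children` field of each sub-goal is left unused here.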