UniMotion: Self-Supervised Learning for Cross-Domain IMU Motion Recognition

2026-03-12
Human-Computer Interaction

AI summary

The authors created UniMotion, a system that can recognize gestures from different devices, such as smartwatches and earbuds, and for different types of users, such as blind and sighted people. Instead of needing many labeled examples, UniMotion first learns from abundant unlabeled movement data, then fine-tunes using a small set of labeled gestures. It uses special techniques to focus on the important parts of the motion and to tell apart similar gestures. Testing showed UniMotion works well across different devices and users, achieving high accuracy with less training data than other methods.

IMU, gesture recognition, self-supervised learning, motion representation, pre-training, fine-tuning, wearable devices, smartwatches, earbuds, human activity data
Authors
Prerna Khanna, Tanmay Srivastava, Shubham Jain, Aruna Balasubramanian
Abstract
IMU-based gesture interfaces are being increasingly adopted as efficient, accessible, and intuitive alternatives to traditional input methods, such as touchscreens and voice. However, current gesture recognition algorithms are tailored to work for specific devices (e.g., smartwatches vs. earbuds) or user populations (e.g., blind vs. sighted users), limiting their generalizability. In this paper, we design UniMotion, a generalized IMU-based gesture recognition framework that works across devices and populations with minimal training samples. To overcome the challenges and high cost of collecting large-scale labeled training data, UniMotion leverages readily available unlabeled human activity data. The UniMotion pipeline comprises two stages: (1) pre-training a motion representation model using abundant unlabeled human activity data, and (2) fine-tuning it with a small amount of labeled gesture data. For pre-training, we introduce a token-based strategy and embeddings that learn to identify and focus attention on the key motion signatures in the temporal data. For fine-tuning, we design a text-guided classifier that can reliably differentiate between temporally or semantically similar gestures. We evaluate UniMotion across both hand gestures (captured through a smartwatch) and earbud gestures (captured through earbuds), using data collected from blind and sighted users. Across these diverse devices and user populations, UniMotion achieves 85% accuracy across an average of 13 gesture classes using only 10% of the labeled data for training. UniMotion significantly outperforms state-of-the-art self-supervised learning approaches and specialized gesture recognition models.
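The two-stage pipeline described in the abstract can be illustrated with a minimal toy sketch. This is not the paper's method: PCA on unlabeled windows stands in for the self-supervised motion-representation pre-training, and a nearest-centroid classifier in the learned embedding space stands in for the text-guided fine-tuned classifier. All data, dimensions, and class names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage 1: "pre-training" on unlabeled motion windows (toy stand-in) ---
# Each row is a flattened IMU window; a PCA projection learned from unlabeled
# data plays the role of the pre-trained motion representation.
unlabeled = rng.normal(size=(500, 30))          # 500 unlabeled activity windows
mu = unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(unlabeled - mu, full_matrices=False)
encoder = vt[:8].T                               # keep an 8-dim representation

def embed(x):
    """Project raw IMU windows into the learned representation space."""
    return (x - mu) @ encoder

# --- Stage 2: "fine-tuning" with a small labeled gesture set ---
# Ten labeled windows per gesture class (e.g., "swipe" vs. "tap" -- illustrative
# labels only); a nearest-centroid rule in embedding space is the classifier.
labels = np.repeat([0, 1], 10)
labeled = np.vstack([rng.normal(loc=c, size=(10, 30)) for c in (-2.0, 2.0)])
centroids = np.array([embed(labeled[labels == c]).mean(axis=0) for c in (0, 1)])

def classify(x):
    """Return the index of the nearest class centroid in embedding space."""
    d = np.linalg.norm(embed(x)[None, :] - centroids, axis=1)
    return int(d.argmin())
```

Because `embed` is affine, the embedding of each class's mean window coincides with that class's centroid, so `classify(labeled[labels == 1].mean(axis=0))` returns 1. The paper's actual pipeline replaces both stages with learned neural components (token-based pre-training and a text-guided classification head).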