Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

2026-04-09 | Robotics

Robotics, Artificial Intelligence
AI summary

The authors explore using Conditional Neural Processes (CNP) to help robots predict their own future actions by learning from partial sensory information, inspired by how humans understand actions. They tested an existing model called Deep Modality Blending Network (DMBN) that can recreate visual and motor signals during actions, but found it struggles with new, unseen actions due to how it represents time. To fix this, they created an improved version named DMBN-Positional Time Encoding (DMBN-PTE) that better encodes temporal information, helping the model make more robust predictions. Their work is an early step toward robots that can learn to forecast actions over longer periods and update predictions as new data comes in.

Conditional Neural Processes, self-supervised learning, multimodal action prediction, Mirror Neuron System, Deep Modality Blending Network, temporal representation, positional encoding, robotics, probabilistic modeling, action forecasting
Authors
Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea, Alessandra Sciutti
Abstract
Inspired by the human ability to understand and predict others, we study the applicability of Conditional Neural Processes (CNP) to the task of self-supervised multimodal action prediction in robotics. Following recent results regarding the ontogeny of the Mirror Neuron System (MNS), we focus on the preliminary objective of self-action prediction. We find a good MNS-inspired model in the existing Deep Modality Blending Network (DMBN), which can reconstruct the visuo-motor sensory signal during a partially observed action sequence by leveraging the probabilistic generation of CNP. After a qualitative and quantitative evaluation, we highlight its difficulties in generalizing to unseen action sequences, and trace the cause to its internal representation of time. Therefore, we propose a revised version, termed DMBN-Positional Time Encoding (DMBN-PTE), that facilitates learning a more robust representation of temporal information, and provide preliminary results on its effectiveness in expanding the applicability of the architecture. DMBN-PTE is a first step in the development of robotic systems that autonomously learn to forecast actions on longer time scales, refining their predictions with incoming observations.
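
The abstract does not specify the form of the positional time encoding used in DMBN-PTE. As a purely illustrative sketch, the snippet below shows the standard sinusoidal positional encoding known from Transformer models, applied to a scalar time step. The function name `time_encoding` and the parameters `dim` and `max_period` are assumptions made for this example and are not taken from the paper.

```python
import math


def time_encoding(t, dim=8, max_period=10000.0):
    """Sinusoidal encoding of a scalar time t (hypothetical helper).

    Returns a list of `dim` values made of sin/cos pairs at geometrically
    spaced frequencies, as in Transformer positional encodings. This is an
    assumed scheme, not necessarily the one used by DMBN-PTE.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        # Frequency decreases geometrically with the pair index i.
        freq = 1.0 / (max_period ** (2 * i / dim))
        enc.append(math.sin(t * freq))
        enc.append(math.cos(t * freq))
    return enc


# Encode every time step of a short, 5-step action sequence.
sequence_length = 5
encodings = [time_encoding(step) for step in range(sequence_length)]
```

Replacing a single scalar time input with a vector like this gives a network several views of the same instant at different resolutions, which is one plausible way to obtain a more robust temporal representation. Whether DMBN-PTE uses this exact scheme is not stated in the abstract.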