Towards Unconstrained Human-Object Interaction
2026-04-15 • Computer Vision and Pattern Recognition
AI summary
The authors look at how computers understand interactions between people and objects in pictures, which usually depends on knowing a fixed set of possible actions beforehand. They use advanced language and vision models called Multimodal Large Language Models (MLLMs) to try a new approach where the computer doesn’t need a predefined list of actions. This new task, called Unconstrained HOI (U-HOI), lets the model recognize any interaction it sees. They test various MLLMs and create a method to turn the model's descriptive text into clear, structured data about interactions. Their work shows the current limitations of older methods and suggests MLLMs could make HOI detection more flexible.
Keywords: Human-Object Interaction (HOI), Multimodal Large Language Models (MLLMs), Unconstrained HOI (U-HOI), computer vision, interaction recognition, test-time inference, language-to-graph conversion, free-form text, structured data
Authors
Francesco Tonini, Alessandro Conti, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci
Abstract
Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interactions between humans and objects. Current HOI models rely on a fixed vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs in this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi.
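To make the language-to-graph conversion step concrete, the toy sketch below parses free-form MLLM output into structured (human, interaction, object) triplets. This is purely illustrative: the `HOITriplet` type and the regex-based parser are assumptions for demonstration, not the paper's actual pipeline, which would use a far more robust extraction method.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HOITriplet:
    """A structured interaction node: (human, interaction, object)."""
    human: str
    interaction: str
    object: str


def text_to_triplets(description: str) -> list[HOITriplet]:
    """Convert a free-form interaction description into HOI triplets.

    A minimal stand-in for language-to-graph conversion, assuming
    sentences of the form '<det> <human> [is] <verb> <det> <object>'.
    """
    pattern = re.compile(
        r"(?:a|an|the)\s+(person|man|woman)\s+(?:is\s+)?(\w+)\s+(?:a|an|the)\s+(\w+)",
        re.IGNORECASE,
    )
    return [
        HOITriplet(human.lower(), verb.lower(), obj.lower())
        for human, verb, obj in pattern.findall(description)
    ]


# Example: two interactions extracted from one free-form caption.
caption = "A person is riding a bicycle and the man is holding an umbrella."
print(text_to_triplets(caption))
```

In a real U-HOI pipeline the verbs are open-vocabulary (no predefined interaction list), so the structured output is whatever interaction phrase the MLLM produces rather than a class index.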