Intend to Move (I2M) is a new multimodal dataset for embodied AI, designed for intention-aware human motion understanding in real-world environments.
Human motion is inherently intentional, yet most motion modeling paradigms remain confined to kinematic prediction, neglecting the semantic and causal structure that drives behavior. Existing datasets capture short, decontextualized actions in static settings, providing little grounding for embodied reasoning. We introduce Intend to Move (I2M), a large-scale, multimodal dataset for intention-aware embodied motion modeling. I2M contains 10.1 hours of two-person 3D motion sequences recorded in dynamic, realistic home environments, accompanied by synchronized multi-view RGB-D video, detailed 3D scene geometry, and timestamped language annotations of each participant’s evolving intentions. Benchmark experiments reveal a fundamental gap in current motion models: they fail to translate high-level goals into physically and socially coherent motion. I2M thus serves not only as a dataset but also as a benchmark for embodied intelligence, fostering the development of models that can reason about, predict, and act upon the ``why'' behind human motion.
