ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors


Zihao Huang1,2,3   Tianqi Liu1,2,3   Zhaoxi Chen2   Shaocong Xu3   Saining Zhang2,3  
Lixing Xiao5   Zhiguo Cao1   Wei Li2   Hao Zhao4,3*   Ziwei Liu2*

1HUST     2NTU     3BAAI     4AIR, THU     5ZJU

*Corresponding Authors

TL;DR


ArtHOI is the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from monocular video priors. It achieves RGB rendering, articulated object modeling, physical constraint modeling, and zero-shot generalization, all without 3D supervision.



Diverse Articulated Interactions

ArtHOI synthesizes realistic human-object interactions with articulated objects such as doors, cabinets, fridges, and microwaves, while maintaining proper contact and respecting each object's kinematic constraints.



Abstract


ArtHOI Teaser

Synthesizing physically plausible articulated human-object interactions (HOI) without 3D/4D supervision remains a fundamental challenge. While recent zero-shot approaches leverage video diffusion models to synthesize human-object interactions, they are largely confined to rigid-object manipulation and lack explicit 4D geometric reasoning. To bridge this gap, we formulate articulated HOI synthesis as a 4D reconstruction problem from monocular video priors: given only a video generated by a diffusion model, we reconstruct a full 4D articulated scene without any 3D supervision. This reconstruction-based approach treats the generated 2D video as supervision for an inverse rendering problem, recovering geometrically consistent and physically plausible 4D scenes that naturally respect contact, articulation, and temporal coherence. We introduce ArtHOI, the first zero-shot framework for articulated human-object interaction synthesis via 4D reconstruction from video priors. Our key designs are: 1) Flow-based part segmentation: leveraging optical flow as a geometric cue to disentangle dynamic from static regions in monocular video; 2) Decoupled reconstruction pipeline: joint optimization of human motion and object articulation is unstable under monocular ambiguity, so we first recover object articulation, then synthesize human motion conditioned on the reconstructed object states. ArtHOI bridges video-based generation and geometry-aware reconstruction, producing interactions that are both semantically aligned and physically grounded. Across diverse articulated scenes (e.g., opening fridges, cabinets, microwaves), ArtHOI significantly outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity.
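
To make key design 1 concrete, below is a minimal sketch of flow-based dynamic/static disentanglement. It assumes per-frame optical flow from an off-the-shelf estimator (e.g., RAFT) and a separately estimated camera-induced flow; the threshold and the voting scheme are illustrative placeholders, not the paper's exact recipe.

import numpy as np

def dynamic_mask(flow, ego_flow, thresh=1.5):
    """Label pixels whose residual motion exceeds `thresh` pixels.

    flow:     (H, W, 2) observed optical flow between frames t and t+1
    ego_flow: (H, W, 2) flow induced by camera motion alone
    Returns a boolean (H, W) mask of dynamic (articulated) pixels.
    """
    residual = flow - ego_flow                 # cancel camera-induced motion
    return np.linalg.norm(residual, axis=-1) > thresh

def accumulate_masks(flows, ego_flows, thresh=1.5, min_votes=3):
    """A pixel counts as dynamic if it moves in at least `min_votes`
    frame pairs, which suppresses single-frame flow noise."""
    votes = sum(dynamic_mask(f, e, thresh).astype(np.int32)
                for f, e in zip(flows, ego_flows))
    return votes >= min_votes

The accumulated mask can then seed a segmenter such as SAM to obtain clean part boundaries, as the pipeline description below details.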



Method


ArtHOI Pipeline

Overview of ArtHOI. Given a monocular video \(\mathcal{V} = \{I(t)\}_{t=1}^T\) (generated from a text prompt or captured from real scenes), we formulate interaction synthesis as a 4D reconstruction problem. Stage I identifies articulated object parts via flow-based segmentation (point tracking, SAM-guided masks, back-projection to 3D Gaussians, quasi-static binding), then recovers object articulation under kinematic constraints. Stage II refines human motion (SMPL-X) conditioned on the reconstructed 4D object scaffold, using 3D contact keypoints derived from 2D evidence and enforcing kinematic, collision, and foot-sliding losses. The decoupled design avoids the instability of jointly optimizing human and object under monocular ambiguity, and yields geometrically consistent, physically plausible articulated HOI.
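
As a rough illustration of the Stage II objective, the sketch below combines the contact, collision, and foot-sliding terms named above. The tensor shapes, the `sdf` callable, and the loss weights are hypothetical stand-ins, not the paper's exact formulation.

import torch

def contact_loss(hand_pos, contact_target):
    # pull the interacting hand onto the 3D contact keypoint on the object
    return ((hand_pos - contact_target) ** 2).sum(dim=-1).mean()

def collision_loss(body_points, sdf):
    # penalize body points with negative signed distance (inside the object)
    d = sdf(body_points)                        # (T, P) signed distances
    return torch.relu(-d).mean()

def foot_sliding_loss(foot_pos, contact_mask):
    # feet flagged as in ground contact (0/1 mask) should not translate
    vel = foot_pos[1:] - foot_pos[:-1]          # (T-1, F, 3) foot velocity
    return (contact_mask[1:].unsqueeze(-1) * vel).pow(2).sum(-1).mean()

def stage2_objective(hand_pos, contact_target, body_points, sdf,
                     foot_pos, contact_mask,
                     w_contact=1.0, w_coll=0.1, w_foot=0.5):
    # weighted sum of the three physical terms; weights are illustrative
    return (w_contact * contact_loss(hand_pos, contact_target)
            + w_coll * collision_loss(body_points, sdf)
            + w_foot * foot_sliding_loss(foot_pos, contact_mask))

In this sketch the objective is minimized over the SMPL-X motion parameters while the reconstructed object states from Stage I stay fixed, which is what makes the optimization decoupled.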



Comparisons

ArtHOI outperforms prior methods in contact accuracy, penetration reduction, and articulation fidelity. Zero-shot baselines (e.g., ZeroHSI) treat objects as rigid and fail to model part-wise articulation.
Comparisons
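
To ground the reported metrics, here is a hedged sketch of how contact accuracy and penetration could be quantified, assuming sampled human surface points, a hand trajectory, and an object SDF; the paper's actual evaluation protocol may differ in thresholds and sampling.

import torch

def contact_accuracy(hand_pos, surface_pts, tau=0.05):
    """Fraction of frames where the hand lies within `tau` meters of the
    object. hand_pos: (T, 3); surface_pts: (S, 3) object surface samples."""
    d = torch.cdist(hand_pos, surface_pts).min(dim=-1).values  # (T,)
    return (d < tau).float().mean()

def mean_penetration(human_pts, sdf):
    """Average penetration depth of human surface points, where
    sdf(p) < 0 means p lies inside the object."""
    return torch.relu(-sdf(human_pts)).mean()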


Citation


@article{huang2026arthoi,
  title={ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors},
  author={Huang, Zihao and Liu, Tianqi and Chen, Zhaoxi and Xu, Shaocong and Zhang, Saining and Xiao, Lixing and Cao, Zhiguo and Li, Wei and Zhao, Hao and Liu, Ziwei},
  journal={arXiv preprint arXiv:2603.04338},
  year={2026}
}