Video Results

Automatic Synthetic Data Generation

Sitting Interaction Trees
Lifting Interaction Trees

Comparison to Baseline Methods

Sitting
Lifting

Additional Qualitative Results

Sitting
Lifting

Automatic Synthetic Data Generation (Sec 3.3)
Here we show qualitative samples of the interaction motions generated by our automated data synthesis method described in Sec 3.3 of the main paper. We collect anchor interaction poses from the BEHAVE dataset and use a pre-trained object-agnostic motion model (HuMoR) to synthesize the data. Each video here visualizes one generated tree starting from a single anchor pose. Note that the final frame of each motion is this anchor pose. The final dataset we collect uses several trees generated for each anchor pose.
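The tree structure described above can be illustrated with a small toy sketch. It is not the implementation from the paper: `sample_step` is a hypothetical stand-in for one step of a pre-trained motion model (it is not the HuMoR API), a pose is reduced to a single float, and the parameters `depth`, `branching`, and `segment_len` are illustrative. The sketch also assumes the tree is grown outward from the anchor and then reversed in time, which is one simple way to guarantee that every motion ends at the anchor pose.

```python
import random


def sample_step(pose, rng):
    """Hypothetical stand-in for one stochastic step of a motion model.

    A real model would return a full-body pose; here a pose is a single
    float and one step simply adds Gaussian noise.
    """
    return pose + rng.gauss(0.0, 1.0)


def grow_tree(anchor, depth, branching, segment_len, rng):
    """Return every motion in a tree rooted at `anchor`.

    Each motion is a list of poses ordered in time, and its final frame
    is the anchor pose.
    """
    # While the tree is growing, each partial motion is stored anchor-first.
    partial = [[anchor]]
    for _ in range(depth):
        extended = []
        for motion in partial:
            # Every existing motion spawns `branching` random continuations.
            for _ in range(branching):
                branch = list(motion)
                for _ in range(segment_len):
                    branch.append(sample_step(branch[-1], rng))
                extended.append(branch)
        partial = extended
    # Reverse each motion so that time runs toward the anchor pose.
    return [motion[::-1] for motion in partial]


rng = random.Random(0)
motions = grow_tree(anchor=0.0, depth=3, branching=2, segment_len=5, rng=rng)
```

With these toy settings the tree contains 2^3 = 8 motions of 1 + 3 * 5 = 16 frames each. Motions that branch from the same parent share their final frames and differ only in the earlier ones, which mirrors the variety visible in the videos for a single anchor pose.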

Sitting Motions
Examples of trees generated for sitting on a chair (top) and table (bottom). Note the variety of motions that are synthesized based on the same anchor pose in each video, due to the random sampling of the motion model.

Lifting Motions
Examples of trees generated for lifting a suitcase (left) and stool (right).

Comparison to Baseline Methods (Sec 4.4)
Here we show qualitative examples comparing our full method using the learned object interaction field (NIFTY) to baselines on sitting and lifting actions. Generally, the Cond. VAE baseline (left) produces unrealistic motions with foot skating and actions that are incompatible with the object. The Cond. MDM baseline (middle) produces realistic motions from the diffusion model, but performs the action far away from the object since it does not use interaction field guidance. Our method (right) consistently moves close to the object and carries out the interaction faithfully.

Sitting Motions (60-second video clip with 11 different interactions)
Sitting interactions generated for chairs, tables, yogaballs, and stools. A new interaction starts after every crossfade.

Lifting Motions (60-second video clip with 11 different interactions)
Lifting interactions generated for chairs, tables, suitcases, and stools. A new interaction starts after every crossfade.

NIFTY: Additional Qualitative Results (Sec 4.4)
Here we show additional qualitative results of our full method, which uses a conditional motion diffusion model paired with a learned object interaction field. Each video shows about 10 results, each 5 seconds long, for a different object instance. The object is randomly placed in the scene to evaluate generalization.

Sitting Motions
Our method only requires a small set of anchor poses to generate synthetic training data, so we are able to train models for interactions with little available mocap data such as sitting on a table, yogaball, and stool. New interactions start at every crossfade.


Wooden Chair (71 seconds ~ 13 interactions)		Arm Chair (60 seconds ~ 11 interactions)		Square Table (93 seconds ~ 17 interactions)


Yogaball (60 seconds ~ 11 interactions)		Stool (65 seconds ~ 12 interactions)

Lifting Motions
Sampled interactions for lifting chairs, tables, suitcases, and stools. New interactions start at every crossfade.


Wooden Chair (77 seconds ~ 14 interactions)		Stool (82 seconds ~ 15 interactions)

Square Table (77 seconds ~ 14 interactions)		Suitcase (77 seconds ~ 14 interactions)