ENVISION: Embodied Visual Planning via Goal-Imagery Video Diffusion


USC · Bytedance Intelligent Creation · Stanford · MIT · MBZUAI

Abstract

Embodied visual planning guides manipulation by imagining how a scene evolves toward a goal, yet existing video diffusion models remain purely forward-predictive, causing drift and goal misalignment. We introduce Envision, which explicitly constrains generation with a goal image to ensure physical plausibility and goal consistency. Envision operates in two stages: a Goal Imagery Model produces a coherent, task-relevant goal image, and a first-and-last-frame-conditioned video diffusion model interpolates between the start and goal states. Across manipulation and editing benchmarks, Envision achieves superior goal alignment, spatial consistency, and object preservation, providing reliable visual plans for downstream robotic control.
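To make the two-stage inference described above concrete, the sketch below shows how a goal image could first be imagined and then used to anchor video generation. It is a minimal illustration only: the class and method names (GoalImageryModel, EnvGoalVideoModel, generate_goal, generate_video) are hypothetical placeholders, not the released API.

```python
# Minimal sketch of the two-stage inference pipeline (illustrative only).
# GoalImageryModel / EnvGoalVideoModel and their methods are assumed names.

from dataclasses import dataclass
from typing import List

import torch
from PIL import Image


@dataclass
class VisualPlan:
    goal_image: Image.Image        # imagined final (goal) frame
    frames: List[Image.Image]      # full start-to-goal video plan


def plan(env_image: Image.Image,
         instruction: str,
         goal_model: "GoalImageryModel",
         video_model: "EnvGoalVideoModel",
         num_frames: int = 16) -> VisualPlan:
    """Two-stage visual planning: imagine the goal, then interpolate toward it."""
    with torch.no_grad():
        # Stage 1: the Goal Imagery Model predicts a task-relevant goal frame
        # from the current environment image and the language instruction.
        goal_image = goal_model.generate_goal(env_image, instruction)

        # Stage 2: a first-and-last-frame-conditioned video diffusion model
        # synthesizes the intermediate frames between the start (environment)
        # image and the imagined goal image.
        frames = video_model.generate_video(
            first_frame=env_image,
            last_frame=goal_image,
            prompt=instruction,
            num_frames=num_frames,
        )

    return VisualPlan(goal_image=goal_image, frames=frames)
```

The resulting frame sequence can then be handed to a downstream controller as a visual plan, with the imagined goal frame serving as the terminal target.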

Method Pipeline

Figure 1: Given a single environment image and an instruction prompt, our pipeline generates a physically plausible and goal-aligned video of the instructed manipulation in two stages. Each stage corresponds to a trainable component: (left) a Goal Imagery Model that predicts the target goal frame, and (right) an Env–Goal Video Model that synthesizes the full sequence conditioned on both the environment and goal images.
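One common way to realize this kind of first-and-last-frame conditioning in a latent video diffusion sampler is to re-inject the known boundary frames at every denoising step, so only the in-between motion is synthesized. The toy sketch below illustrates that replacement-style scheme; it is an assumed implementation for exposition, and the actual Env–Goal Video Model may condition differently. The `denoiser` argument stands in for the trained video model.

```python
# Toy sketch of replacement-style first-and-last-frame conditioning in a
# latent video diffusion sampler (assumed illustration, not the paper's code).

import torch


def sample_env_goal_video(
    denoiser,                       # callable: (latents, step) -> predicted noise
    start_latent: torch.Tensor,     # (C, H, W) latent of the environment image
    goal_latent: torch.Tensor,      # (C, H, W) latent of the goal image
    num_frames: int = 16,
    num_steps: int = 50,
) -> torch.Tensor:
    """Return denoised video latents of shape (num_frames, C, H, W)."""
    latents = torch.randn(num_frames, *start_latent.shape)
    # Toy cumulative-alpha schedule running from very noisy to nearly clean.
    alpha_bars = torch.linspace(0.05, 0.999, num_steps)

    for step in range(num_steps):
        a_t = alpha_bars[step]

        # Re-inject the known boundary frames at the current noise level so
        # the trajectory stays anchored to the start and goal states.
        latents[0] = a_t.sqrt() * start_latent + (1 - a_t).sqrt() * torch.randn_like(start_latent)
        latents[-1] = a_t.sqrt() * goal_latent + (1 - a_t).sqrt() * torch.randn_like(goal_latent)

        # Schematic deterministic (DDIM-like) update toward the next noise level.
        eps = denoiser(latents, step)
        x0_hat = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        a_next = alpha_bars[step + 1] if step + 1 < num_steps else torch.tensor(1.0)
        latents = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps

    # Boundary frames are clean by construction at the end of sampling.
    latents[0], latents[-1] = start_latent, goal_latent
    return latents
```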

Experimental Results

Comparison to Previous Works

Visual comparison against baselines on the Taste-Rob and RT-1 datasets.

Ablation on Goal-Image Generation

Cross-Embodiment Video Generation

Comparison to Previous Works on Robot Execution



Failure Cases

We observe failures in cases of extreme occlusion.
