Embodied visual planning guides manipulation by imagining how a scene evolves toward a goal, yet existing video diffusion models remain purely forward-predictive, causing drift and goal misalignment. We introduce Envision, which explicitly constrains generation with a goal image to ensure physical plausibility and goal consistency. Envision operates in two stages: a Goal Imagery Model produces a coherent, task-relevant goal image, and a first-and-last-frame-conditioned video diffusion model interpolates between the start and goal states. Across manipulation and editing benchmarks, Envision achieves superior goal alignment, spatial consistency, and object preservation, providing reliable visual plans for downstream robotic control.
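To make the two-stage pipeline described above concrete, the following Python sketch outlines the inference flow under stated assumptions: the objects goal_model and video_model, and the methods generate and sample, are hypothetical placeholders for the Goal Imagery Model and the first-and-last-frame-conditioned video diffusion model, not a released API.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class VisualPlan:
    """An imagined frame sequence from the current observation to the goal state."""
    frames: List[np.ndarray]


def plan_with_envision(observation: np.ndarray,
                       instruction: str,
                       goal_model,      # hypothetical Goal Imagery Model wrapper
                       video_model,     # hypothetical first/last-frame-conditioned diffusion wrapper
                       num_frames: int = 16) -> VisualPlan:
    """Two-stage visual planning: imagine a goal image, then interpolate toward it.

    Stage 1: the Goal Imagery Model maps the current observation and the language
    instruction to a coherent, task-relevant goal image.
    Stage 2: the video diffusion model is conditioned on both the first frame
    (the observation) and the last frame (the goal image) and fills in the
    intermediate frames, constraining generation to remain goal-consistent.
    """
    goal_image = goal_model.generate(observation, instruction)   # assumed interface
    frames = video_model.sample(first_frame=observation,         # assumed interface
                                last_frame=goal_image,
                                prompt=instruction,
                                num_frames=num_frames)
    return VisualPlan(frames=frames)

The returned frame sequence can then be handed to a downstream controller that tracks the plan; that step is outside the scope of this sketch.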
Visual comparison against baselines on the Taste-Rob and RT-1 datasets. Tasks shown:
Take the green glasses from the open drawer and put them on the desktop, human hand
Move the blue and white eraser next to the calendar in English, human hand
Move sponge near blue chip bag, Google robot
Move rxbar blueberry near green jalapeno chip bag, Google robot
Move pepsi can near rxbar chocolate, Google robot
Pick banana from white bowl, Google robot
Move the blue-white eraser next to the pink mug, human hand
Place blue plastic bottle into middle drawer, Google robot
Move the green scissors next to the pink calculator, human hand
Move the front apple from left to right, human hand
Move the paint bucket from left to right, human hand
Pick up the corn and put it on the right side of the sausages, Google robot
Move the red Chinese chess piece forward on the board
Pick up the wine bottle and place it on the right side of the table
Stack the blue cube on top of the red cube
Place the apple on the bucket
We observe failures in cases of extreme occlusion.
Failure case: hang the tool in multiple stages.