In this work, we present an approach for constructing a video-based robot policy that can execute diverse tasks across different robots and environments without any action annotations. Our method leverages images as a task-agnostic representation that encodes both state and action information. By synthesizing videos that "hallucinate" a robot executing actions, and by combining them with dense correspondences between frames, our approach infers, in closed form, the actions to execute in an environment without any explicit action labels. This unique capability allows us to train policies solely from RGB videos and deploy them on various robotic tasks. We demonstrate the efficacy of our approach in learning policies for table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models on just 4 GPUs within a single day.
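To make the closed-form action-inference step concrete, below is a minimal sketch of the idea: dense correspondences (e.g., optical flow) between two consecutive synthesized frames are lifted to 3D with depth and camera intrinsics, and the relative motion is solved in closed form via the Kabsch algorithm. The flow source, depth maps, and intrinsics `K` here are stand-in assumptions for illustration, not our exact implementation.

```python
# Sketch: closed-form action recovery from dense correspondences between two
# synthesized frames. `flow`, `depth_t`, `depth_t1`, and intrinsics `K` are
# assumed inputs (e.g., an off-the-shelf flow network and a calibrated camera).
import numpy as np

def lift_to_3d(pixels, depth, K):
    """Back-project integer pixel coordinates (N, 2) to camera-frame 3D points."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth[pixels[:, 1], pixels[:, 0]]
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def rigid_transform(P, Q):
    """Closed-form (Kabsch) rotation R and translation t aligning P to Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

# pixels_t: (N, 2) object pixels in frame t; flow: (H, W, 2) correspondences.
# pixels_t1 = np.round(pixels_t + flow[pixels_t[:, 1], pixels_t[:, 0]]).astype(int)
# R, t = rigid_transform(lift_to_3d(pixels_t, depth_t, K),
#                        lift_to_3d(pixels_t1, depth_t1, K))
# (R, t) is the relative motion to command, obtained without action labels.
```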
Meta-World (Yu et al., 2019) is a simulated benchmark featuring various manipulation tasks with a Sawyer robot arm. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
iTHOR (Kolve et al., 2017) is a simulated benchmark for embodied common sense reasoning. We consider the object navigation tasks for evaluation, where an agent, randomly initialized in a scene, learns to navigate to an object of a given type (e.g., toaster, television). We present the video plans synthesized by our video diffusion model as well as robot navigation videos as follows.
We aim to examine whether our method can achieve cross-embodiment learning, e.g., leveraging human demonstration videos to control robots to solve tasks. To this end, we learn a video diffusion model from only actionless human pushing videos from Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) and then evaluate our method on simulated robot pushing tasks without any fine-tuning. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
We aim to examine whether our method can tackle real-world robotics tasks. To this end, we train our video generation model on the Bridge dataset (Ebert et al., 2022) and perform evaluation on a real-world Franka Emika Panda tabletop manipulation environment. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
While most tasks in the Bridge data were recorded in toy kitchens, we found that the video diffusion model trained on this dataset can already generalize to complex real-world kitchen scenarios, producing reasonable videos given RGB images and textual task descriptions. We present some examples of the synthesized videos below. Note that the videos are blurry because the original video resolution is low (48×64).
We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when training on the Bridge dataset. Below we provide some qualitative examples of videos synthesized after 40k training steps.
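The shape-level difference between the two strategies is sketched below. The (B, C, T, H, W) tensor layout and variable names are assumptions about the model's interface, used only to illustrate where the conditioning frame enters.

```python
# Shape-level sketch of the two conditioning strategies (cat_c vs. cat_t);
# the exact model wiring is an assumption for illustration.
import torch

B, C, T, H, W = 2, 3, 8, 48, 64
noisy_video = torch.randn(B, C, T, H, W)   # video being denoised
first_frame = torch.randn(B, C, 1, H, W)   # observed conditioning frame

# cat_c (ours): tile the first frame across time and concatenate on channels,
# so every denoised frame sees the observation directly -> (B, 2C, T, H, W).
cond_c = torch.cat([noisy_video, first_frame.expand(-1, -1, T, -1, -1)], dim=1)

# cat_t (baseline): prepend the frame along the time axis -> (B, C, T+1, H, W);
# later frames only reach the observation through temporal interactions.
cond_t = torch.cat([first_frame, noisy_video], dim=2)
```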
This section investigates the possibility of accelerating the sampling process using Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). To this end, instead of iteratively denoising for 100 steps, as reported in the main paper, we experimented with smaller numbers of denoising steps (e.g., 25, 10, 5, 3) using DDIM. We found that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps) with DDIM, allowing us to tackle time-critical tasks. We present the videos synthesized with 25, 10, 5, and 3 denoising steps as follows.
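For reference, the sketch below shows the deterministic DDIM update (eta = 0) on a strided subsequence of the training timesteps: the denoiser's noise estimate yields an x0 prediction, which is re-noised to the previous selected timestep. `eps_model` (the noise predictor) and `alpha_bar` (the cumulative alphas of the training schedule) are placeholders for the trained video diffusion model.

```python
# Deterministic DDIM sampling (eta = 0) over a strided timestep subsequence.
# `eps_model` and `alpha_bar` are placeholders for the actual trained model.
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar, num_steps=10, device="cpu"):
    alpha_bar = alpha_bar.to(device)
    # e.g., pick 10 timesteps out of the 100 used during training
    steps = torch.linspace(len(alpha_bar) - 1, 0, num_steps).long()
    x = torch.randn(shape, device=device)            # start from pure noise
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < num_steps \
                 else torch.ones((), device=device)
        eps = eps_model(x, t.to(device).expand(shape[0]))    # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # x0 estimate
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # DDIM step
    return x

# e.g., 10-step sampling: ddim_sample(model, (1, 3, 8, 48, 64), alpha_bar, 10)
```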
In this section, we provide qualitative examples of common failure modes on the most challenging tasks for the BC baselines in our Meta-World setting. The tasks include shelf-place, basketball, and assembly, all of which require the robot to grasp an object and move it to another position. The most common failure mode we observed for BC policies is that they often make inaccurate grasps while still far from the object and do not recover (retry the grasp) afterward. This failure mode persists in plain BC even when 50 demonstrations per view are provided. Unlike BC, the Diffusion Policy makes contact with the object consistently and thus grasps successfully more often than BC does. We also observed some re-grasping behaviors from the Diffusion Policy (see the shelf-place and basketball tasks), which we did not observe with BC. However, the Diffusion Policy can still fail to grasp the object, because the object appears small in the image and can slip due to inaccurate grasps.