In this work, we present an approach for constructing a video-based robot policy that can execute diverse tasks across different robots and environments without any action annotations. Our method leverages images as a task-agnostic representation that encodes both state and action information. By synthesizing videos that "hallucinate" a robot executing actions, and by combining them with dense correspondences between frames, our approach infers, in closed form, the actions to execute in an environment without any explicit action labels. This unique capability allows us to train policies solely from RGB videos and deploy them on various robotic tasks. We demonstrate the efficacy of our approach in learning policies for table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models on just 4 GPUs within a single day.
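To make the closed-form action-inference step concrete, below is a minimal sketch of the idea: dense correspondences (e.g., optical flow) between two consecutive synthesized frames are lifted to 3D with depth and camera intrinsics, and the relative motion is solved in closed form via the Kabsch algorithm. The flow source, depth maps, and intrinsics `K` here are stand-in assumptions for illustration, not our exact implementation.

```python
# Sketch: closed-form action recovery from dense correspondences between two
# synthesized frames. `flow`, `depth_t`, `depth_t1`, and intrinsics `K` are
# assumed inputs (e.g., an off-the-shelf flow network and a calibrated camera).
import numpy as np

def lift_to_3d(pixels, depth, K):
    """Back-project integer pixel coordinates (N, 2) to camera-frame 3D points."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth[pixels[:, 1], pixels[:, 0]]
    x = (pixels[:, 0] - cx) * z / fx
    y = (pixels[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def rigid_transform(P, Q):
    """Closed-form (Kabsch) rotation R and translation t aligning P to Q."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cp).T @ (Q - cq))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cq - R @ cp

# pixels_t: (N, 2) object pixels in frame t; flow: (H, W, 2) correspondences.
# pixels_t1 = np.round(pixels_t + flow[pixels_t[:, 1], pixels_t[:, 0]]).astype(int)
# R, t = rigid_transform(lift_to_3d(pixels_t, depth_t, K),
#                        lift_to_3d(pixels_t1, depth_t1, K))
# (R, t) is the relative motion to command, obtained without action labels.
```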
Meta-World (Yu et al., 2019) is a simulated benchmark featuring various manipulation tasks with a Sawyer robot arm. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
iTHOR (Kolve et al., 2017) is a simulated benchmark for embodied common sense reasoning. We consider the object navigation tasks for evaluation, where an agent, randomly initialized in a scene, learns to navigate to an object of a given type (e.g., toaster, television). We present the video plans synthesized by our video diffusion model as well as robot navigation videos as follows.
We aim to examine whether our method can achieve cross-embodiment learning, e.g., leveraging human demonstration videos to control robots to solve tasks. To this end, we learn a video diffusion model from only actionless human pushing videos from Visual Pusher (Schmeckpeper et al., 2021; Zakka et al., 2022) and then evaluate our method on simulated robot pushing tasks without any fine-tuning. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
We aim to examine whether our method can tackle real-world robotics tasks. To this end, we train our video generation model on the Bridge dataset (Ebert et al., 2022) and perform evaluation on a real-world Franka Emika Panda tabletop manipulation environment. We present the video plans synthesized by our video diffusion model as well as robot execution videos as follows.
While most tasks in the Bridge data were recorded in toy kitchens, we found that the video diffusion model trained on this dataset can already generalize to complex real-world kitchen scenarios, producing reasonable videos given RGB images and textual task descriptions. We present some examples of the synthesized videos below. Note that the videos are blurry because the original video resolution is low (48×64).
We compare our proposed first-frame conditioning strategy (cat_c) with the naive frame-wise concatenation strategy (cat_t). Our method (cat_c) consistently outperforms the frame-wise concatenation baseline (cat_t) when training on the Bridge dataset. Below we provide some qualitative examples of videos synthesized after 40k training steps.
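The shape-level difference between the two strategies is sketched below. The (B, C, T, H, W) tensor layout and variable names are assumptions about the model's interface, used only to illustrate where the conditioning frame enters.

```python
# Shape-level sketch of the two conditioning strategies (cat_c vs. cat_t);
# the exact model wiring is an assumption for illustration.
import torch

B, C, T, H, W = 2, 3, 8, 48, 64
noisy_video = torch.randn(B, C, T, H, W)   # video being denoised
first_frame = torch.randn(B, C, 1, H, W)   # observed conditioning frame

# cat_c (ours): tile the first frame across time and concatenate on channels,
# so every denoised frame sees the observation directly -> (B, 2C, T, H, W).
cond_c = torch.cat([noisy_video, first_frame.expand(-1, -1, T, -1, -1)], dim=1)

# cat_t (baseline): prepend the frame along the time axis -> (B, C, T+1, H, W);
# later frames only reach the observation through temporal interactions.
cond_t = torch.cat([first_frame, noisy_video], dim=2)
```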
This section investigates the possibility of accelerating the sampling process using Denoising Diffusion Implicit Models (DDIM; Song et al., 2021). To this end, instead of iteratively denoising for 100 steps, as reported in the main paper, we experimented with smaller numbers of denoising steps (e.g., 25, 10, 5, 3) using DDIM. We found that we can generate high-fidelity videos with only 1/10 of the sampling steps (10 steps) with DDIM, allowing us to tackle time-critical tasks. We present the videos synthesized with 25, 10, 5, and 3 denoising steps as follows.
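For reference, the sketch below shows the deterministic DDIM update (eta = 0) on a strided subsequence of the training timesteps: the denoiser's noise estimate yields an x0 prediction, which is re-noised to the previous selected timestep. `eps_model` (the noise predictor) and `alpha_bar` (the cumulative alphas of the training schedule) are placeholders for the trained video diffusion model.

```python
# Deterministic DDIM sampling (eta = 0) over a strided timestep subsequence.
# `eps_model` and `alpha_bar` are placeholders for the actual trained model.
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar, num_steps=10, device="cpu"):
    alpha_bar = alpha_bar.to(device)
    # e.g., pick 10 timesteps out of the 100 used during training
    steps = torch.linspace(len(alpha_bar) - 1, 0, num_steps).long()
    x = torch.randn(shape, device=device)            # start from pure noise
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < num_steps \
                 else torch.ones((), device=device)
        eps = eps_model(x, t.to(device).expand(shape[0]))    # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # x0 estimate
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # DDIM step
    return x

# e.g., 10-step sampling: ddim_sample(model, (1, 3, 8, 48, 64), alpha_bar, 10)
```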
In this section, we provide qualitative examples of common failure modes on the most challenging tasks for the BC baselines in our Meta-World setting. The tasks include shelf-place, basketball, and assembly, all of which require the robot to grasp an object and move it to another position. The most common failure mode we observed for BC policies is that they often make inaccurate grasps while still far from the object and do not recover (retry the grasp) afterward. This failure mode persists in plain BC even when 50 demonstrations per view are provided. Unlike BC, the Diffusion Policy makes contact with the object consistently and thus grasps successfully more often than BC does. We also observed some re-grasping behaviors from the Diffusion Policy (see the shelf-place and basketball tasks), which we did not observe with BC. However, the Diffusion Policy can still fail to grasp the object, because the object appears small in the image and can slip due to inaccurate grasps.