Vision Language Models are In-Context Value Learners

Work done while interning at Google DeepMind.

  • 1Google DeepMind
  • 2University of Pennsylvania
  • 3Stanford University


Interactive Illustration

GVL Zero-Shot results on OXE and 250 challenging bimanual tasks. Explore task completion predictions and frame-by-frame analysis across diverse robotic manipulation scenarios.


Abstract

Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such a progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods that can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and value-weighted regression -- all without any model training or finetuning.

Method Overview

In-Context Value Learning as Temporal Ordering

GVL auto-regressively predicts task completion percentages over shuffled frames, enabling impressive in-context value learning. GVL can effectively zero-shot and few-shot predict task progress on diverse and challenging real-world tasks; these capabilities enable an expansive set of downstream applications, including dataset filtering, success detection, and policy learning.
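
For intuition, below is a minimal Python sketch of this shuffle-predict-unshuffle loop. The query_vlm callable and the prompt wording are placeholders for whatever VLM interface is used; this is not the authors' implementation.

```python
import random

def gvl_predict_values(frames, task_description, query_vlm, seed=0):
    """Sketch of GVL-style value prediction over shuffled frames.

    frames: chronologically ordered images from one trajectory.
    query_vlm: hypothetical callable taking (text_prompt, images) and
        returning one task-completion percentage (0-100) per image,
        in the order the images were given.
    Returns predicted completion percentages in chronological order.
    """
    # Shuffle the frames so the VLM cannot rely on temporal correlation
    # between consecutive frames and must judge each frame on its own.
    order = list(range(len(frames)))
    random.Random(seed).shuffle(order)
    shuffled_frames = [frames[i] for i in order]

    prompt = (
        f"Task: {task_description}\n"
        "The following frames are shuffled. For each frame, output the "
        "task completion percentage between 0 and 100."
    )
    shuffled_values = query_vlm(prompt, shuffled_frames)

    # Un-shuffle: place each prediction back at its chronological index.
    values = [None] * len(frames)
    for shuffled_pos, original_idx in enumerate(order):
        values[original_idx] = shuffled_values[shuffled_pos]
    return values
```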

Experiments

Through extensive experiments, we find that GVL predicts effective values zero-shot and few-shot for more than 300 real-world tasks across diverse robot platforms, and that these value predictions unlock downstream applications such as dataset filtering, success detection, and value-weighted policy learning.

Evaluation Metric: Value-Order Correlation

We introduce a lightweight, yet predictive method for evaluating value models at scale on real-world robotic datasets: Value-Order Correlation (VOC). This metric computes the rank correlation between the predicted values and the chronological order of the input expert video:

VOC = RankCorr((v_1, ..., v_T), (1, 2, ..., T)), where v_t is the predicted value for frame t.

VOC ranges from −1 to 1, where 1 indicates that the two orderings are perfectly aligned. Expert-quality demonstrations, by construction, have values that monotonically increase with time, and thus a good value model should have high VOC scores when evaluated on expert videos. Conversely, given a good value model, low-quality trajectories should receive low VOC scores. This is because sub-optimal trajectories often contain many visually similar frames due to redundant re-attempt actions or poorly placed cameras. As such, the values along these trajectories should not be monotonic, resulting in low correlation with the ground-truth timestep order.
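
Concretely, VOC can be computed in a few lines of Python. The sketch below assumes Spearman's rank correlation; the text above only specifies a generic rank correlation.

```python
from scipy.stats import spearmanr

def value_order_correlation(predicted_values):
    """Rank correlation between predicted values and chronological order."""
    timesteps = list(range(len(predicted_values)))
    rho, _ = spearmanr(predicted_values, timesteps)
    return rho

# An illustrative monotonically increasing prediction scores 1.0;
# a scrambled prediction scores near 0 or below.
print(value_order_correlation([0, 10, 25, 40, 70, 95, 100]))  # 1.0
print(value_order_correlation([50, 10, 80, 20, 60, 5, 30]))   # about -0.21
```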

GVL Demonstrates Multi-Modal In-Context Learning Capabilities

GVL demonstrates appealing in-context scaling: the average VOC score steadily improves as we increase the number of in-context examples. Even with 5 in-context trajectories, meaning 150 total shuffled images, GVL is able to utilize its full context and exhibit strong generalization. This result demonstrates how state-of-the-art long-context-window VLMs, such as Gemini-1.5-Pro, can be re-purposed as general-purpose value functions with impressive test-time improvement capability, quickly mastering value prediction with minimal supervision.

Figure: in-context learning results.
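
For illustration, the sketch below shows one way such a multimodal few-shot prompt could be assembled from annotated example trajectories. The prompt wording and annotation format are assumptions rather than the paper's exact prompt, and the example frames may come from any embodiment, including the human videos discussed next.

```python
def build_incontext_request(task_description, example_trajectories, query_frames):
    """Assemble an interleaved text/image request for a long-context VLM.

    example_trajectories: list of (frames, values) pairs serving as
        in-context examples; frames and values are in chronological order.
    query_frames: shuffled frames of the trajectory to be evaluated.
    Returns a list of parts (strings and images) to send to the VLM.
    """
    parts = [
        f"Task: {task_description}. For each frame, predict the task "
        "completion percentage (0-100). Annotated example trajectories follow."
    ]
    for i, (frames, values) in enumerate(example_trajectories, start=1):
        parts.append(f"Example trajectory {i}:")
        for frame, value in zip(frames, values):
            parts.append(frame)                          # image part
            parts.append(f"Task completion: {value:.0f}%")  # its annotation
    parts.append("Now predict completion for these shuffled frames:")
    parts.extend(query_frames)
    return parts
```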

Cross-Embodiment and Cross-Task In-Context Learning

In-context examples are not limited to robot demonstrations. One advantage of GVL is that it can still benefit from in-context learning even when the demonstrations come from a different embodiment. Specifically, we record humans performing the same tasks as the ALOHA robot demonstrations and then use these human demonstrations as in-context examples for value prediction. As shown, GVL with one cross-embodiment in-context example can effectively improve over its zero-shot counterpart. In the Appendix, we also show that GVL can similarly benefit from cross-task in-context learning. In conclusion, GVL presents a versatile framework for in-context value learning that can scale to even the most challenging manipulation tasks.


Downstream Applications

Dataset Quality Estimation

We demonstrate that GVL's VOC scores can be used to estimate dataset quality. For each dataset in OXE, we compute the average VOC score over its sampled trajectories and present the ranking of the average scores (a code sketch of this procedure follows the table below). The full results are presented in the paper's Appendix; here, we present a subset of representative large-scale datasets in OXE. The datasets have a large spread in their VOC scores, but these scores are interpretable and match human intuition. Specifically, datasets collected by human teleoperators with relatively fixed camera placements, such as RT-1 (Brohan et al., 2022), Dobb-E (Shafiullah et al., 2023), and Bridge (Ebert et al., 2021; Walke et al., 2023), have high VOC scores despite their diversity in scenes and tasks. In contrast, datasets collected autonomously via scripted motions or motor babbling, such as QT-OPT (Kalashnikov et al., 2018) and RoboNet (Dasari et al., 2019), contain a large number of suboptimal trajectories that lack smooth temporal structure, and therefore receive low VOC scores.

Dataset | Avg. VOC
RT-1 (Brohan et al., 2022) | 0.74
Dobb-E (Shafiullah et al., 2023) | 0.53
Bridge (Walke et al., 2023) | 0.51
QT-OPT (Kalashnikov et al., 2018) | 0.19
DROID (Khazatsky et al., 2024) | -0.01
RoboNet (Dasari et al., 2019) | -0.85
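
The ranking above can be reproduced with a simple aggregation, sketched below; the function and variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

def rank_datasets_by_voc(trajectory_scores):
    """trajectory_scores: iterable of (dataset_name, voc_score) pairs,
    one entry per sampled trajectory. Returns (dataset, avg_voc) pairs,
    highest average VOC first."""
    per_dataset = defaultdict(list)
    for dataset, voc in trajectory_scores:
        per_dataset[dataset].append(voc)
    ranking = [(name, mean(scores)) for name, scores in per_dataset.items()]
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```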

Success Detection and Filtered BC

The VOC score can be thresholded to detect task success. The resulting success detection method, GVL-SD, substantially outperforms SuccessVQA using the same VLM on all evaluation metrics. Furthermore, filtered BC with GVL-SD always outperforms the base imitation learning algorithm (ACT) regardless of the threshold value.
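
A minimal sketch of this recipe, with an illustrative (not the paper's) threshold value:

```python
def is_success(voc_score, threshold=0.5):
    """Classify a trajectory as successful if its VOC score clears a
    threshold (0.5 is illustrative, not the paper's value)."""
    return voc_score >= threshold

def filtered_bc_dataset(trajectories, voc_scores, threshold=0.5):
    """Filtered BC: keep only trajectories detected as successful; the base
    imitation learner (e.g. ACT) is then trained on this subset."""
    return [
        traj for traj, score in zip(trajectories, voc_scores)
        if is_success(score, threshold)
    ]
```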

Method | Accuracy | Precision | Recall
GVL-SD (Zero-Shot) | 0.71 | 0.71 | 0.71
GVL-SD (One-Shot) | 0.75 | 0.85 | 0.70
SuccessVQA (Du et al., 2023) | 0.62 | 0.33 | 0.73
SuccessVQA-CoT | 0.63 | 0.44 | 0.68
Table: comparison with success detection baselines.

Qualitatively, we indeed see that GVL assigns much higher scores to successful trajectories than failure trajectories; in contrast, removing the shuffling mechanism in GVL significantly reduces its discriminability on failure trajectories.


Real-World Advantage-Weighted Regression

We illustrate how GVL can assign importance weights to individual transitions within trajectories at a fine-grained level, akin to offline reinforcement learning. For these experiments we use real-world demonstration data collected by human teleoperation on bimanual ALOHA robot setups. Unlike in simulation, our datasets contain only successful task executions, but these can still be sub-optimal and multi-modal. Thus, we directly utilize GVL's values with advantage-weighted regression (AWR) (Peters & Schaal, 2007; Peng et al., 2019), in which we weight each individual transition by the estimated advantage, i.e., the GVL value difference for that step:

w_t ∝ exp(A_t / β), where A_t = V(o_{t+1}) − V(o_t) is the GVL value difference at step t and β is a temperature.
Real-World ALOHA Task | GVL + DP | DP | Avg. VOC
bowl-in-rack | 7/10 | 6/10 | 0.57
banana-handover | 7/10 | 5/10 | 0.73
close-laptop | 9/10 | 6.5/10 | 0.59
open-drawer | 4/10 | 6/10 | 0.09
remove-gears | 4.67/10 | 7/10 | 0.19
pen-handover | 1.5/10 | 0/10 | 0.43
fold-dress | 7/10 | 7/10 | 0.66
Figure: reward heatmap.

We use diffusion policy (DP) (Chi et al., 2023) as the policy backbone for each task, and compare diffusion policies trained with and without GVL (One-Shot) advantage weighting. We evaluate on 7 tasks with 10 trials per task and report success rates in Table 3. On a majority of tasks, GVL-DP outperforms DP, and we see a clear correlation between the improvement over DP and the VOC score. That is, when the value predictions are of high quality as judged by VOC, policy learning benefits from GVL value weighting. On open-drawer and remove-gears, the top-down view does not provide sufficient resolution to distinguish task progress (see Fig. 8); as a consequence, the value predictions can be noisy, which can hurt policy learning. However, given the in-context learning results, we believe it is possible to improve policy learning even on difficult tasks with non-ideal camera viewpoints.
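
For reference, the sketch below turns a trajectory's GVL value predictions into per-transition AWR weights using the exponentiated-advantage form of Peng et al. (2019); the temperature and normalization choices are assumptions, not values reported here.

```python
import numpy as np

def gvl_awr_weights(completion_percentages, beta=1.0):
    """Per-transition AWR weights from one trajectory's GVL predictions.

    completion_percentages: predicted task completion (0-100) for frames
        0..T in chronological order. Returns one weight per transition
        (t -> t+1).
    """
    values = np.asarray(completion_percentages, dtype=np.float64) / 100.0
    advantages = values[1:] - values[:-1]   # GVL value difference per step
    weights = np.exp(advantages / beta)     # exponentiated advantage (AWR)
    return weights / weights.mean()         # normalize for training stability

# The weighted imitation objective would then be, e.g.:
#   loss = (weights * per_transition_bc_loss).mean()
```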

Qualitative Results

Below, we show qualitative results of policies trained with advantage-weighted regression on several robotic manipulation tasks. For each task, we show both successful and failed attempts.

Videos show autonomous policy execution at 1x speed.

GVL Online Demo (0-shot)

GVL Live Demo

Upload a video, enter your Gemini API key and task description, then shuffle the frames and click "Get Response" to analyze them.

After receiving the response, you can click "Parse Response" to see the predicted task completion percentage for each frame. You can toggle back to the ground-truth order to examine the predicted value function as well as the caption.
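
Under the hood, the demo amounts to a single multimodal request. Below is a minimal sketch using the google-generativeai Python SDK with Gemini 1.5 Pro; the prompt wording and parsing details are illustrative, not the demo's exact code.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def query_shuffled_frames(frame_paths, task_description):
    """Send already-shuffled frames to Gemini and ask for per-frame completion."""
    frames = [Image.open(path) for path in frame_paths]
    prompt = (
        f"Task: {task_description}\n"
        "These frames are shuffled. For each frame, in order, output the "
        "task completion percentage between 0 and 100."
    )
    response = model.generate_content([prompt, *frames])
    return response.text  # parsed downstream into one value per frame
```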

Or click one of the examples below to try:

Folding Dress Example
Glass on Rack Example
Green numbers show ground truth frame order. Red numbers show shuffled frame order.


Acknowledgement

We thank Jie Tan, Pannag Sanketi, Oliver Groth, and the rest of the Google DeepMind Robotics team for helpful discussions and feedback on the paper.

BibTeX

[arxiv]
@article{ma2024generative,
    author    = {Ma, Yecheng Jason and Hejna, Joey and Wahid, Ayzaan and Fu, Chuyuan and Shah, Dhruv and Liang, Jacky and Xu, Zhuo and Kirmani, Sean and Xu, Peng and Driess, Danny and Xiao, Ted and Tompson, Jonathan and Bastani, Osbert and Jayaraman, Dinesh and Yu, Wenhao and Zhang, Tingnan and Sadigh, Dorsa and Xia, Fei},
    title     = {Vision Language Models are In-Context Value Learners},
    journal   = {arXiv preprint},
    year      = {2024}
}