Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
ROBOT-R1 presents a novel framework that leverages reinforcement learning to enhance embodied reasoning capabilities in Large Vision-Language Models for robotic control. Unlike conventional Supervised Fine-Tuning approaches that suffer from catastrophic forgetting and limited generalization, ROBOT-R1 trains models to predict next keypoint states through explicit reasoning processes optimized via the GRPO algorithm. The framework reformulates continuous state prediction as multiple-choice question-answering tasks, making the learning process more tractable while incorporating auxiliary tasks for current state and movement prediction. Despite using only 7 billion parameters, ROBOT-R1-trained models exceed the performance of larger commercial models including GPT-4o on spatial and movement reasoning tasks critical for low-level robot control. The paper introduces ROBOT-R1 Bench, a comprehensive benchmark featuring 215 open-ended questions evaluating four key reasoning types: planning, high-level action reasoning, movement reasoning, and spatial reasoning. Extensive evaluations across multiple benchmarks demonstrate ROBOT-R1's effectiveness: 28% improvement on embodied reasoning tasks, 31% improvement on EmbodiedBench Manipulation, and substantial gains on spatial reasoning benchmarks. Crucially, the learned reasoning capabilities transfer effectively to downstream tasks, real-world robot environments, and other robotic benchmarks, whereas SFT-trained models often show degradation on out-of-distribution tasks. The research reveals that embodied reasoning patterns during RL training naturally evolve toward shorter, more focused reasoning traces compared to the longer chains observed in mathematical reasoning domains. Models transition from summary-format reasoning to narrative-format responses that coherently connect reasoning components directly to task-relevant information. 
Ablation studies confirm the critical importance of auxiliary tasks and the MCQA framework design, while the approach remains robust across different random seeds and compatible with cold-start scenarios using pre-trained CoT data. Beyond embodied reasoning evaluation, ROBOT-R1 demonstrates practical impact on robot control performance, improving pick-and-place success rates on real robots from 16.67% to 23.96%. The framework is computationally efficient, requiring only 12 hours of training on four A100 GPUs, making it accessible for research labs and smaller organizations. These findings suggest that RL-optimized embodied reasoning represents a promising direction for developing practical robot intelligence that combines the reasoning capabilities of large models with task-specific optimization.
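The summary above credits GRPO's group-relative, variance-reducing reward normalization for the stable learning. As a minimal sketch of that mechanism (not the paper's implementation): for each prompt, several reasoning responses are sampled, and each response's reward is normalized against the statistics of its own group rather than the whole batch.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    sampled response's reward by the mean and std of its own group
    of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:  # all responses scored equally -> no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Four sampled reasoning responses for one prompt; the two that led
# to the correct next-keypoint prediction received reward 1.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that outperform their siblings get positive advantage and are reinforced; identical rewards within a group yield zero advantage, which is one source of the stability the summary attributes to GRPO over batch-level normalization.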
Sumin Park
KAIST
Huiwon Jang
KAIST
Jinwoo Shin
KAIST
Jaehyung Kim
Yonsei University
Younggyo Seo
UC Berkeley
ROBOT-R1 achieves over 28% improvement in embodied reasoning tailored for low-level action control compared to SFT baselines
A 7B parameter model trained with ROBOT-R1 outperforms GPT-4o on low-level control reasoning tasks including spatial and movement reasoning
ROBOT-R1 achieves 31% improvement in task performance on EmbodiedBench Manipulation benchmark
On SpatialRGPT benchmark, ROBOT-R1 achieves approximately 40% improvement in quantitative metrics and 60% improvement in qualitative metrics
Multiple-choice question answering (MCQA) format enables more efficient learning compared to continuous action space prediction
Three auxiliary tasks (waypoint prediction, current state prediction, movement prediction) progressively improve overall performance
Models trained with ROBOT-R1 show consistent performance improvements across different random seeds
ROBOT-R1 trained models demonstrate better generalization to out-of-distribution tasks compared to SFT approaches
Reasoning patterns during ROBOT-R1 training show progressive shortening and refinement of responses rather than lengthening
GRPO algorithm with variance-reducing mechanisms proves more effective than REINFORCE++ for embodied reasoning tasks
ROBOT-R1 effectively transfers embodied reasoning capabilities to real-world robot environments despite training only on simulation
Metadata extraction including reference points, coordinate systems, and scale information is critical for LVLM performance
Explicit reasoning with <think></think> tags improves prediction accuracy and generalization
Performance on planning tasks slightly decreases because training focuses primarily on next keypoint prediction
SFT models with CoT guidance perform better than direct SFT but still show limited improvement in embodied reasoning
Models trained with ROBOT-R1 improve performance on VLABench M&T and Spatial tasks while potentially degrading on physics and complex reasoning tasks
High-level action reasoning capability emerges from training exclusively on low-level control information without explicit high-level supervision
ROBOT-R1 framework is efficient, requiring approximately 12 hours of training on four A100 GPUs for 5 epochs
LLM-as-judge evaluation shows Pearson correlations near 0.9 for high-level action, movement, and spatial reasoning tasks
ROBOT-R1 demonstrates effectiveness in cold-start scenarios where models are pre-trained with high-quality CoT SFT data before RL training
Embodied reasoning abilities acquired through ROBOT-R1 effectively transfer to downstream robot control tasks
On LIBERO simulation benchmark, ROBOT-R1 trained models show remarkable improvements particularly in Goal category tasks
Real robot experiments show ROBOT-R1 improves average pick-and-place success rate from 16.67% to 23.96%
Reward normalization mechanisms in GRPO lead to more stable learning compared to batch-level normalization
Reasoning response length naturally decreases during training as models transition from summary to narrative format
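The findings above mention two ingredients of the reward design: the MCQA reformulation of continuous state prediction and the explicit <think></think> reasoning tags. A hypothetical reward function combining them might look as follows; the specific reward values (0.5 for format, +1.0 for accuracy) and the exact response layout are illustrative assumptions, not the paper's stated scheme.

```python
import re

def mcqa_reward(response: str, correct_choice: str) -> float:
    """Hypothetical Robot-R1-style reward for one sampled response:
    a small format reward for wrapping reasoning in <think></think>,
    plus an accuracy reward if the final MCQA choice is correct."""
    # Format check: reasoning must appear inside <think>...</think>.
    if not re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        return 0.0
    # Accuracy check: look for the choice letter after the reasoning block.
    answer_part = response.split("</think>")[-1].strip()
    m = re.search(r"\b([A-D])\b", answer_part)
    correct = bool(m) and m.group(1) == correct_choice
    return 0.5 + (1.0 if correct else 0.0)

print(mcqa_reward("<think>The cube must move toward the bowl.</think> B", "B"))
```

Under this sketch, a response with correct reasoning format but the wrong choice still earns the format reward, while a bare letter with no reasoning earns nothing, pushing sampled responses toward the explicit-reasoning format the findings describe.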
The discussion reveals several important insights about ROBOT-R1's approach and effectiveness. The framework successfully addresses key limitations of SFT-based training methods by using reinforcement learning to optimize reasoning pathways specifically for embodied control tasks. The study demonstrates that explicit reasoning processes, when properly optimized through RL, enable models to develop more generalized embodied reasoning capabilities that transfer effectively to downstream tasks and even real-world robot environments. The training process naturally produces progressively shorter and more focused reasoning traces, diverging from trends observed in mathematical reasoning tasks where longer chains emerge. The paper highlights that auxiliary tasks for state prediction and movement estimation significantly enhance learning efficiency compared to end-to-end approaches. Future work should explore extensions to include gripper manipulation and end-effector rotation, address differences in coordinate systems between simulation environments, and investigate complementary reward designs to ensure safe robot execution while maintaining human oversight. The work suggests that RL-optimized embodied reasoning could accelerate robotics research by making advanced reasoning capabilities accessible through smaller, more practical models.