Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
ROBOT-R1 presents a novel framework that leverages reinforcement learning to enhance embodied reasoning capabilities in Large Vision-Language Models for robotic control. Unlike conventional Supervised Fine-Tuning approaches that suffer from catastrophic forgetting and limited generalization, ROBOT-R1 trains models to predict next keypoint states through explicit reasoning processes optimized via the GRPO algorithm. The framework reformulates continuous state prediction as multiple-choice question-answering tasks, making the learning process more tractable while incorporating auxiliary tasks for current state and movement prediction. Despite using only 7 billion parameters, ROBOT-R1-trained models exceed the performance of larger commercial models including GPT-4o on spatial and movement reasoning tasks critical for low-level robot control. The paper introduces ROBOT-R1 Bench, a comprehensive benchmark featuring 215 open-ended questions evaluating four key reasoning types: planning, high-level action reasoning, movement reasoning, and spatial reasoning. Extensive evaluations across multiple benchmarks demonstrate ROBOT-R1's effectiveness: 28% improvement on embodied reasoning tasks, 31% improvement on EmbodiedBench Manipulation, and substantial gains on spatial reasoning benchmarks. Crucially, the learned reasoning capabilities transfer effectively to downstream tasks, real-world robot environments, and other robotic benchmarks, whereas SFT-trained models often show degradation on out-of-distribution tasks. The research reveals that embodied reasoning patterns during RL training naturally evolve toward shorter, more focused reasoning traces compared to the longer chains observed in mathematical reasoning domains. Models transition from summary-format reasoning to narrative-format responses that coherently connect reasoning components directly to task-relevant information. 
Ablation studies confirm the critical importance of auxiliary tasks and the MCQA framework design, while the approach remains robust across different random seeds and compatible with cold-start scenarios using pre-trained CoT data. Beyond embodied reasoning evaluation, ROBOT-R1 demonstrates practical impact on robot control performance, improving pick-and-place success rates on real robots from 16.67% to 23.96%. The framework is computationally efficient, requiring only 12 hours of training on four A100 GPUs, making it accessible for research labs and smaller organizations. These findings suggest that RL-optimized embodied reasoning represents a promising direction for developing practical robot intelligence that combines the reasoning capabilities of large models with task-specific optimization.
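The summary above credits GRPO's group-relative, variance-reducing reward normalization for the stable learning. As a minimal sketch of that mechanism (not the paper's implementation): for each prompt, several reasoning responses are sampled, and each response's reward is normalized against the statistics of its own group rather than the whole batch.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: normalize each
    sampled response's reward by the mean and std of its own group
    of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:  # all responses scored equally -> no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# Four sampled reasoning responses for one prompt; the two that led
# to the correct next-keypoint prediction received reward 1.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that outperform their siblings get positive advantage and are reinforced; identical rewards within a group yield zero advantage, which is one source of the stability the summary attributes to GRPO over batch-level normalization.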
Sumin Park
KAIST
Huiwon Jang
KAIST
Jinwoo Shin
KAIST
Jaehyung Kim
Yonsei University
Younggyo Seo
UC Berkeley
ROBOT-R1 achieves over 28% improvement in embodied reasoning tailored for low-level action control compared to SFT baselines
A 7B parameter model trained with ROBOT-R1 outperforms GPT-4o on low-level control reasoning tasks including spatial and movement reasoning
ROBOT-R1 achieves 31% improvement in task performance on EmbodiedBench Manipulation benchmark
On SpatialRGPT benchmark, ROBOT-R1 achieves approximately 40% improvement in quantitative metrics and 60% improvement in qualitative metrics
Multiple-choice question answering (MCQA) format enables more efficient learning compared to continuous action space prediction
Three auxiliary tasks (waypoint prediction, current state prediction, movement prediction) progressively improve overall performance
Models trained with ROBOT-R1 show consistent performance improvements across different random seeds
ROBOT-R1 trained models demonstrate better generalization to out-of-distribution tasks compared to SFT approaches
Reasoning patterns during ROBOT-R1 training show progressive shortening and refinement of responses rather than lengthening
GRPO algorithm with variance-reducing mechanisms proves more effective than REINFORCE++ for embodied reasoning tasks
ROBOT-R1 effectively transfers embodied reasoning capabilities to real-world robot environments despite training only on simulation
Metadata extraction including reference points, coordinate systems, and scale information is critical for LVLM performance
Explicit reasoning with <think></think> tags improves prediction accuracy and generalization
Performance on planning tasks slightly decreases because training focuses primarily on next keypoint prediction
SFT models with CoT guidance perform better than direct SFT but still show limited improvement in embodied reasoning
Models trained with ROBOT-R1 improve performance on VLABench M&T and Spatial tasks while potentially degrading on physics and complex reasoning tasks
High-level action reasoning capability emerges from training exclusively on low-level control information without explicit high-level supervision
ROBOT-R1 framework is efficient, requiring approximately 12 hours of training on four A100 GPUs for 5 epochs
LLM-as-judge evaluation shows Pearson correlations near 0.9 for high-level action, movement, and spatial reasoning tasks
ROBOT-R1 demonstrates effectiveness in cold-start scenarios where models are pre-trained with high-quality CoT SFT data before RL training
Embodied reasoning abilities acquired through ROBOT-R1 effectively transfer to downstream robot control tasks
On LIBERO simulation benchmark, ROBOT-R1 trained models show remarkable improvements particularly in Goal category tasks
Real robot experiments show ROBOT-R1 improves average pick-and-place success rate from 16.67% to 23.96%
Reward normalization mechanisms in GRPO lead to more stable learning compared to batch-level normalization
Reasoning response length naturally decreases during training as models transition from summary to narrative format
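The findings above mention two ingredients of the reward design: the MCQA reformulation of continuous state prediction and the explicit <think></think> reasoning tags. A hypothetical reward function combining them might look as follows; the specific reward values (0.5 for format, +1.0 for accuracy) and the exact response layout are illustrative assumptions, not the paper's stated scheme.

```python
import re

def mcqa_reward(response: str, correct_choice: str) -> float:
    """Hypothetical Robot-R1-style reward for one sampled response:
    a small format reward for wrapping reasoning in <think></think>,
    plus an accuracy reward if the final MCQA choice is correct."""
    # Format check: reasoning must appear inside <think>...</think>.
    if not re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        return 0.0
    # Accuracy check: look for the choice letter after the reasoning block.
    answer_part = response.split("</think>")[-1].strip()
    m = re.search(r"\b([A-D])\b", answer_part)
    correct = bool(m) and m.group(1) == correct_choice
    return 0.5 + (1.0 if correct else 0.0)

print(mcqa_reward("<think>The cube must move toward the bowl.</think> B", "B"))
```

Under this sketch, a response with correct reasoning format but the wrong choice still earns the format reward, while a bare letter with no reasoning earns nothing, pushing sampled responses toward the explicit-reasoning format the findings describe.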
The discussion reveals several important insights about ROBOT-R1's approach and effectiveness. The framework successfully addresses key limitations of SFT-based training methods by using reinforcement learning to optimize reasoning pathways specifically for embodied control tasks. The study demonstrates that explicit reasoning processes, when properly optimized through RL, enable models to develop more generalized embodied reasoning capabilities that transfer effectively to downstream tasks and even real-world robot environments. The training process naturally produces progressively shorter and more focused reasoning traces, diverging from trends observed in mathematical reasoning tasks where longer chains emerge. The paper highlights that auxiliary tasks for state prediction and movement estimation significantly enhance learning efficiency compared to end-to-end approaches. Future work should explore extensions to include gripper manipulation and end-effector rotation, address differences in coordinate systems between simulation environments, and investigate complementary reward designs to ensure safe robot execution while maintaining human oversight. The work suggests that RL-optimized embodied reasoning could accelerate robotics research by making advanced reasoning capabilities accessible through smaller, more practical models.