Preprint (2025)

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

xeuron.com/p/robot-r1-reinforcement-learning-for-enhanced-embodied-reasoning-in-robotics

AI Metadata Extraction

Version: Extraction v2 (anthropic/claude-haiku-4-5-20251001, 4/28/2026)

Executive Summary

ROBOT-R1 presents a novel framework that leverages reinforcement learning to enhance embodied reasoning capabilities in Large Vision-Language Models for robotic control. Unlike conventional Supervised Fine-Tuning approaches, which suffer from catastrophic forgetting and limited generalization, ROBOT-R1 trains models to predict next keypoint states through explicit reasoning processes optimized via the GRPO algorithm. The framework reformulates continuous state prediction as multiple-choice question-answering (MCQA) tasks, making the learning process more tractable, and incorporates auxiliary tasks for current-state and movement prediction. Despite using only 7 billion parameters, ROBOT-R1-trained models exceed the performance of larger commercial models, including GPT-4o, on the spatial and movement reasoning tasks critical for low-level robot control.

The paper introduces ROBOT-R1 Bench, a comprehensive benchmark of 215 open-ended questions evaluating four key reasoning types: planning, high-level action reasoning, movement reasoning, and spatial reasoning. Extensive evaluations across multiple benchmarks demonstrate ROBOT-R1's effectiveness: a 28% improvement on embodied reasoning tasks, a 31% improvement on EmbodiedBench Manipulation, and substantial gains on spatial reasoning benchmarks. Crucially, the learned reasoning capabilities transfer to downstream tasks, real-world robot environments, and other robotic benchmarks, whereas SFT-trained models often degrade on out-of-distribution tasks.

The research also reveals that embodied reasoning patterns during RL training naturally evolve toward shorter, more focused reasoning traces, in contrast to the longer chains observed in mathematical reasoning domains. Models transition from summary-format reasoning to narrative-format responses that coherently connect reasoning components to task-relevant information. Ablation studies confirm the critical importance of the auxiliary tasks and the MCQA framework design, and the approach remains robust across different random seeds and compatible with cold-start scenarios using pre-trained CoT data.

Beyond embodied reasoning evaluation, ROBOT-R1 demonstrates practical impact on robot control, improving pick-and-place success rates on a real robot from 16.67% to 23.96%. The framework is computationally efficient, requiring only 12 hours of training on four A100 GPUs, making it accessible to research labs and smaller organizations. These findings suggest that RL-optimized embodied reasoning is a promising direction for developing practical robot intelligence that combines the reasoning capabilities of large models with task-specific optimization.
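
The MCQA reformulation described above lends itself to a small illustration. The sketch below shows one plausible way to discretize a continuous next-keypoint target into answer options; the helper names (`make_mcqa`, `jitter`) and the option layout are illustrative assumptions, not the paper's actual construction.

```python
import random

def make_mcqa(true_keypoint, distractor_fn, n_options=4):
    """Pose a continuous next-keypoint target, e.g. (x, y, z) in the robot's
    coordinate frame, as a multiple-choice question. `distractor_fn` is an
    assumed helper that produces plausible-but-wrong keypoints."""
    options = [true_keypoint] + [
        distractor_fn(true_keypoint) for _ in range(n_options - 1)
    ]
    random.shuffle(options)
    letters = "ABCDE"[:n_options]
    correct = letters[options.index(true_keypoint)]
    body = "\n".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    return body, correct

def jitter(keypoint, scale=0.05):
    """Assumed distractor generator: perturb each coordinate slightly."""
    return tuple(round(c + random.uniform(-scale, scale), 3) for c in keypoint)

# One training example: the model would see the scene image, the environment
# metadata, and these options; RL rewards exact match on the chosen letter.
question, answer = make_mcqa((0.42, -0.10, 0.25), distractor_fn=jitter)
print(question)
print("correct option:", answer)
```

An exact-match reward over a small discrete option set is denser and less noisy than a regression penalty in continuous space, which is consistent with the summary's claim that the MCQA reformulation makes learning more tractable.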

Authors

Dongyoung Kim (Primary)

KAIST

kingdy2002@kaist.ac.kr

Sumin Park

KAIST

Huiwon Jang

KAIST

Jinwoo Shin

KAIST

Jaehyung Kim

Yonsei University

Younggyo Seo

UC Berkeley

Abstract

Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce ROBOT-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. ROBOT-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, ROBOT-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate ROBOT-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities for the task. Our experiments show that models trained with ROBOT-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, ROBOT-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.

Key Findings (25)

  1. ROBOT-R1 achieves over 28% improvement in embodied reasoning tailored for low-level action control compared to SFT baselines
  2. A 7B parameter model trained with ROBOT-R1 outperforms GPT-4o on low-level control reasoning tasks, including spatial and movement reasoning
  3. ROBOT-R1 achieves a 31% improvement in task performance on the EmbodiedBench Manipulation benchmark
  4. On the SpatialRGPT benchmark, ROBOT-R1 achieves approximately 40% improvement on quantitative metrics and 60% on qualitative metrics
  5. The multiple-choice question answering (MCQA) format enables more efficient learning than prediction in a continuous action space
  6. Three auxiliary tasks (waypoint prediction, current state prediction, movement prediction) progressively improve overall performance
  7. Models trained with ROBOT-R1 show consistent performance improvements across different random seeds
  8. ROBOT-R1-trained models generalize better to out-of-distribution tasks than SFT approaches
  9. Reasoning patterns during ROBOT-R1 training show progressive shortening and refinement of responses rather than lengthening
  10. The GRPO algorithm, with its variance-reducing mechanisms, proves more effective than REINFORCE++ for embodied reasoning tasks
  11. ROBOT-R1 effectively transfers embodied reasoning capabilities to real-world robot environments despite training only in simulation
  12. Metadata extraction, including reference points, coordinate systems, and scale information, is critical for LVLM performance
  13. Explicit reasoning with <think></think> tags improves prediction accuracy and generalization
  14. Performance on planning tasks slightly decreases because training focuses primarily on next-keypoint prediction
  15. SFT models with CoT guidance perform better than direct SFT but still show limited improvement in embodied reasoning
  16. Models trained with ROBOT-R1 improve performance on VLABench M&T and Spatial tasks while potentially degrading on physics and complex reasoning tasks
  17. High-level action reasoning capability emerges from training exclusively on low-level control information, without explicit high-level supervision
  18. The ROBOT-R1 framework is efficient, requiring approximately 12 hours of training on four A100 GPUs for 5 epochs
  19. LLM-as-judge evaluation shows Pearson correlations near 0.9 for high-level action, movement, and spatial reasoning tasks
  20. ROBOT-R1 remains effective in cold-start scenarios where models are pre-trained with high-quality CoT SFT data before RL training
  21. Embodied reasoning abilities acquired through ROBOT-R1 transfer effectively to downstream robot control tasks
  22. On the LIBERO simulation benchmark, ROBOT-R1-trained models show remarkable improvements, particularly in Goal-category tasks
  23. Real-robot experiments show ROBOT-R1 improves the average pick-and-place success rate from 16.67% to 23.96%
  24. Group-level reward normalization in GRPO leads to more stable learning than batch-level normalization (see the sketch after this list)
  25. Reasoning response length naturally decreases during training as models transition from a summary to a narrative format
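
Findings 10 and 24 attribute ROBOT-R1's training stability to GRPO's group-level reward normalization. The sketch below is a minimal illustration of that mechanism under a simple assumed reward (exact match on the MCQA answer plus a <think></think> format check); the function names and reward design are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def correctness_reward(response: str, correct_option: str) -> float:
    """Assumed reward: 1 if the response reasons inside <think></think> tags
    and its final answer matches the MCQA ground truth, else 0."""
    if "<think>" not in response or "</think>" not in response:
        return 0.0
    answer = response.rsplit("</think>", 1)[-1].strip()
    return float(answer == correct_option)

def grpo_advantages(group_rewards):
    """GRPO-style advantage: normalize within the group of responses sampled
    for ONE prompt, so advantages are zero-mean per prompt regardless of how
    hard that prompt happens to be."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def batch_advantages(all_rewards):
    """Batch-level alternative (REINFORCE++-style): one mean/std across many
    prompts mixes easy and hard examples, which finding 24 reports as less
    stable for this task."""
    r = np.asarray(all_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four responses sampled for a single scene/question whose answer is "B":
group = [
    "<think>The target block is to the left; move the gripper there.</think> B",
    "<think>Lift straight up first.</think> C",
    "B",  # no <think> tags, so the format check fails
    "<think>The next keypoint sits just above the block.</think> B",
]
rewards = [correctness_reward(resp, "B") for resp in group]
print(grpo_advantages(rewards))  # [ 1. -1. -1.  1.]: correct responses reinforced
```

In the full method these advantages would weight a clipped policy-gradient loss over the response tokens, as in DeepSeek-R1; the sketch covers only the reward and normalization steps the findings single out.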

Discussion & Future Directions

The discussion reveals several important insights about ROBOT-R1's approach and effectiveness. The framework successfully addresses key limitations of SFT-based training methods by using reinforcement learning to optimize reasoning pathways specifically for embodied control tasks. The study demonstrates that explicit reasoning processes, when properly optimized through RL, enable models to develop more generalized embodied reasoning capabilities that transfer effectively to downstream tasks and even real-world robot environments. The training process naturally produces progressively shorter and more focused reasoning traces, diverging from trends observed in mathematical reasoning tasks, where longer chains emerge. The paper also highlights that auxiliary tasks for state prediction and movement estimation significantly enhance learning efficiency compared to end-to-end approaches.

Future work should explore extensions to include gripper manipulation and end-effector rotation, address differences in coordinate systems between simulation environments, and investigate complementary reward designs to ensure safe robot execution while maintaining human oversight. The work suggests that RL-optimized embodied reasoning could accelerate robotics research by making advanced reasoning capabilities accessible through smaller, more practical models.

References (59)

  1. [1]Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., & Hooker, S. (2024). Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740.
  2. [2]Anthropic. (2024). Claude 3.5 haiku. Anthropic Blog. https://www.anthropic.com/claude/haiku
  3. [3]Anthropic. (2024). Claude 3.5 sonnet. Anthropic Blog. https://www.anthropic.com/news/claude-3-5-sonnet
  4. [4]Anthropic. (2024). Claude 3.7 sonnet. Anthropic Blog. https://www.anthropic.com/news/claude-3-7-sonnet
  5. [5]Anthropic. (2024). Introducing the claude 3 family: State-of-the-art models in intelligence, speed, and vision. Anthropic Blog. https://www.anthropic.com/news/claude-3-family
  6. [6]Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. (2025). Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.
  7. [7]Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. (2024). π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
  8. [8]Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. (2022). Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.
  9. [9]Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. (2023). Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818.
  10. [10]Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., & Xia, F. (2024). Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14455–14465).
  11. [11]Cheng, A.-C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., & Liu, S. (2024). Spatialrgpt: Grounded spatial reasoning in vision language models. arXiv preprint arXiv:2406.01584.
  12. [12]Clark, J., Mirchandani, S., Sadigh, D., & Belkhale, S. (2025). Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729.
  13. [13]Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., & Batra, D. (2018). Embodied question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–10).
  14. [14]Driess, D., Xia, F., Sajjadi, S. M., Lynch, C., Chowdhery, A., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., et al. (2023). Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  15. [15]Gemini Team. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  16. [16]Gemini Team, Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. (2023). Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  17. [17]Gemini Team, Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  18. [18]Gemini Team, Hassabis, D., & Kavukcuoglu, K. (2024). Gemini 2.0: Google's new ai model for the agentic era. Google Blog. https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/
  19. [19]Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
  20. [20]Hu, J., Liu, J. K., Xu, H., & Shen, W. (2025). Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models. arXiv preprint arXiv:2501.03262.
  21. [21]Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., & Lin, S. (2025). Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.
  22. [22]Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., & Fei-Fei, L. (2023). Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973.
  23. [23]Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A. J., Welihinda, A., Hayes, A., Radford, A., et al. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.
  24. [24]Hu, Y., Lin, F., Zhang, T., Yi, L., & Gao, Y. (2023). Look before you leap: Unveiling the power of gpt-4v in robotic vision-language planning. arXiv preprint arXiv:2311.17842.
  25. [25]James, S., Ma, Z., Arrojo, D. R., & Davison, A. J. (2020). Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2), 3019–3026.
  26. [26]James, S., & Davison, A. J. (2022). Q-attention: Enabling efficient learning for vision-based robotic manipulation. IEEE Robotics and Automation Letters, 7(2), 1612–1619.
  27. [27]Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., & Han, J. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
  28. [28]Kimi Team, Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. (2025). Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
  29. [29]Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al. (2024). Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
  30. [30]Kim, S., Joo, S. J., Kim, D., Jang, J., Ye, S., Shin, J., & Seo, M. (2023). The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. arXiv preprint arXiv:2305.14045.
  31. [31]Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al. (2024). Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650.
  32. [32]Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Liu, H., & Gan, C. (2023). Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378.
  33. [33]Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., & Zeng, A. (2023). Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) (pp. 9493–9500). IEEE.
  34. [34]Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., & Stone, P. (2023). Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36, 44776–44791.
  35. [35]Lu, Z., Chai, Y., Guo, Y., Yin, X., Liu, L., Wang, H., Xiong, G., & Li, H. (2025). Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620.
  36. [36]Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., & Florence, P. (2023). Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters.
  37. [37]Mees, O., Hermann, L., Rosete-Beas, E., & Burgard, W. (2022). Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3), 7327–7334.
  38. [38]Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., & Hashimoto, T. (2025). s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.
  39. [39]OpenAI. (2024). Introducing gpt-4.1. OpenAI Blog. https://openai.com/index/gpt-4-1/
  40. [40]Sermanet, P., Ding, T., Zhao, J., Xia, F., Dwibedi, D., Gopalakrishnan, K., Chan, C., Dulac-Arnold, G., Maddineni, S., Joshi, N. J., et al. (2024). Robovqa: Multimodal long-horizon reasoning for robotics. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (pp. 645–652). IEEE.
  41. [41]Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., et al. (2024). Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
  42. [42]Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., et al. (2025). Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615.
  43. [43]Shentu, Y., Wu, P., Rajeswaran, A., & Abbeel, P. (2024). From llms to actions: latent codes as bridges in hierarchical robot control. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 8539–8546). IEEE.
  44. [44]Shi, L. X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al. (2025). Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417.
  45. [45]Wang, J., Shi, E., Hu, H., Ma, C., Liu, Y., Wang, X., Yao, Y., Liu, X., Ge, B., & Zhang, S. (2024). Large language models for robotics: Opportunities, challenges, and perspectives. Journal of Automation and Intelligence.
  46. [46]Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  47. [47]Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824–24837.
  48. [48]Wen, J., Zhu, M., Zhu, Y., Tang, Z., Li, J., Zhou, Z., Li, C., Liu, X., Peng, Y., Shen, C., et al. (2024). Diffusion-vla: Scaling robot foundation models via unified diffusion and autoregression. arXiv preprint arXiv:2412.03293.
  49. [49]Ye, S., Jang, J., Jeon, B., Joo, S., Yang, J., Peng, B., Mandlekar, A., Tan, R., Chao, Y.-W., Lin, B. Y., et al. (2024). Latent action pretraining from videos. arXiv preprint arXiv:2410.11758.
  50. [50]Yang, M., Du, Y., Ghasemipour, K., Tompson, J., Schuurmans, D., & Abbeel, P. (2023). Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114.
  51. [51]Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36, 11809–11822.
  52. [52]Yuan, L., Cui, G., Wang, H., Ding, N., Wang, X., Deng, J., Shan, B., Chen, H., Xie, R., Lin, Y., et al. (2024). Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078.
  53. [53]Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. (2025). Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
  54. [54]Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., & Levine, S. (2024). Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693.
  55. [55]Zeng, W., Huang, Y., Liu, Q., Liu, W., He, K., Ma, Z., & He, J. (2025). Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892.
  56. [56]Zhang, S., Xu, Z., Liu, P., Yu, X., Li, Y., Gao, Q., Fei, Z., Yin, Z., Wu, Z., & Jiang, Y.-G. (2025). Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 11142–11152).
  57. [57]Zhen, H., Qiu, X., Chen, P., Yang, J., Yan, X., Du, Y., Hong, Y., & Gan, C. (2024). 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631.
  58. [58]Zheng, K., Chen, X., Jenkins, O. C., & Wang, X. (2022). Vlmbench: A compositional benchmark for vision-and-language manipulation. Advances in Neural Information Processing Systems, 35, 665–678.
  59. [59]Zhou, Z., Zhu, Y., Zhu, M., Wen, J., Liu, N., Xu, Z., Meng, W., Cheng, R., Peng, Y., Shen, C., et al. (2025). Chatvla: Unified multimodal understanding and robot control with vision-language-action model. arXiv preprint arXiv:2502.14420.