The first CVPR workshop on

3D Vision Language Models (VLMs) for Robotic Manipulation: Opportunities and Challenges

June 11, 2025, Nashville, TN. Location: 101 A

Introduction

3D Vision Language Models (3D VLMs) open a new frontier for robotics, blending spatial understanding with contextual reasoning. The Robo-3DVLM workshop explores the opportunities and challenges of integrating these technologies to enhance robot perception, decision-making, and interaction with the real world. As robots operate in increasingly complex environments, bridging the gap between 3D spatial reasoning and language understanding becomes critical.

The workshop aims to drive conversations around the utility of 3D representations in robotic vision, the role of language in perception, and the limitations imposed by current data and hardware constraints. Through invited talks and interactive sessions, we aim to unite researchers from diverse disciplines to push the boundaries of multimodal learning in robotics, setting the stage for the next generation of intelligent systems.

Call for Papers

We are excited to announce the Call for Papers for the Robo-3DVLM workshop. We invite original contributions presenting novel ideas, research, and applications relevant to the workshop’s theme.

Important Dates

Event Date
Call for Papers January 30th, 2025
Submission Deadline May 16th, 2025, 23:59 PST
Notification May 20th, 2025
Camera-Ready May 25th, 2025

Submission Guidelines

Paper topics

A non-exhaustive list of relevant topics:

Workshop Schedule (Tentative)

Start Time (CDT) End Time (CDT) Event
9:00 AM 9:10 AM Opening remarks
9:10 AM 9:45 AM Hao Su
Exploring World Model for Robotic Manipulation
9:45 AM 10:20 AM Chelsea Finn
Pretraining and Posttraining Robotic Foundation Models
10:20 AM 10:55 AM Ranjay Krishna
Preparing perception for robotics
10:55 AM 11:10 AM Coffee Break
11:10 AM 11:45 AM Yunzhu Li
Foundation Models for Structured Scene Modeling in Robotic Manipulation
11:45 AM 12:20 PM Katerina Fragkiadaki
3D Generative Manipulation Policies: Bridging 2D Pre-training with 3D Scene Reasoning
12:20 PM 1:30 PM Lunch
1:30 PM 2:00 PM Poster Session (ExHall D, #357-#371)
2:00 PM 2:35 PM Angel Chang
Building vision-language maps for embodied AI
2:35 PM 3:10 PM Dieter Fox
Hierarchical Action Models for Open-World 3D Policies
3:10 PM 3:25 PM Coffee Break
3:25 PM 4:00 PM Chuang Gan
Genesis: A Unified and Generative Physics Simulation for Robotics
4:00 PM 4:45 PM Spotlight Paper Talks (5 min talk / 2 min Q&A)
• The One RING: A Robotic Indoor Navigation Generalist
• Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
• Agentic Language-Grounded Adaptive Robotic Assembly
• ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
4:45 PM 5:00 PM Ending Remarks and Paper Awards
For inquiries, contact us at: robo-3dvlm@googlegroups.com