The first CVPR workshop on

3D Vision Language Models (VLMs) for Robotic Manipulation: Opportunities and Challenges

June 11, 2025, Nashville, TN. Location: 101 A

Introduction

3D Vision Language Models (3D VLMs) open a new frontier for robotics, blending spatial understanding with contextual reasoning. The Robo-3DVLM workshop explores the opportunities and challenges of integrating these technologies to enhance robot perception, decision-making, and interaction with the real world. As robots operate in increasingly complex environments, bridging the gap between 3D spatial reasoning and language understanding becomes critical.

The workshop aims to drive conversations around the utility of 3D representations in robotic vision, the role of language in perception, and the limitations imposed by current data and hardware constraints. Through invited talks and interactive sessions, we aim to unite researchers from diverse disciplines to push the boundaries of multimodal learning in robotics, setting the stage for the next generation of intelligent systems.

Call for Papers

We are excited to announce the Call for Papers for the Robo-3DVLM workshop. We invite original contributions presenting novel ideas, research, and applications relevant to the workshop’s theme.

Important Dates

Event Date
Call for Papers January 30th, 2025
Submission Deadline May 16th, 2025, 23:59 PST
Notification May 20th, 2025
Camera-Ready May 25th, 2025

Submission Guidelines

Paper topics

A non-exhaustive list of relevant topics:

Workshop Schedule (Tentative)

Start Time (CDT) End Time (CDT) Event
9:00 AM 9:10 AM Opening remarks
9:10 AM 9:45 AM Hao Su
Exploring World Model for Robotic Manipulation
9:45 AM 10:20 AM Chelsea Finn
Pretraining and Posttraining Robotic Foundation Models
10:20 AM 10:55 AM Ranjay Krishna
Preparing perception for robotics
10:55 AM 11:10 AM Coffee Break
11:10 AM 11:45 AM Yunzhu Li
Foundation Models for Structured Scene Modeling in Robotic Manipulation
11:45 AM 12:20 PM Katerina Fragkiadaki
3D Generative Manipulation Policies: Bridging 2D Pre-training with 3D Scene Reasoning
12:20 PM 1:30 PM Lunch
1:30 PM 2:00 PM Poster Session (ExHall D, #357-#371)
2:00 PM 2:35 PM Angel Chang
Building vision-language maps for embodied AI
2:35 PM 3:10 PM Dieter Fox
Hierarchical Action Models for Open-World 3D Policies
3:10 PM 3:25 PM Coffee Break
3:25 PM 4:00 PM Chuang Gan
Genesis: A Unified and Generative Physics Simulation for Robotics
4:00 PM 4:45 PM Spotlight Paper Talks (5 min talk / 2 min Q&A)
• The One RING: A Robotic Indoor Navigation Generalist
• Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models
• Agentic Language-Grounded Adaptive Robotic Assembly
• ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos
4:45 PM 5:00 PM Ending Remarks and Paper Awards
For inquiries, contact us at: robo-3dvlm@googlegroups.com