Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

ICLR 2025, Singapore


1University of Illinois Chicago, 2Beijing Jiaotong University, 3University of Central Florida

Motivation: We introduce 3D intention grounding (right), a new task for detecting a target object with a 3D bounding box in a 3D scene, guided by a human intention sentence (e.g., "I want something to support my back to relieve the pressure"). In contrast, existing 3D visual grounding (left) relies on human reasoning and references for detection. The illustration highlights that observation and reasoning are performed manually by the human (left) but automated by the AI (right).

Abstract

Humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task for detecting target objects in RGB-D scans based on human intention, such as "I want something to support my back." The closely related task of 3D visual grounding instead focuses on understanding human references: to achieve detection based on human intention, it relies on a human to observe the scene, reason out the target that aligns with the intention ("pillow" in this case), and finally provide the AI with a reference such as "a pillow on the couch". In contrast, 3D intention grounding challenges AI agents to automatically observe, reason, and detect the desired target solely from the human intention. To tackle this challenge, we introduce the new Intent3D dataset, consisting of 44,990 intention texts associated with 209 fine-grained classes from 1,042 scenes of the ScanNet dataset. We also establish several baselines on our benchmark, built on different language-based 3D object detection models. Finally, we propose IntentNet, our approach designed to tackle this intention-based detection problem. It focuses on three key aspects: intention understanding, reasoning to identify object candidates, and cascaded adaptive learning that leverages the intrinsic priority logic of different losses for multi-objective optimization.

Method

Our benchmark dataset, Intent3D, is built through a structured pipeline: (1) Scene Graph Construction, where we organize scene information, including object categories and bounding boxes; (2) Object Selection, ensuring diverse, non-trivial, and unambiguous objects; (3) Text Generation, leveraging GPT-4 to produce intention descriptions without explicit object mentions, encouraging reasoning-based grounding; and (4) Data Cleaning, where we manually refine the generated texts to enhance quality and clarity. Intent3D provides a rich and challenging testbed for studying intention-driven object grounding in 3D environments.
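The illustrative sketch below walks through the four stages of this pipeline. All names (ObjectNode, select_targets, build_intention_prompt, query_gpt4, ...) are hypothetical placeholders, and the selection rules and prompt wording are assumptions for illustration, not the actual construction code or prompts used for Intent3D.

```python
# Hypothetical sketch of the Intent3D construction pipeline described above.
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    object_id: int
    category: str          # fine-grained class, e.g. "pillow"
    bbox: tuple            # axis-aligned box (cx, cy, cz, dx, dy, dz)

@dataclass
class SceneGraph:          # stage 1: scene graph construction
    scene_id: str
    objects: list = field(default_factory=list)

def select_targets(graph: SceneGraph, skip={"wall", "floor", "ceiling"}):
    """Stage 2 (illustrative criteria): drop structural classes so the
    remaining targets are non-trivial, unambiguous objects."""
    return [o for o in graph.objects if o.category not in skip]

def build_intention_prompt(category: str) -> str:
    """Stage 3: ask the LLM for an intention sentence that implies the
    object without naming it, encouraging reasoning-based grounding."""
    return (f"Write one first-person sentence describing a daily human intention "
            f"that a '{category}' would satisfy. Do not mention '{category}' "
            f"or any synonym of it.")

def generate_samples(graph: SceneGraph, query_gpt4):
    # query_gpt4 is a stand-in for the actual GPT-4 call; stage 4 (manual
    # data cleaning) happens after this automatic generation step.
    samples = []
    for obj in select_targets(graph):
        text = query_gpt4(build_intention_prompt(obj.category))
        samples.append({"scene_id": graph.scene_id,
                        "intention": text,
                        "target_class": obj.category,
                        "target_boxes": [obj.bbox]})
    return samples
```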

We propose IntentNet to address 3D-IG, which requires 3D perception, intention understanding, and joint supervision. Our model extracts multimodal features with PointNet++ for the point cloud, RoBERTa for the text, and GroupFree for 3D object detection. An attention-based encoder fuses these features, and a decoder then refines object queries for intention comprehension and box prediction. To strengthen 3D understanding, we introduce Candidate Box Matching, which aligns detected candidate boxes with the intention. Verb-Object Alignment ensures the model captures intention semantics through contrastive learning. Finally, Cascaded Adaptive Learning structures the loss functions in a logical sequence, improving multimodal reasoning and performance.
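The following is a minimal structural sketch of this pipeline, not the authors' implementation: the backbones are replaced by simple projections so the code runs standalone, and all dimensions and module names are assumptions. It only shows where the candidate-matching, alignment, and box heads sit in the architecture.

```python
# A minimal, hypothetical sketch of the IntentNet pipeline described above.
import torch
import torch.nn as nn

class IntentNetSketch(nn.Module):
    def __init__(self, d_model=288, num_queries=256, num_decoder_layers=6):
        super().__init__()
        # Stand-ins for the real backbones (PointNet++ / GroupFree for points,
        # RoBERTa for text); simple layers keep the sketch self-contained.
        self.point_backbone = nn.Linear(6, d_model)        # xyz + rgb per seed point
        self.text_backbone = nn.Embedding(50265, d_model)  # RoBERTa vocabulary size
        # Attention-based encoder fusing point and text tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        # Decoder refining object queries for intention comprehension and boxes.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=num_decoder_layers)
        self.queries = nn.Embedding(num_queries, d_model)
        self.box_head = nn.Linear(d_model, 6)      # (cx, cy, cz, dx, dy, dz)
        self.match_head = nn.Linear(d_model, 1)    # candidate-box matching score
        self.align_proj = nn.Linear(d_model, d_model)  # features for verb-object alignment

    def forward(self, points, text_ids):
        # points: (B, N, 6) seed features; text_ids: (B, L) token ids.
        pts = self.point_backbone(points)
        txt = self.text_backbone(text_ids)
        fused = self.fusion_encoder(torch.cat([pts, txt], dim=1))
        q = self.queries.weight.unsqueeze(0).expand(points.size(0), -1, -1)
        q = self.decoder(q, fused)
        return {
            "boxes": self.box_head(q),           # box regression per query
            "match_logits": self.match_head(q),  # which candidates match the intention
            "query_embed": self.align_proj(q),   # used for contrastive verb-object alignment
        }
```

In the actual method, these heads are supervised by losses ordered according to the cascaded adaptive learning scheme; the sketch only exposes the outputs each loss would attach to.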

Results

Our IntentNet achieves SOTA performance on Intent3D, outperforming prior methods by explicitly modeling intention language comprehension and reasoning over candidate boxes with cascaded optimization. On the validation set, it improves Top1-Acc@0.25 and Top1-Acc@0.5 by 11.22% and 8.05%, respectively, while boosting AP@0.25 and AP@0.5 by 9.12% and 5.43%. Similar gains are observed on the test set. Expert models, originally designed for referential language, struggle with intention language since they primarily align with nouns rather than verb-object relations, leading to inferior performance. Foundation models like 3D-VisTA benefit from broad multimodal pretraining but fall short due to their reliance on imperfect detector outputs, whereas IntentNet performs reasoning over candidate boxes, achieving better results despite using a less powerful detector. LLM-based models, such as Chat-3D-v2, perform the worst, as LLMs generally struggle with 3D-VG, and 3D-IG is even more complex. Their hallucination issues significantly impact AP metrics, though their pretrained detector and strong categorical reasoning help maintain decent Top1-Acc.
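For readers unfamiliar with the metrics quoted above, below is a minimal sketch of Top1-Acc@IoU for axis-aligned 3D boxes: the top-scoring predicted box counts as correct if its IoU with any ground-truth box reaches the threshold. The (cx, cy, cz, dx, dy, dz) box format and function names are assumptions, not the benchmark's evaluation code.

```python
import numpy as np

def box_to_corners(box):
    # Convert (cx, cy, cz, dx, dy, dz) to (min corner, max corner).
    c, d = np.asarray(box[:3], float), np.asarray(box[3:6], float)
    return c - d / 2.0, c + d / 2.0

def iou_3d(box_a, box_b):
    amin, amax = box_to_corners(box_a)
    bmin, bmax = box_to_corners(box_b)
    inter = np.prod(np.clip(np.minimum(amax, bmax) - np.maximum(amin, bmin), 0, None))
    union = np.prod(amax - amin) + np.prod(bmax - bmin) - inter
    return inter / union if union > 0 else 0.0

def top1_acc(predictions, ground_truths, thr=0.25):
    """predictions: list of (boxes, scores) per sample; ground_truths: list of GT box lists."""
    hits = 0
    for (boxes, scores), gts in zip(predictions, ground_truths):
        best = boxes[int(np.argmax(scores))]          # top-1 scoring box
        if any(iou_3d(best, gt) >= thr for gt in gts):
            hits += 1
    return hits / max(len(predictions), 1)
```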

BibTeX


      @article{kang2024intent3d,
        title={Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention},
        author={Kang, Weitai and Qu, Mengxue and Kini, Jyoti and Wei, Yunchao and Shah, Mubarak and Yan, Yan},
        journal={arXiv preprint arXiv:2405.18295},
        year={2024}
      }