Weitai Kang
I am a fourth-year Ph.D. student in Computer Science at the University of Illinois Chicago, advised by Prof. Yan Yan, and I expect to graduate in 2027.
I am advancing the frontier of Multimodal Fine-Grained Understanding across image, GUI, 3D, and video domains. To achieve this, I focus on building Multimodal Large Language Models (Robin3D) with optimal paradigm designs (ExpVG) and training strategies (GuirlVG). I explore how to scale up higher-quality data (Robin3D), design stronger supervision signals (AttBalance, SegVG), and establish better benchmarks (Intent3D). I also work on improving annotation and training efficiency (ACTRESS, 3DResT, INTP-Video-LLM), empowering AI agents (InfantAgent-Next), and making the decision-making of vision transformers more interpretable (SaCo, TokenTM).
I have interned at Adobe, SonyAI, Tencent, and SenseTime, and I have been a Visiting Scholar at the University of Central Florida, working with Prof. Mubarak Shah. Before starting my Ph.D., I received my bachelor's degree in Mathematics from Sun Yat-sen University in 2022, where I was awarded the Outstanding Student Scholarship every year.
Email / CV / Google Scholar / LinkedIn / GitHub / Twitter / Album
News
- [08/2025] Our paper, 3DResT, for Semi-Supervised 3D Referring Expression Segmentation is accepted to IEEE Transactions on Multimedia!!!
- [08/2025] I presented Robin3D at the Salesforce AI Research Future Forum at Salesforce Tower, SF on Aug. 14th.
- [08/2025] My first-author paper, ExpVG, for Visual Grounding design in MLLMs is now available on arXiv.
- [08/2025] My first-author paper, GuirlVG, for GUI Visual Grounding is now available on arXiv.
- [06/2025] My first-author paper, AttBalance, is accepted to ACM MM 2025!!!
- [06/2025] My first-author paper, Robin3D, is accepted to ICCV 2025!!!
- [05/2025] I was interviewed by DeepTech (MIT Technology Review China) about our InfantAgent-Next.
- [05/2025] My co-first-author paper, InfantAgent-Next, for AI Agents is now available on arXiv and GitHub.
- [04/2025] Our paper, 3DResT, for Semi-Supervised 3D Referring Expression Segmentation (RES) is now available on arXiv.
- [01/2025] My first-author paper, Intent3D, is accepted to ICLR 2025!!!
- [01/2025] I transfer to the University of Illinois Chicago as a Ph.D. student, following my advisor, Prof. Yan Yan.
- [11/2024] Our paper, Infant Agent, for AI Agents is now available on arXiv.
- [10/2024] My first-author paper, Robin3D, for 3D LLM is now available on arXiv.
- [09/2024] Our paper, INTP-Video-LLM, for Video LLM is now available on arXiv.
- [07/2024] My first-author paper, SegVG, is accepted to ECCV 2024!!! The code is now open-sourced.
- [04/2024] Our paper, SaCo, for Transformer Explainability is accepted to CVPR 2024.
- [03/2024] Our paper, TokenTM, for Transformer Explainability is accepted to CVPR 2024.
- [02/2024] My first-author paper, Intent3D, for 3D Intention Grounding is now available on arXiv.
- [10/2023] My first-author paper, ACTRESS, for Visual Grounding is now available on arXiv.
- [08/2023] I am a Teaching Assistant for CS 577: Deep Learning at the Illinois Institute of Technology.
- [04/2023] My first-author paper, SegVG, for Visual Grounding is now available on arXiv.
- [01/2023] My first-author paper, AttBalance, for attention-driven constraint balancing in Visual Grounding is now available on arXiv.
- [08/2022] I join Prof. Yan Yan's group as a Ph.D. student.
ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu
PDF
GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, Yan Yan
PDF
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
ICCV, 2025
PDF / Code
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention
Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan
ICLR, 2025
Project Page / PDF / Code
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Bin Lei*, Weitai Kang*, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
* Equal contribution
PDF / Code
SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding
Weitai Kang, Gaowen Liu, Mubarak Shah, Yan Yan
ECCV, 2024
PDF / Code
AttBalance: Visual Grounding with Attention-Driven Constraint Balancing
Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan
ACM MM, 2025
PDF
Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner
Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan
PDF
ACTRESS: Active Retraining for Semi-supervised Visual Grounding
Weitai Kang, Mengxue Qu, Yunchao Wei, Yan Yan
PDF
3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation
Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei
IEEE Transactions on Multimedia
PDF
Infant Agent: A Tool-Integrated, Logic-Driven Agent with Cost-Effective API Usage
Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, Qiuwu Chen
PDF
On the Faithfulness of Vision Transformer Explanations
Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, Yan Yan
CVPR, 2024
PDF
Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer
Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan
CVPR, 2024
PDF
Adobe · Research Internship
Research on Large Multimodal Models.
May 2025 - Aug. 2025, San Jose, California, United States · On-site
Aug. 2025 - Dec. 2025, Chicago, Illinois, United States · Remote
SonyAI · Research Internship
Research on 2D Large Multimodal Models.
Oct. 2024 - Dec. 2024, Chicago, Illinois, United States · Remote
Tencent · Machine Learning Engineer Internship
Work on Human Pose Detection.
Oct. 2021 - Jul. 2022, Shenzhen, Guangdong, China · On-site
SenseTime · Research Internship
Research on Video Super-Resolution.
Jul. 2021 - Sep. 2021, Shenzhen, Guangdong, China · On-site
You can also reach me through WeChat: Victor_Hong_