Weitai Kang

I am a fourth-year Ph.D. student working with Prof. Yan Yan in Computer Science at the University of Illinois Chicago, expecting to graduate in 2027.

I am advancing the frontier of Multimodal Fine-Grained Understanding across image, GUI, 3D, and video domains. To achieve this, I focus on building Multimodal Large Language Models (Robin3D) with optimal paradigm design (ExpVG) and training strategies (GuirlVG). I explore how to scale higher-quality data (Robin3D), propose stronger supervision signals (AttBalance, SegVG), and establish better benchmarks (Intent3D). I further work on improving overall system efficiency (ACTRESS, 3DResT, INTP-Video-LLM), empowering AI agents (InfantAgent-Next), and making their decision-making mechanisms more interpretable (SaCo, TokenTM).

I have interned at Adobe, SonyAI, Tencent, and SenseTime. I have also been a Visiting Scholar at the University of Central Florida, working with Prof. Mubarak Shah. Before starting my Ph.D., I received my bachelor's degree in Mathematics from Sun Yat-sen University in 2022, where I was awarded the Outstanding Student Scholarship each year.

Email / CV / Google Scholar / LinkedIn / GitHub / Twitter / Hi~

News

[08/2025] My (co-)first-author paper, InfantAgent-Next, is accepted to NeurIPS 2025!!!
[08/2025] Our paper, 3DResT, for Semi-Supervised 3D Visual Grounding is accepted to IEEE Transactions on Multimedia!!!
[08/2025] I presented Robin3D at the Salesforce AI Research Future Forum at Salesforce Tower, SF on Aug. 14th.
[08/2025] My first-author paper, ExpVG, for Visual Grounding design in MLLM is now available on arXiv.
[08/2025] My first-author paper, GuirlVG, for GUI Visual Grounding is now available on arXiv.
[06/2025] My first-author paper, AttBalance, is accepted to ACM MM 2025!!!
[06/2025] My first-author paper, Robin3D, is accepted to ICCV 2025!!!
[05/2025] I was interviewed by DeepTech (MIT Technology Review China) about our work on InfantAgent-Next.
[05/2025] My co-first-author paper, InfantAgent-Next, for AI Agent is now available on arXiv and GitHub.
[04/2025] Our paper, 3DResT, for Semi-Supervised 3D RES is now available on arXiv.
[01/2025] My first-author paper, Intent3D, is accepted to ICLR 2025!!!
[01/2025] I transferred to the University of Illinois Chicago as a Ph.D. student, following my advisor, Prof. Yan Yan.
[11/2024] Our paper, Infant Agent, for AI Agent is now available on arXiv.
[10/2024] My first-author paper, Robin3D, for 3D LLM is now available on arXiv.
[09/2024] Our paper, INTP-Video-LLM, for Video LLM is now available on arXiv.
[07/2024] My first-author paper, SegVG, is accepted to ECCV 2024!!! The code is now open-sourced.
[04/2024] Our paper, SaCo, for Transformer Explainability is accepted to CVPR 2024.
[03/2024] Our paper, TokenTM, for Transformer Explainability is accepted to CVPR 2024.
[02/2024] My first-author paper, Intent3D, for 3D Intention Grounding is now available on arXiv.
[10/2023] My first-author paper, ACTRESS, for Visual Grounding is now available on arXiv.
[08/2023] I am a Teaching Assistant for CS 577: Deep Learning at Illinois Institute of Technology.
[04/2023] My first-author paper, SegVG, for Visual Grounding is now available on arXiv.
[01/2023] My first-author paper, AttBalance, for Visual Grounding constraint is now available on arXiv.
[08/2022] I joined Prof. Yan Yan's group as a Ph.D. student.

Publications

	ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu PDF
	GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, Yan Yan PDF
	Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan ICCV, 2025 PDF / Code
	Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan ICLR, 2025 Project Page / PDF / Code
	InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding * Equal contribution NeurIPS, 2025 PDF / Code
	SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding Weitai Kang, Gaowen Liu, Mubarak Shah, Yan Yan ECCV, 2024 PDF / Code
	AttBalance: Visual Grounding with Attention-Driven Constraint Balancing Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan ACM MM, 2025 PDF
	Interpolating Video-LLMs: Toward Longer-sequence LMMs in a Training-free Manner Yuzhang Shang, Bingxin Xu, Weitai Kang, Mu Cai, Yuheng Li, Zehao Wen, Zhen Dong, Kurt Keutzer, Yong Jae Lee, Yan Yan PDF
	ACTRESS: Active Retraining for Semi-supervised Visual Grounding Weitai Kang, Mengxue Qu, Yunchao Wei, Yan Yan PDF
	3DResT: A Strong Baseline for Semi-Supervised 3D Referring Expression Segmentation Wenxin Chen, Mengxue Qu, Weitai Kang, Yan Yan, Yao Zhao, Yunchao Wei IEEE Transactions on Multimedia PDF
	Infant Agent: A Tool-Integrated, Logic-Driven Agent with Cost-Effective API Usage Bin Lei, Yuchen Li, Yiming Zeng, Tao Ren, Yi Luo, Tianyu Shi, Zitian Gao, Zeyu Hu, Weitai Kang, Qiuwu Chen PDF
	On the Faithfulness of Vision Transformer Explanations Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, Yan Yan CVPR, 2024 PDF
	Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan CVPR, 2024 PDF

Work Experience

	Adobe · Research Internship Research on Large Multimodal Models. May 2025 - Aug. 2025, San Jose, California, United States · On-site; Aug. 2025 - Dec. 2025, Chicago, Illinois, United States · Remote
	SonyAI · Research Internship Research on 2D Large Multimodal Models. Oct. 2024 - Dec. 2024, Chicago, Illinois, United States · Remote
	Tencent · Machine Learning Engineer Internship Work on Human Pose Detection. Oct. 2021 - Jul. 2022, Shenzhen, Guangdong, China · On-site
	SenseTime · Research Internship Research on Video Super-Resolution. Jul. 2021 - Sep. 2021, Shenzhen, Guangdong, China · On-site

You can also reach me through WeChat: Victor_Hong_