|
Education
University of Science and Technology Beijing, Sept. 2016 - Jun. 2020, B.S., School of Automation.
University of Science and Technology Beijing, Sept. 2020 - Jun. 2023, M.S., School of Automation.
Institute of Automation, Chinese Academy of Sciences, Sept. 2023 - Jun. 2026, Ph.D. student, Zidongtaichu Foundation Model Research Center.
Beijing Academy of Artificial Intelligence, Sept. 2023 - Jun. 2026, Ph.D. student, Multimodal Large Model Research Center.
|
|
Recent Projects
* indicates equal contribution
|
|
Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui*, Honghao Chen*, Haoge Deng*, Xu Huang*, Xinghang Li*, Jirong Liu*, Yang Liu*, Zhuoyan Luo*, Jinsheng Wang*, Wenxuan Wang*, Yueze Wang*, Chengyuan Wang*, Fan Zhang*, Yingli Zhao*, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
arXiv, 2025
[Paper] [Page] [Code]
a large-scale multimodal world model that natively predicts the next state across vision and language
|
|
First-author Publications
* indicates equal contribution
|
|
End-to-End Vision Tokenizer Tuning
Wenxuan Wang*, Fan Zhang*, Yufeng Cui*, Haiwen Diao*, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
NeurIPS, 2025
[Paper]
an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks
|
|
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu*, Wenxuan Wang*, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
arXiv, 2025
[Paper]
takes a step further towards a unified referring expression segmentation (RES) task across omni-level visual target granularities
|
|
Image Difference Grounding with Natural Language
Wenxuan Wang*, Zijia Zhao*, Yisi Zhang*, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu
arXiv, 2025
[Paper]
pushes towards precisely localizing visual differences between images based on user instructions
|
|
Diffusion Feedback Helps CLIP See Better
Wenxuan Wang*, Quan Sun*, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
ICLR, 2025
[Paper] [Page] [Code]
leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using images alone without corresponding text
|
|
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang*, Yisi Zhang*, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu
ACL, 2024 (Findings)
[Paper]
takes a step further towards intention-driven vision-language understanding, advancing classic visual grounding towards human intention interpretation
|
|
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
Wenxuan Wang*, Tongtian Yue*, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu
CVPR, 2024
[Paper] [Page] [Code]
takes a step further towards the finer-grained, part-level referring expression segmentation task
|
|
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
Wenxuan Wang, Jing Liu, Xingjian He, Yisi Zhang, Chen Chen, Jiachen Shen, Yan Zhang, Jiangyun Li
IEEE-TMM, 2024
[Paper]
a new cross-modality masked self-distillation framework for the referring image segmentation task
|
|
Co-author Publications
* indicates equal contribution
|
|
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao*, Xiaotong Li*, Yufeng Cui*, Yueze Wang*, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
ICCV, 2025 (Highlight)
[Paper] [Code]
encoder-free vision-language models
|
|
Unified Vision-Language-Action Model
Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
arXiv, 2025
[Paper] [Page] [Code]
a unified vision-language-action model for embodied intelligence
|
|
Honors and Awards
2025 National Scholarship (Ph.D.)
2022 National Scholarship (Master's)
|
© Wenxuan Wang | Last updated: November 3, 2025
|