Wenxuan Wang (王文轩)

I am currently a third-year PhD student at the Institute of Automation, Chinese Academy of Sciences, jointly trained with the Beijing Academy of Artificial Intelligence and co-supervised by Prof. Jing Liu and Dr. Xinlong Wang. My research interests span Foundation Models, Native Multimodal Models, Generative Models, and Visual Grounding.

Email  /  Google Scholar  /  Github  /  Curriculum Vitae

Education

  • University of Science and Technology Beijing, Sept. 2016 - Jun. 2020, B.S. in the School of Automation.
  • University of Science and Technology Beijing, Sept. 2020 - Jun. 2023, M.S. in the School of Automation.
  • Institute of Automation, Chinese Academy of Sciences, Sept. 2023 - Jun. 2026, Ph.D. in the Zidongtaichu Foundation Model Research Center.
  • Beijing Academy of Artificial Intelligence, Sept. 2023 - Jun. 2026, Ph.D. in the Multimodal Large Model Research Center.
Recent Projects

    * indicates equal contribution

    Emu3.5: Native Multimodal Models are World Learners
    Yufeng Cui*, Honghao Chen*, Haoge Deng*, Xu Huang*, Xinghang Li*, Jirong Liu*, Yang Liu*, Zhuoyan Luo*, Jinsheng Wang*, Wenxuan Wang*, Yueze Wang*, Chengyuan Wang*, Fan Zhang*, Yingli Zhao*, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
    arXiv, 2025
    [paper] [Page] [Code]

    a large-scale multimodal world model that natively predicts the next state across vision and language

First-author Publications

    * indicates equal contribution

    End-to-End Vision Tokenizer Tuning
    Wenxuan Wang*, Fan Zhang*, Yufeng Cui*, Haiwen Diao*, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
    NeurIPS, 2025
    [paper]

    an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks

    Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
    Jing Liu*, Wenxuan Wang*, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
    arXiv, 2025
    [paper]

    unifies the referring expression segmentation (RES) task across omni-level visual target granularities

    Image Difference Grounding with Natural Language
    Wenxuan Wang*, Zijia Zhao*, Yisi Zhang*, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu
    arXiv, 2025
    [paper]

    pushes towards precisely localizing visual differences between images based on user instructions

    Diffusion Feedback Helps CLIP See Better
    Wenxuan Wang*, Quan Sun*, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
    ICLR, 2025
    [paper] [Page] [Code]

    leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images without paired text

    Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
    Wenxuan Wang*, Yisi Zhang*, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu
    ACL, 2024 (Findings)
    [paper]

    moves beyond literal descriptions towards intention-driven vision-language understanding, extending classic visual grounding to human intention interpretation

    Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
    Wenxuan Wang*, Tongtian Yue*, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu
    CVPR, 2024
    [paper] [Page] [Code]

    extends referring expression segmentation to the finer-grained part level

    CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
    Wenxuan Wang, Jing Liu, Xingjian He, Yisi Zhang, Chen Chen, Jiachen Shen, Yan Zhang, Jiangyun Li
    IEEE-TMM, 2024
    [paper]

    a new cross-modality masked self-distillation framework for the referring image segmentation task

Co-author Publications

    * indicates equal contribution

    EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
    Haiwen Diao*, Xiaotong Li*, Yufeng Cui*, Yueze Wang*, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
    ICCV, 2025 (highlight)
    [paper] [Code]

    improved baselines for encoder-free vision-language models

    Unified Vision-Language-Action Model
    Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
    arXiv, 2025
    [paper] [Page] [Code]

    unified vision-language-action model for embodied intelligence

Honors and Awards

  • 2025 National Scholarship (Ph.D.)
  • 2022 National Scholarship (Master)



    © Wenxuan Wang | Last updated: November 3, 2025