Education

University of Science and Technology Beijing, Sept. 2016 - Jun. 2020, B.S., School of Automation.
University of Science and Technology Beijing, Sept. 2020 - Jun. 2023, M.S., School of Automation.
Institute of Automation, Chinese Academy of Sciences, Sept. 2023 - Jun. 2026, Ph.D., Zidongtaichu Foundation Model Research Center.
Beijing Academy of Artificial Intelligence, Sept. 2023 - Jun. 2026, Ph.D., Multimodal Large Model Research Center.
              
            
Recent Projects

* indicates equal contribution

Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui*, Honghao Chen*, Haoge Deng*, Xu Huang*, Xinghang Li*, Jirong Liu*, Yang Liu*, Zhuoyan Luo*, Jinsheng Wang*, Wenxuan Wang*, Yueze Wang*, Chengyuan Wang*, Fan Zhang*, Yingli Zhao*, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang
arXiv, 2025
[Paper] [Page] [Code]
a large-scale multimodal world model that natively predicts the next state across vision and language

First-author Publications

* indicates equal contribution

End-to-End Vision Tokenizer Tuning
Wenxuan Wang*, Fan Zhang*, Yufeng Cui*, Haiwen Diao*, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
NeurIPS, 2025
[Paper]
an end-to-end vision tokenizer tuning approach that enables joint optimization of vision tokenization and target autoregressive tasks
            
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
Jing Liu*, Wenxuan Wang*, Yisi Zhang, Yepeng Tang, Xingjian He, Longteng Guo, Tongtian Yue, Xinlong Wang
arXiv, 2025
[Paper]
takes a step towards unifying referring expression segmentation (RES) across visual target granularities
            
Image Difference Grounding with Natural Language
Wenxuan Wang*, Zijia Zhao*, Yisi Zhang*, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu
arXiv, 2025
[Paper]
pushes towards precisely localizing visual differences based on user instructions
            
Diffusion Feedback Helps CLIP See Better
Wenxuan Wang*, Quan Sun*, Fan Zhang, Yepeng Tang, Jing Liu, Xinlong Wang
ICLR, 2025
[Paper] [Page] [Code]
leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images without paired text
            
Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions
Wenxuan Wang*, Yisi Zhang*, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu
ACL, 2024 (Findings)
[Paper]
takes a step towards intention-driven vision-language understanding, extending classic visual grounding to human intention interpretation
            
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
Wenxuan Wang*, Tongtian Yue*, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu
CVPR, 2024
[Paper] [Page] [Code]
extends referring expression segmentation to the finer-grained part level
            
CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
Wenxuan Wang, Jing Liu, Xingjian He, Yisi Zhang, Chen Chen, Jiachen Shen, Yan Zhang, Jiangyun Li
IEEE-TMM, 2024
[Paper]
a new cross-modality masked self-distillation framework for the referring image segmentation task

Co-author Publications

* indicates equal contribution

            
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao*, Xiaotong Li*, Yufeng Cui*, Yueze Wang*, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
ICCV, 2025 (Highlight)
[Paper] [Code]
encoder-free vision-language models
                Unified Vision-Language-Action Model
                 
                Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, Zhaoxiang Zhang
                 
                arXiv, 2025
                 
                [paper] [Page] [Code]
                 
                 unified vision-language-action model for embodied intelligence  
             | 
             
            
        
            
            | 
Honors and Awards

2025 National Scholarship (Ph.D.)
2022 National Scholarship (M.S.)

© Wenxuan Wang | Last updated: November 3, 2025