Diffusion Feedback Helps CLIP See Better

Wenxuan Wang*1,2,3,   Quan Sun*3,   Fan Zhang3,   Yepeng Tang4,   Jing Liu1,2,   Xinlong Wang3
1Institute of Automation, Chinese Academy of Sciences (CASIA)   2School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS)   3Beijing Academy of Artificial Intelligence (BAAI)   4Beijing Jiaotong University (BJTU)   *Equal contribution

Abstract


Contrastive Language-Image Pre-training (CLIP) has become integral to various multimodal understanding tasks, serving as a backbone for models across multiple domains. However, recent studies reveal that CLIP struggles to distinguish visual differences between similar images, which limits the perceptual capabilities of multimodal large language models (MLLMs) that rely on it. This work addresses these limitations by enhancing CLIP's ability to discern fine-grained visual details. Given the high cost of training CLIP from scratch, we explore the more feasible approach of fine-tuning it. Directly collecting image-text pairs for fine-tuning is not only expensive but also constrained by data quality, and it can degrade CLIP's zero-shot performance. To mitigate these challenges, we propose a DIffusion-based self-supervised framework that serves as a Visual Assistant for CLIP, named DIVA. Our method leverages generative feedback from text-to-image diffusion models to refine CLIP representations, using only images (without corresponding text): dense visual features from CLIP serve as conditioning inputs for the diffusion process. We demonstrate that DIVA substantially improves CLIP's performance (by roughly 3-7%) on the MMVP-VLM benchmark, which assesses fine-grained visual abilities, and enhances the performance of MLLMs and vision models on multimodal understanding and semantic segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP's strong zero-shot capabilities.
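
As a minimal illustration of this image-only feedback loop, the sketch below conditions a frozen Stable Diffusion 1.5 UNet (via HuggingFace diffusers) on the dense visual tokens of CLIP ViT-L/14 (via transformers) and backpropagates the denoising loss into CLIP. This is a sketch under stated assumptions, not the released DIVA implementation: the projection layer, input shapes, and hyperparameters are illustrative.

# Minimal sketch (not the authors' code): fine-tune CLIP with diffusion feedback.
# Assumes Stable Diffusion 1.5 via HuggingFace `diffusers` and CLIP ViT-L/14 via
# `transformers`; the projection layer and batch shapes are illustrative.
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModel
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler

device = "cuda"
sd_repo = "runwayml/stable-diffusion-v1-5"

clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
vae = AutoencoderKL.from_pretrained(sd_repo, subfolder="vae").to(device).requires_grad_(False)
unet = UNet2DConditionModel.from_pretrained(sd_repo, subfolder="unet").to(device).requires_grad_(False)
scheduler = DDPMScheduler.from_pretrained(sd_repo, subfolder="scheduler")

# Illustrative projection from CLIP's hidden size (1024) to the UNet's
# cross-attention width (768); trained jointly with CLIP.
proj = torch.nn.Linear(clip.config.hidden_size, unet.config.cross_attention_dim).to(device)
optimizer = torch.optim.AdamW(list(clip.parameters()) + list(proj.parameters()), lr=1e-6)

def diva_step(clip_pixels, vae_pixels):
    # clip_pixels: (B, 3, 224, 224), normalized for CLIP.
    # vae_pixels:  (B, 3, 512, 512), scaled to [-1, 1] for the SD VAE.
    # Dense CLIP tokens (CLS + patches) replace text embeddings as the
    # diffusion model's condition; no captions are needed.
    tokens = clip(pixel_values=clip_pixels).last_hidden_state   # (B, 257, 1024)
    cond = proj(tokens)                                         # (B, 257, 768)

    with torch.no_grad():
        latents = vae.encode(vae_pixels).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=device)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    # The frozen UNet's noise-prediction error is the generative feedback;
    # gradients flow through `cond` back into CLIP's vision encoder.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the VAE and UNet stay frozen, the only way to reduce the denoising loss is for CLIP's visual tokens to carry richer, more complete image detail, which is exactly the generative feedback signal described above.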

DIVA's Overall Architecture

[Figure: DIVA's overall architecture]

Fine-grained Visual Perception Evaluation

[Figure: fine-grained visual perception results on MMVP-VLM]

Backbone Enhancement Performance Evaluation

[Figure: backbone enhancement results on multimodal understanding and semantic segmentation]

Generalization Capability Evaluation

[Figure: generalization results on 29 image classification and retrieval benchmarks]

Qualitative Results

[Figure: qualitative results]

BibTeX

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}