In a recent study, researchers unveiled a method to significantly boost the performance of Vision and Language Models (VLMs), a technology at the forefront of visual recognition. By tapping into the broad knowledge of Large Language Models (LLMs), the team demonstrated improvements in domain-specific adaptation, fine-grained recognition, and zero-shot classification.
Vision and Language Models, exemplified by CLIP (Contrastive Language-Image Pre-Training), are renowned for their ability to recognize a virtually unlimited range of categories described by text prompts. These models have made rapid progress in recent years, turning open-vocabulary zero-shot recognition into a reality. The challenge lies in tailoring them to specific downstream tasks, which often differ substantially from the general web-based pre-training data.

The new approach, named Targeted-Prompting (TAP), addresses this challenge head-on. TAP prompts the LLM to generate text-only samples that emphasize the specific visual characteristics of a given task. These samples are then used to train a text classifier, which can classify visual data directly, without needing paired image-text data.
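A minimal sketch of this idea, assuming the open-source CLIP package and PyTorch, is shown below. The prompt templates, the hand-written stand-in for the LLM call, the example class names, and the linear classifier head are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of the TAP idea: generate task-targeted text samples, embed them with
# CLIP's text encoder, and train a classifier on text features alone.
import torch
import torch.nn as nn
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["archery", "bowling", "surfing"]  # e.g. a few UCF-101 action labels

def llm_generate_descriptions(class_name, n_samples=4):
    """Stand-in for the LLM call. In TAP, an LLM is prompted to describe the
    visual appearance of each category; here we return fixed templates so the
    sketch runs end to end."""
    templates = [
        f"a video frame of a person doing {class_name}",
        f"a blurry photo showing the action of {class_name}",
        f"a scene where someone practices {class_name} outdoors",
        f"a close-up shot typical of {class_name} footage",
    ]
    return templates[:n_samples]

# 1. Generate text-only training samples that stress visual characteristics.
texts, labels = [], []
for idx, name in enumerate(class_names):
    for sentence in llm_generate_descriptions(name):
        texts.append(sentence)
        labels.append(idx)

# 2. Embed the generated sentences with CLIP's text encoder.
with torch.no_grad():
    tokens = clip.tokenize(texts, truncate=True).to(device)
    text_feats = model.encode_text(tokens).float()
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# 3. Train a lightweight classifier head on the text embeddings alone.
classifier = nn.Linear(text_feats.shape[1], len(class_names)).to(device)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
targets = torch.tensor(labels, device=device)
for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(classifier(text_feats), targets)
    loss.backward()
    optimizer.step()
```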
The researchers tested TAP on a variety of datasets and observed improvements across the board. For instance, on domain-specific datasets such as UCF-101 and ImageNet-Rendition, TAP delivered significant performance gains.
A key aspect of this study is its exploitation of the shared text-image embedding space learned by models like CLIP. This property enables effective cross-modal transfer: training on text data and applying the learned knowledge to visual recognition tasks. The strategy opens up new horizons at the intersection of computer vision and language modeling. By reducing the reliance on vast visual datasets and harnessing the power of text data, TAP could pave the way for more efficient and adaptable visual recognition systems.
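The cross-modal transfer step might look roughly like the following, again assuming the CLIP package and the text-trained classifier head from the earlier sketch; the helper name classify_image is hypothetical. Because CLIP places text and images in the same normalized embedding space, a head fit on text features can score image features without any retraining.

```python
# Apply a classifier trained on CLIP text embeddings directly to images.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_image(path, classifier, class_names):
    # Encode the image into the same embedding space as the training text.
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feats = model.encode_image(image).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)  # match text-side normalization
        probs = classifier(feats).softmax(dim=-1)
    return class_names[probs.argmax().item()]
```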