Attribute Focused Object Detection with Vision-Language Models
Anticipating Future Object Compositions without Forgetting
Abstract
Despite significant advancements in computer vision models, their ability to generalize to novel object-attribute combinations remains limited. In Compositional Zero-Shot Learning (CZSL), the goal is to recognize all possible attribute-object combinations while training on only a subset of these compositions. We demonstrate that Vision-Language Models (VLMs) are well suited to CZSL because they understand both text and image modalities, and we show the importance of cross-modality fusion. Existing CZSL methods focus mainly on image classification; in this thesis, we extend CZSL to object detection without forgetting previously learned knowledge. We use Grounding DINO, a VLM for object detection, and create auxiliary tokens for the words present in the training dataset, training only these tokens to mitigate forgetting. This technique, Compositional Soft Prompting (CSP), was originally developed for image classification; we incorporate it into Grounding DINO to apply it to object detection. We show that for CZSL it is beneficial to anticipate the orientation of embeddings in the embedding space, promoting independence between attributes and objects. Additionally, by anticipating the compositions that could exist given the attributes and objects present in the training data, we assign soft labels for partial correctness using a method we call Compositional Smoothing. This guides the model to learn the constituents of compositions rather than the compositions themselves. We refer to the combination of these methods as Compositional Anticipation. Our approach achieves a 70.5% improvement over CSP on the harmonic mean (HM) between seen and unseen compositions on the CLEVR dataset. Furthermore, we introduce Contrastive Prompt Tuning to incrementally address model confusion between similar compositions. We demonstrate the effectiveness of this method, achieving an increase of 14.5% in HM across the pretrain, increment, and unseen sets. Collectively, these methods provide a framework for learning a wide range of compositions from limited data and for improving the performance of underperforming compositions when additional data becomes available.
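To make the soft-labeling idea behind Compositional Smoothing concrete, here is a minimal hypothetical sketch: a composition sharing either the attribute or the object with the ground truth receives partial label mass. The function name, the `partial` weight, and the normalization scheme are illustrative assumptions, not the thesis's actual formulation.

```python
# Hypothetical sketch of compositional soft labels: the true (attribute,
# object) pair gets full weight, compositions that match only one
# constituent get partial credit, and the weights are normalized into a
# target distribution. Weights are assumptions for illustration only.
from itertools import product


def compositional_soft_labels(true_attr, true_obj, attrs, objs, partial=0.1):
    """Return {(attr, obj): soft label} over all anticipated compositions."""
    raw = {}
    for a, o in product(attrs, objs):
        if (a, o) == (true_attr, true_obj):
            raw[(a, o)] = 1.0          # fully correct composition
        elif a == true_attr or o == true_obj:
            raw[(a, o)] = partial      # partially correct: one constituent matches
        else:
            raw[(a, o)] = 0.0          # neither constituent matches
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}


labels = compositional_soft_labels(
    "red", "cube", attrs=["red", "blue"], objs=["cube", "sphere"]
)
```

Trained against such targets, the model is rewarded for getting an attribute or object right even when the full composition is wrong, which encourages it to learn the constituents rather than memorize whole compositions.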