Does text matter?
Extending CLIP with OCR and NLP for image classification and retrieval
Abstract
Contrastive Language-Image Pretraining (CLIP) has attracted broad interest due to its impressive performance on a variety of computer vision tasks: image classification, image retrieval, action recognition, feature extraction, and more. The model learns to associate images with their descriptions, a powerful approach that allows it to perform well on unseen domains. However, these descriptions often fail to capture the text contained within the image itself, a source of information that could prove useful for several computer vision tasks. This limitation forces fine-tuning in domains where embedded text is important, and CLIP itself shows mixed performance on Optical Character Recognition (OCR). This paper proposes a novel architecture, OSBC (OCR Sentence BERT CLIP), which combines CLIP with a custom text extraction pipeline composed of an OCR model and a Natural Language Processing (NLP) model. OSBC uses the text contained within images as an additional feature when performing image classification and retrieval. We evaluated the model on multiple datasets for each task: OSBC occasionally outperforms CLIP when images contain text, remains fine-tunable, and improves robustness. In addition, OSBC was designed to be generalizable, meaning it is expected to perform well on unseen domains without fine-tuning, although this was not achieved in practice.
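The abstract describes the OSBC design only at a high level. The sketch below is a minimal, hypothetical illustration of that kind of two-branch pipeline, not the paper's implementation: CLIP scores candidate labels against the image, while an OCR model (pytesseract here, an assumption) plus a Sentence-BERT encoder scores the same labels against text extracted from the image, and the two branches are merged by a simple weighted sum. The specific models, the fusion rule, and the weight `alpha` are all illustrative assumptions.

```python
# Minimal sketch of an OSBC-style classifier (illustrative only, not the
# authors' code). Branch 1: CLIP image-vs-label similarity. Branch 2:
# OCR text from the image compared to the labels via Sentence-BERT.
import torch
from PIL import Image
import pytesseract
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer, util

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def classify(image_path: str, labels: list[str], alpha: float = 0.7) -> str:
    image = Image.open(image_path).convert("RGB")

    # Branch 1: standard CLIP zero-shot scoring of the candidate labels.
    inputs = proc(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    clip_scores = out.logits_per_image.softmax(dim=-1).squeeze(0)

    # Branch 2: OCR the image, then score the extracted text against the
    # same labels with Sentence-BERT.
    ocr_text = pytesseract.image_to_string(image).strip()
    if ocr_text:
        ocr_emb = sbert.encode(ocr_text, convert_to_tensor=True)
        label_embs = sbert.encode(labels, convert_to_tensor=True)
        text_scores = util.cos_sim(ocr_emb, label_embs).squeeze(0).softmax(dim=-1)
    else:
        # No readable text in the image: fall back to the CLIP branch alone.
        text_scores = torch.zeros_like(clip_scores)
        alpha = 1.0

    # Late fusion of the two branches (weighted sum is an assumption).
    fused = alpha * clip_scores + (1 - alpha) * text_scores
    return labels[int(fused.argmax())]
```

The same two-branch scoring could in principle be reused for retrieval by ranking a gallery of images against a query instead of ranking labels against a single image.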