Video Captioning for the Visually Impaired


Abstract

Visual impairment affects over 2.2 billion people worldwide, underscoring the need for effective assistive technologies. This work develops a video captioning model tailored specifically to visually impaired users, leveraging recent advances in deep learning. Video captioning converts video frames into textual descriptions, bridging computer vision (CV) and natural language processing (NLP). We surveyed young visually impaired individuals from the Visio organization, whose feedback informed key design decisions for our model.
We enhance the existing S2VT model by modifying its temporal attention mechanism to improve recognition of visual surroundings, addressing the distinct challenges that visually impaired individuals face.
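
The abstract does not give implementation details for this mechanism. As a minimal sketch, a temporal attention layer of the kind described weights per-frame encoder features by their relevance to the current decoder state; the PyTorch code below illustrates the idea with hypothetical names and dimensions (feat_dim, hidden_dim, attn_dim) and is not the authors' exact implementation.

```python
# Sketch of a temporal attention layer over per-frame encoder features,
# as could be added to an S2VT-style encoder-decoder. Names and sizes
# are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, attn_dim)    # project frame features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score

    def forward(self, frame_feats: torch.Tensor, dec_state: torch.Tensor):
        # frame_feats: (batch, num_frames, feat_dim) CNN features per sampled frame
        # dec_state:   (batch, hidden_dim) current decoder hidden state
        energy = torch.tanh(
            self.frame_proj(frame_feats) + self.state_proj(dec_state).unsqueeze(1)
        )                                                    # (batch, num_frames, attn_dim)
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)  # attention over frames
        context = torch.bmm(weights.unsqueeze(1), frame_feats).squeeze(1)  # weighted sum
        return context, weights
```

At each decoding step, the returned context vector is concatenated with the word embedding (or decoder input) so that caption generation can focus on the frames most relevant to the word being produced.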
This research explores critical questions surrounding the model's sensitivity to actions, the readability of generated captions, and methods for reducing latency. To evaluate the model's effectiveness, we apply readability metrics, an approach not previously used in video captioning evaluation. Our findings contribute to enhancing accessibility and independence for visually impaired individuals through advanced video captioning solutions.
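
The abstract does not name the specific readability metrics used. As an illustration only, standard formulas such as Flesch Reading Ease and Flesch-Kincaid Grade can be computed over generated captions, for example with the textstat package; the snippet below is a sketch under that assumption.

```python
# Illustrative readability scoring of generated captions.
# The paper's actual metric choice is not stated in the abstract; Flesch
# Reading Ease and Flesch-Kincaid Grade are shown as common examples.
# Requires: pip install textstat
import textstat

captions = [
    "A person walks across a busy street while cars wait at the light.",
    "Two children play with a dog in a park.",
]

for caption in captions:
    ease = textstat.flesch_reading_ease(caption)     # higher score = easier to read
    grade = textstat.flesch_kincaid_grade(caption)   # approximate US school grade level
    print(f"{caption!r}: ease={ease:.1f}, grade={grade:.1f}")
```

Such scores complement standard captioning metrics (e.g., BLEU or METEOR) by measuring how easily the generated text can be understood when read aloud by a screen reader.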