Automated Text-Image Comic Dataset Construction
Abstract
Comic illustrations and their transcriptions form an attractive dataset for several problems, including computer vision tasks, such as recognizing characters' faces or generating new comics, and natural language processing tasks, such as automated comic translation or detecting emotion in dialogue. However, despite the large number of comic strips published online, very few datasets of annotated comic illustrations are available, which forms a bottleneck for further advances in the field. The scarcity stems from the manual labor required for annotation: one has to download the comic strips, separate each strip into panels (individual illustrations), and transcribe the text. Automating the process is needed, but it poses several challenges. Panel detection in comic strips is non-trivial due to the varying layouts and styles of comics. Automated transcription is also challenging, as out-of-the-box optical character recognition (OCR) models struggle with diverse fonts, handwriting styles, and backgrounds. We design an automatic comic text-image dataset construction pipeline, termed DCP, consisting of three components: (i) web scraping, (ii) panel extraction, and (iii) text extraction. A multi-threaded comic scraper downloads the comics. A panel extraction algorithm based on panel frame detection divides each comic strip into individual illustrations. Lastly, to extract text effectively with OCR, we propose additional pre-processing and post-processing steps, namely up-scaling and binarizing images, clustering-based text ordering, and dictionary-based autocorrect. We extensively evaluate a prototype of DCP on three comic series: PHD Comics, Dilbert, and Garfield. Web scraping downloads over 25,000 comic strips at an average pace of 149 ms per image.
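The dictionary-based autocorrect step named above can be illustrated with a minimal sketch. This is not the pipeline's actual implementation (the abstract gives no details); it uses Python's standard-library `difflib` for fuzzy matching, and the word list and OCR tokens are illustrative assumptions:

```python
import difflib

def autocorrect(word: str, dictionary: list[str]) -> str:
    """Replace an OCR token with its closest dictionary entry, if any."""
    # Exact matches (case-insensitive) pass through unchanged.
    if word.lower() in (w.lower() for w in dictionary):
        return word
    # Otherwise pick the closest dictionary word above a similarity cutoff,
    # correcting common OCR confusions such as "1" read in place of "i".
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else word

# Toy dictionary and OCR output (illustrative, not from the paper).
vocab = ["thesis", "deadline", "advisor", "coffee"]
tokens = ["thes1s", "deadl1ne", "coffee"]
corrected = [autocorrect(t, vocab) for t in tokens]
# → ["thesis", "deadline", "coffee"]
```

A real post-processor would use a much larger dictionary and tune the similarity cutoff to avoid "correcting" proper nouns and onomatopoeia, which are common in comics.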
Panel extraction, evaluated on 1118 panels, achieves success rates of 100%, 97%, and 71% for Dilbert, PHD Comics, and Garfield, respectively, outperforming the baseline in both accuracy and speed. The text extraction algorithm, tested on 1100 comics, achieves a 7x error reduction compared to out-of-the-box OCR.
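The panel-splitting idea can be sketched as follows. This toy version splits a binarized strip along all-background gutter columns; it is a simplified stand-in for the paper's frame-detection algorithm, whose details are not given in the abstract, and the input grid is an illustrative assumption:

```python
def split_columns(strip: list[list[int]]) -> list[tuple[int, int]]:
    """Split a binarized strip (0 = background, 1 = ink) into panel
    column ranges by locating all-background gutter columns."""
    width = len(strip[0])
    # A column is a gutter if no row has ink in it.
    is_gutter = [all(row[x] == 0 for row in strip) for x in range(width)]
    panels, start = [], None
    for x, gutter in enumerate(is_gutter):
        if not gutter and start is None:
            start = x                  # a panel begins at the first inked column
        elif gutter and start is not None:
            panels.append((start, x))  # the panel ends where the gutter starts
            start = None
    if start is not None:
        panels.append((start, width))  # strip ends inside a panel
    return panels

# Toy 3-row strip: two inked regions separated by a blank gutter.
strip = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1],
]
# split_columns(strip) → [(0, 2), (4, 6)]
```

Real strips rarely have perfectly clean gutters, which is why frame detection (finding the drawn panel borders themselves) is more robust to varying layouts than gutter-based splitting.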