Automated Text-Image Comic Dataset Construction
Abstract
Comic illustrations and their transcriptions form an attractive dataset for several problems, including computer vision tasks, such as recognizing characters' faces or generating new comics, and natural language processing tasks, such as automated comic translation or detecting emotion in dialogue. However, despite the large number of comic strips published online, very few datasets of annotated comic illustrations are available, which forms a bottleneck for further advances in the field. The scarcity stems from the manual labor required for annotation: one has to download the comic strips, separate each strip into panels (individual illustrations), and transcribe the text. Automating the process is needed, but it poses several challenges. Panel detection in comic strips is non-trivial due to the varying layouts and styles of comics. Automated transcription is also challenging, as out-of-the-box optical character recognition (OCR) models struggle with diverse fonts, handwriting styles, and backgrounds. We design an automatic comic text-image dataset construction pipeline, termed DCP, consisting of three components: (i) web scraping, (ii) panel extraction, and (iii) text extraction. A multi-threaded comic scraper downloads the comics. A panel extraction algorithm based on panel frame detection divides each comic strip into individual illustrations. Lastly, to extract text effectively with OCR, we propose additional pre-processing and post-processing steps, namely up-scaling and binarizing images, clustering-based text ordering, and dictionary-based autocorrect. We extensively evaluate a prototype of DCP on three comic series: PHD Comics, Dilbert, and Garfield. Web scraping downloads over 25,000 comic strips at an average pace of 149 ms per image.
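The dictionary-based autocorrect step named above can be illustrated with a minimal sketch. This is not the pipeline's actual implementation (the abstract gives no details); it uses Python's standard-library `difflib` for fuzzy matching, and the word list and OCR tokens are illustrative assumptions:

```python
import difflib

def autocorrect(word: str, dictionary: list[str]) -> str:
    """Replace an OCR token with its closest dictionary entry, if any."""
    # Exact matches (case-insensitive) pass through unchanged.
    if word.lower() in (w.lower() for w in dictionary):
        return word
    # Otherwise pick the closest dictionary word above a similarity cutoff,
    # correcting common OCR confusions such as "1" read in place of "i".
    matches = difflib.get_close_matches(word.lower(), dictionary, n=1, cutoff=0.8)
    return matches[0] if matches else word

# Toy dictionary and OCR output (illustrative, not from the paper).
vocab = ["thesis", "deadline", "advisor", "coffee"]
tokens = ["thes1s", "deadl1ne", "coffee"]
corrected = [autocorrect(t, vocab) for t in tokens]
# → ["thesis", "deadline", "coffee"]
```

A real post-processor would use a much larger dictionary and tune the similarity cutoff to avoid "correcting" proper nouns and onomatopoeia, which are common in comics.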
Panel extraction, evaluated on 1118 panels, achieves success rates of 100%, 97%, and 71% for Dilbert, PHD Comics, and Garfield, respectively, outperforming the baseline in both accuracy and speed. The text extraction algorithm, tested on 1100 comics, achieves a 7x error reduction compared to out-of-the-box OCR.
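The panel-splitting idea can be sketched as follows. This toy version splits a binarized strip along all-background gutter columns; it is a simplified stand-in for the paper's frame-detection algorithm, whose details are not given in the abstract, and the input grid is an illustrative assumption:

```python
def split_columns(strip: list[list[int]]) -> list[tuple[int, int]]:
    """Split a binarized strip (0 = background, 1 = ink) into panel
    column ranges by locating all-background gutter columns."""
    width = len(strip[0])
    # A column is a gutter if no row has ink in it.
    is_gutter = [all(row[x] == 0 for row in strip) for x in range(width)]
    panels, start = [], None
    for x, gutter in enumerate(is_gutter):
        if not gutter and start is None:
            start = x                  # a panel begins at the first inked column
        elif gutter and start is not None:
            panels.append((start, x))  # the panel ends where the gutter starts
            start = None
    if start is not None:
        panels.append((start, width))  # strip ends inside a panel
    return panels

# Toy 3-row strip: two inked regions separated by a blank gutter.
strip = [
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1],
]
# split_columns(strip) → [(0, 2), (4, 6)]
```

Real strips rarely have perfectly clean gutters, which is why frame detection (finding the drawn panel borders themselves) is more robust to varying layouts than gutter-based splitting.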