Enhancing Historical Dutch OCR Accuracy with Post-Correction & Synthetic Data

Abstract

This paper presents a novel approach to synthetic data generation for OCR post-correction, using background and font variations tailored to specific time periods. The goal is to use synthetic data to improve text accuracy in digitized historical documents. The proposed three-step process begins by generating synthetic images that emulate the characteristics of historical documents from different years, incorporating year-specific backgrounds and fonts. A dataset is then created from these images, and multiple T5 sequence-to-sequence transformers are fine-tuned on it. The trained models are shown to improve the OCR output, bringing it closer to the ground-truth text. The approach is evaluated with several performance metrics, which highlight the benefits of training on year-specific synthetic data. This work contributes to the field of OCR post-correction by providing a framework for improving the accuracy of OCR text from historical documents.
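As a rough illustration of the kind of data and metric such a pipeline involves, the sketch below pairs noisy OCR output with its ground-truth text, formats the pair for a text-to-text model, and scores it with character error rate (CER). Everything in it is an assumption made for illustration: the abstract does not name the metrics used, and the `CorrectionPair` class, the `correct: ` task prefix, and the example strings are hypothetical rather than taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class CorrectionPair:
    """One hypothetical training example for OCR post-correction."""
    year: int          # year whose fonts/backgrounds the synthetic image emulated
    ocr_text: str      # what an OCR engine read from the synthetic image
    ground_truth: str  # the text that was rendered into the image


def to_seq2seq(pair, prefix="correct: "):
    """Format a pair as (input, target) strings for a text-to-text model.

    The task prefix is an assumed convention, not one confirmed by the source.
    """
    return prefix + pair.ocr_text, pair.ground_truth


def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    # Invented example: a long s misread as "f", a typical confusion in old print.
    pair = CorrectionPair(1780, "de ftad Amfterdam", "de stad Amsterdam")
    source, target = to_seq2seq(pair)
    print(source, "->", target)
    print("CER before correction:", cer(pair.ocr_text, pair.ground_truth))
```

Comparing the CER of the raw OCR text with the CER of a model's corrected output against the same ground truth gives one simple way to quantify whether post-correction helped.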

Files

Thesis_Thomas_Eckhardt.pdf
(pdf | 19.7 MB)
Unknown license