Post-OCR text correction for Bulgarian historical documents
Beshirov, Angel and Dobreva, Milena and Dimitrov, Dimitar and Hardalov, Momchil and Koychev, Ivan and Nakov, Preslav (2025) Post-OCR text correction for Bulgarian historical documents. International Journal on Digital Libraries, 26. 4. ISSN 1432-5012 (https://doi.org/10.1007/s00799-025-00415-x)
Text. Filename: Beshirov-etal-IJDL-2025-Post-OCR-text-correction-for-Bulgarian-historical-documents.pdf
Final Published Version (841kB)
Abstract
The digitization of historical documents is crucial for preserving society's cultural heritage. An essential step in this process is converting scanned images to text using Optical Character Recognition (OCR), which enables downstream tasks such as search and information extraction. Unfortunately, this is a challenging problem, as standard OCR tools are not tailored to historical orthography or difficult layouts. Thus, it is standard to apply an additional text correction step to the OCR output when dealing with such documents. In this work, we focus on Bulgarian, and we create the first benchmark dataset for evaluating OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition. It improves the quality of the documents by 25%, which is an increase of 16% compared to the state-of-the-art on the ICDAR 2019 Bulgarian dataset. We release our data and code at https://github.com/angelbeshirov/post-ocr-text-correction.
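The abstract reports quality gains as percentages without spelling out the computation. As a rough illustration only, the sketch below shows one common way such an improvement can be measured: the relative reduction in character-level edit distance to a reference transcription. This is an assumption for illustration, not necessarily the exact metric used in the paper, and the function names `levenshtein` and `relative_improvement` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def relative_improvement(ocr_text: str, corrected_text: str, reference: str) -> float:
    """Percentage reduction in edit distance to the reference after correction.

    Assumed metric for illustration: positive values mean the correction
    moved the text closer to the reference, negative values mean it got worse.
    """
    before = levenshtein(ocr_text, reference)
    after = levenshtein(corrected_text, reference)
    if before == 0:
        return 0.0  # nothing to correct in the raw OCR output
    return 100.0 * (before - after) / before
```

For example, if the raw OCR output is 2 edits away from the reference and the corrected text is 1 edit away, `relative_improvement` returns 50.0.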
ORCID iDs
Beshirov, Angel, Dobreva, Milena
Item type: Article
ID code: 92317
Dates: 21 February 2025 (Published); 21 February 2025 (Published Online); 19 January 2025 (Accepted)
Subjects: Bibliography. Library Science. Information Resources > Library Science. Information Science
Department: Faculty of Science > Computer and Information Sciences
Depositing user: Pure Administrator
Date deposited: 12 Mar 2025 11:47
Last modified: 12 Mar 2025 11:47
URI: https://strathprints.strath.ac.uk/id/eprint/92317