Table of Links
Abstract and 1. Introduction
1.1 Printing Press in Iraq and Iraqi Kurdistan
1.2 Challenges in Historical Documents
1.3 Kurdish Language
-
Related work and 2.1 Arabic/Persian
2.2 Chinese/Japanese and 2.3 Coptic
2.4 Greek
2.5 Latin
2.6 Tamizhi
-
Method and 3.1 Data Collection
3.2 Data Preparation and 3.3 Preprocessing
3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation
-
Experiments, Results, and Discussion and 4.1 Processed Data
4.2 Dataset and 4.3 Experiments
4.4 Results and Evaluation
4.5 Discussion
-
Conclusion
5.1 Challenges and Limitations
Online Resources, Acknowledgments, and References
2.5 Latin
Vamvakas et al. (2008) presented a complete OCR methodology for recognizing historical documents that can be applied to both machine-printed and handwritten material. Because it adapts to the type of documents being processed, it requires no prior knowledge of fonts or character databases. The methodology consists of three steps: the first two create a database for training based on a set of documents, and the third recognizes new documents. First, a pre-processing step performs image binarization and enhancement. Second, a top-down segmentation approach detects text lines, words, and characters, and a clustering scheme groups characters of similar shapes; the user can intervene at any point to correct clustering errors and assign an ASCII label to each cluster. Upon completion of this step, a character database is available for recognition. Third, the same segmentation approach is applied to every new document image, and recognition is performed against the character database created in the previous step. In the reported experiments, the model reached an accuracy of 83.66%. The authors plan to improve these results by exploring new segmentation approaches and new types of features.
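As a concrete illustration of the clustering step described above, the following minimal Python sketch groups segmented character crops by shape so that each cluster can be labelled once by the user; the function name, feature size, and cluster count are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch: group segmented character crops by shape so a user can
# assign one ASCII label per cluster. Parameters are illustrative assumptions.
import numpy as np
from skimage.transform import resize
from sklearn.cluster import KMeans

def cluster_character_crops(crops, n_clusters=50, size=(24, 24)):
    """crops: list of 2D binary numpy arrays, one per segmented character."""
    feats = np.stack([resize(c.astype(float), size).ravel() for c in crops])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    return labels  # the user then corrects clusters and labels each one
```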
In typical OCR systems, binarization is a crucial preprocessing stage in which the input image is transformed into a binary form by removing unwanted elements, resulting in a clean, binarized version for further processing. However, binarization is not always perfect, and artifacts introduced during this process can lead to the loss of important details, such as distorted or fragmented character shapes. Particularly in historical documents, which are prone to higher levels of noise and degradation, binarization methods tend to perform poorly, impeding the effectiveness of the overall recognition pipeline. To address this issue, Yousefi et al. (2015) propose an alternative approach that bypasses the traditional binarization step by training a 1D LSTM network directly on gray-level text lines. For their experiments, they curated a large dataset of historical Fraktur documents from publicly accessible online sources, which served as training and test sets for both grayscale and binary text lines. To investigate the impact of resolution, they also used both low- and high-resolution sets in their experiments. The results demonstrated the effectiveness of the 1D LSTM network compared to binarization: the network achieved significantly lower error rates, outperforming binarization by 24% on the low-resolution set and 19% on the high-resolution set. By processing gray-level text data directly with LSTM networks, this approach avoids the limitations and artifacts associated with traditional binarization and proves particularly beneficial for historical documents.
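The core idea, feeding raw gray-level line images into a 1D LSTM without a binarization step, can be sketched as follows; this is a simplified stand-in for the OCRopus-style setup used by the authors, and the layer sizes and class count are assumptions.

```python
# Sketch of a 1D LSTM that reads gray-level text lines column by column,
# with no binarization step. Sizes and class count are illustrative.
import torch
import torch.nn as nn

class GrayLineLSTM(nn.Module):
    def __init__(self, line_height=48, hidden=128, n_classes=100):
        super().__init__()
        # each image column (line_height gray values) is one timestep
        self.lstm = nn.LSTM(line_height, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # per-timestep character logits

    def forward(self, lines):        # lines: (batch, width, line_height), values in [0, 1]
        out, _ = self.lstm(lines)    # raw gray columns go in directly
        return self.head(out)        # decode/train with CTC, e.g. nn.CTCLoss
```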
According to Springmann et al. (2016), achieving accurate OCR results for historical printings requires training recognition models using diplomatic transcriptions, which are scarce and time-consuming resources. To overcome this challenge, the authors propose a novel method that avoids training separate models for each historical typeface. Instead, they employ mixed models initially trained on transcriptions from six printings spanning the years 1471 to 1686, encompassing various fonts. The results demonstrate that using mixed models yields character accuracy rates exceeding 90% when evaluated on a separate test set comprising six additional printings from the same historical period. This finding suggests that the typography barrier can be overcome by expanding the training beyond a limited number of fonts to encompass a broader range of (similar) fonts used over time. The outputs of the mixed models serve as a starting point for further development using both fully automated methods, which employ the OCR results of mixed models as pseudo ground truth for training subsequent models, and semi-automated methods that require minimal manual transcriptions. In the absence of actual ground truth, the authors introduce two easily observable quantities that exhibit a strong correlation with the actual accuracy of each generated model during the training process. These quantities are the mean character confidence (C), determined by the OCR engine OCRopus, and the mean token lexicality (L), which measures the distance between OCR tokens and modern wordforms while accounting for historical spelling patterns. Through an ordinal classification scale, the authors determine the most effective model in recognition, taking into account the calculated C and L values. The results reveal that a wholly automatic method only marginally improves OCR results compared to the mixed model, whereas hand-corrected lines significantly enhance OCR accuracy, resulting in considerably lower character error rates. The objective of this approach is to minimize the need for extensive ground truth generation and to avoid relying solely on a pre-existing typographical model. By leveraging mixed models and incorporating manual corrections, the proposed method demonstrates advancements in OCR results for historical printings, offering a more efficient and effective approach to training recognition models.
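As a rough illustration of the mean token lexicality L, the sketch below scores each OCR token by its similarity to the closest entry in a modern lexicon and averages the result; the authors' measure additionally accounts for historical spelling patterns, so this proxy and the lexicon are assumptions rather than their exact definition.

```python
# Simplified proxy for mean token lexicality: average similarity of OCR tokens
# to their closest lexicon entry (the real measure also models historical spellings).
import difflib

def mean_token_lexicality(ocr_tokens, lexicon):
    def closeness(token):
        best = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=0.0)
        return difflib.SequenceMatcher(None, token.lower(), best[0]).ratio() if best else 0.0
    return sum(closeness(t) for t in ocr_tokens) / max(len(ocr_tokens), 1)
```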
Bukhari et al. (2017) introduced the "anyOCR" system, which focuses on the accurate digitization of historical archives. Being open-source, the system allows the research community to easily employ anyOCR for digitizing historical archives. It encompasses a comprehensive document processing pipeline that supports various stages, including layout analysis, OCR model training, text line prediction, and web applications for layout analysis and OCR error correction. One notable feature of anyOCR is its capability to handle contemporary images of documents with diverse layouts, ranging from simple to complex. For text recognition, anyOCR leverages LSTM networks, as modern OCR systems do. Moreover, anyOCR incorporates an unsupervised OCR training framework called anyOCRModel, which can be readily trained for any script and language. To address layout and OCR errors, anyOCR offers web applications with interactive tools: the anyLayoutEdit component enables users to rectify layout issues, while the anyOCREdit component allows for the correction of OCR errors. Additionally, the research community can access a Dockerized Virtual Machine (VM) that comes pre-installed with most of the essential components, facilitating easy setup and deployment. By providing these components and tools, anyOCR empowers the research community to use and extend them according to their specific requirements, encouraging further refinement and advancements in the field of historical archive digitization.
Springmann et al. (2018) provided resources for historical OCR in the form of the GT4HistOCR dataset, which consists of printed text line images accompanied by corresponding transcriptions. The dataset comprises 313,173 line pairs in total, drawn from printings spanning the 15th to the 19th centuries, from incunabula to 19th-century Fraktur, and is made publicly available under the CC-BY 4.0 license, ensuring accessibility and usability. GT4HistOCR is particularly well suited for training advanced recognition models in OCR software based on recurrent neural networks with the LSTM architecture, such as Tesseract 4 or OCRopus. To assist researchers, the authors also provide pretrained OCRopus models tailored to subsets of the dataset. These pretrained models demonstrate impressive character accuracy rates of 95 percent for early printings and 98.5 percent for 19th-century Fraktur printings, showcasing their effectiveness even on unseen test cases.
According to Nunamaker et al. (2016), historical document images must be accompanied by ground truth text for training an OCR system. However, this process typically requires linguistic experts to manually collect the ground truths, which can be time-consuming and labor-intensive. To address this challenge, the authors propose a framework that enables the autonomous generation of training data using labelled character images and a digital font, eliminating the need for manual data generation. In their approach, instead of using actual text from sample images as ground truth, the authors generate arbitrary and rule-based "meaningless" text for both the image and the corresponding ground truth text file. They also establish a correlation between the similarity of character samples in a subset and the performance of classification. This allows them to create upper- and lower-bound performance subsets for model generation using only the provided sample images. Surprisingly, their findings indicate that using more training samples does not necessarily improve model performance. Consequently, they focus on the case of using just one training sample per character. By training a Tesseract model with samples that maximize a dissimilarity metric for each character, the authors achieve a character recognition error rate of 15% on a custom benchmark of 15th-century Latin documents. In contrast, when a traditional Tesseract-style model is trained using synthetically generated training images derived from real text, the character recognition error rate increases to 27%. These results demonstrate the effectiveness of their approach in generating training data autonomously and improving the OCR performance for historical documents.
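The selection of a single, maximally dissimilar training sample per character might look like the following sketch; the concrete dissimilarity metric (mean pixel distance to the other samples of the same class) is an illustrative choice, not necessarily the one used in the paper.

```python
# Sketch: for each character class, pick the one sample that maximizes a
# dissimilarity metric. The metric here is an illustrative assumption.
import numpy as np

def pick_most_dissimilar(samples_by_char):
    """samples_by_char: dict mapping a character to a list of equally sized 2D arrays."""
    picked = {}
    for ch, samples in samples_by_char.items():
        stack = np.stack([s.astype(float) for s in samples])
        # dissimilarity of each sample = mean pixel distance to all samples of its class
        dissim = [np.mean(np.abs(stack - s)) for s in stack]
        picked[ch] = samples[int(np.argmax(dissim))]
    return picked
```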
Koistinen et al. (2017) documented the efforts undertaken by the National Library of Finland (NLF) to enhance the quality of OCR for their historical Finnish newspaper collection spanning the years 1771 to 1910. To analyze this collection, a sample of 500,000 words from the Finnish language section was selected. The sample consisted of three parallel sections: a manually corrected ground truth version, an OCR version corrected using ABBYY FineReader version 7 or 8, and an ABBYY FineReader version 11-reOCR version. Utilizing the original page images and this sample, the researchers devised a re-OCR procedure using the open-source software Tesseract version 3.04.01. The findings reveal that their method surpassed the performance of ABBYY FineReader 7 or 8 by 27.48 percentage points and ABBYY FineReader 11 by 9.16 percentage points. At the word level, their method outperformed ABBYY FineReader 7 or 8 by 36.25 percent and ABBYY FineReader 11 by 20.14 percent. The recall and precision results for the re-OCRing process, measured at the word level, ranged between 0.69 and 0.71, surpassing the previous OCR process. Additionally, other metrics such as the ability of the morphological analyzer to recognize words and the rate of character accuracy demonstrated a significant improvement following the re-OCRing process.
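Word-level precision and recall of a re-OCRed text against its ground truth can be computed along the following lines; the bag-of-words matching and the whitespace tokenization are simplifying assumptions, not the NLF's exact evaluation procedure.

```python
# Sketch of word-level precision/recall between OCR output and ground truth
# using bag-of-words counts; tokenization and casing are assumptions.
from collections import Counter

def word_precision_recall(ocr_text, gt_text):
    ocr, gt = Counter(ocr_text.lower().split()), Counter(gt_text.lower().split())
    hits = sum((ocr & gt).values())          # words matched, with multiplicity
    precision = hits / max(sum(ocr.values()), 1)
    recall = hits / max(sum(gt.values()), 1)
    return precision, recall
```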
Reul et al. (2018) examined the performance of OCR on 19th-century Fraktur scripts using mixed models, which are trained to recognize various fonts and typesets from previously unseen sources. The study outlines the training process employed to develop robust mixed OCR models and compares their performance to freely available models from popular open-source engines such as OCRopus and Tesseract, as well as to the most advanced commercial system, ABBYY. For evaluation on a substantial amount of previously unseen material, the researchers used 19th-century data extracted from books, journals, and a dictionary. The experiments showed that mixed models trained on real data yielded better results than those trained on synthetic data. Notably, the OCR engine Calamari demonstrated superior performance compared to the other engines assessed, achieving an average CER of less than 1 percent, a significant improvement over the CER exhibited by ABBYY.
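For reference, the character error rate (CER) quoted in these comparisons is the edit distance between the OCR output and the ground truth divided by the ground-truth length, as in the following sketch.

```python
# CER = Levenshtein edit distance(OCR output, ground truth) / length of ground truth.
def cer(ocr: str, gt: str) -> float:
    prev = list(range(len(gt) + 1))          # distances for the empty OCR prefix
    for i, o in enumerate(ocr, 1):
        cur = [i]
        for j, g in enumerate(gt, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (o != g)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(gt), 1)
```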
According to Romanello et al. (2021), commentaries have been a vital publication format in literary and textual studies for over a century, alongside critical editions and translations. However, the utilization of thousands of digitized historical commentaries, particularly those containing Greek text, has been hampered by the poor quality of OCR output for such material. In response, the researchers evaluated the performance of two OCR pipelines specifically on historical classical commentaries. Their study found that the combination of Kraken and Ciaconna achieved a significantly lower CER than Tesseract/OCR-D (average CER of 7% versus 13%) in sections of commentaries containing high levels of polytonic Greek text, whereas in sections predominantly composed of Latin script, Tesseract/OCR-D was marginally more accurate, with both pipelines averaging a CER of about 8.2%. Additionally, the study highlighted the availability of two resources: Pogretra, a substantial collection of training data and pre-trained models for ancient Greek typefaces, and GT4HistComment, a smaller dataset providing OCR ground truth for 19th-century classical commentaries.
According to Skelbye and Dannélls (2021), the use of deep CNN-LSTM hybrid neural networks has proven effective in improving the accuracy of OCR models for various languages. In their study, the authors specifically examined the impact of these networks on OCR accuracy for Swedish historical newspapers. Using the open-source OCR engine Calamari, they developed a mixed deep CNN-LSTM hybrid model that surpassed previous models when applied to Swedish historical newspapers from the period between 1818 and 1848. In their experiments on nineteenth-century Swedish newspaper text, they achieved an average Character Accuracy Rate (CAR) of 97.43 percent, establishing a new state-of-the-art benchmark in OCR performance.
Based on Aula (2021), scanned documents can contain deteriorations acquired over time or as a result of outdated printing methods. A variety of visual attributes can be observed in these documents, such as variations in style and font, broken characters, varying levels of ink intensity, noise, and damage caused by folding or ripping, among others. Many of these attributes are problematic for modern OCR tools and lead to failures in character recognition. To improve character recognition results, the study applied image processing methods and analyzed common image quality characteristics of scanned historical documents with unidentifiable text. The open-source Tesseract software was used for optical character recognition. To prepare the historical documents for Tesseract, Gaussian lowpass filtering, Otsu's optimum thresholding method, and morphological operations were employed. The OCR output was evaluated using precision and recall; recall improved by 63 percentage points and precision by 18 percentage points. This study demonstrated that image pre-processing methods are effective in improving the readability of historical documents for OCR tools.
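A minimal sketch of such a preprocessing chain (Gaussian low-pass filtering, Otsu thresholding, and a morphological operation) ahead of Tesseract is shown below; the kernel sizes and the use of an opening are illustrative assumptions, not the study's exact parameters.

```python
# Sketch of the preprocessing chain described above before OCR with Tesseract.
# Kernel sizes and the choice of a morphological opening are assumptions.
import cv2
import pytesseract

def preprocess_and_ocr(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                       # low-pass filtering
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)    # Otsu's thresholding
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)        # remove small noise
    return pytesseract.image_to_string(cleaned)
```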
According to Gilbey and Schönlieb (2021), historical and contemporary printed documents are often scanned at extremely low resolutions, such as 60 dots per inch (dpi). While humans can still read these scans fairly easily, OCR systems encounter significant challenges. The prevailing approach involves employing a super-resolution reconstruction method to enhance the image, which is then fed into a standard OCR system along with an approximation of the original high-resolution image. However, the researchers propose an end-to-end method that eliminates the need for the super-resolution phase, leading to superior OCR results. Their approach utilizes neural networks for OCR and draws inspiration from the human visual system. Remarkably, their experiments demonstrate that OCR can be successfully performed on scanned images of English text with a resolution as low as 60 dpi, which is considerably lower than the current state of the art. The results show an impressive Character Level Accuracy (CLA) of 99.7% and a Word Level Accuracy (WLA) of 98.9% across a corpus comprising approximately 1,000 pages of 60 dpi text in diverse fonts. For 75 dpi images, the mean CLA and WLA achieved were 99.9% and 99.4%, respectively.
Authors:
(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region – Iraq ([email protected]);
(2) Hossein Hassani, University of Kurdistan Howler, Kurdistan Region – Iraq ([email protected]).
This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.