Authors:
(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region – Iraq ([email protected]);
(2) Hossein Hassani University of Kurdistan Howler Kurdistan Region – Iraq ([email protected]).
Table of Links
Abstract and 1. Introduction
1.1 Printing Press in Iraq and Iraqi Kurdistan
1.2 Challenges in Historical Documents
1.3 Kurdish Language
-
Related work and 2.1 Arabic/Persian
2.2 Chinese/Japanese and 2.3 Coptic
2.4 Greek
2.5 Latin
2.6 Tamizhi
-
Method and 3.1 Data Collection
3.2 Data Preparation and 3.3 Preprocessing
3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation
-
Experiments, Results, and Discussion and 4.1 Processed Data
4.2 Dataset and 4.3 Experiments
4.4 Results and Evaluation
4.5 Discussion
-
Conclusion
5.1 Challenges and Limitations
Online Resources, Acknowledgments, and References
Abstract
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan. Having a good Optical Character Recognition (OCR) to help process these publications and contribute to the Kurdish language’s resources which is crucial as Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents as they have many issues, including being damaged, very fragile, having many marks left on them, and often written in non-standard fonts and more. This is a massive obstacle in processing these documents as currently processing them requires manual typing which is very time-consuming. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, that has been used to extract text for various languages. Currently, there is no public dataset, and we developed our own by collecting historical documents from Zheen Center for Documentation and Research, which were printed before 1950 and resulted in a dataset of 1233 images of lines with transcription of each. Then we used the Arabic model as our base model and trained the model using the dataset. We used different methods to evaluate our model, Tesseract’s built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%. Additionally, Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application to provide an easy- to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. Having an extensive dataset is crucial to develop OCR systems with reasonable accuracy, as currently, no public datasets are available for historical Kurdish documents; this posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved another challenge with our work.
1 Introduction
Over the course of centuries, human experience has produced invaluable treasures in the form of historical documents. Due to the large amount of work required for manual annotation and transcription of historical documents, many archives of historical documents remain inaccessible (Ataer and Duygulu, 2007). Through digitization, these documents can be understood and protected efficiently and effectively. In this process, actual documents are systematically converted into digital records based on the precise recognition of characters in the original document (Yang et al., 2018). Because of the demand for maintaining and making historical documents available for research without damaging physical copies, many languages and regions started practicing and studying digitization and preservation of the digital reproduction of historical documents (Nguyen et al., 2017). According to Poncelas et al. (2020), building Optical Character Recognition (OCR) that recognizes and extracts text from historical documents is a challenging task, and some unique sets of issues can affect the result of the model. Typeface inconsistency and bad-quality images are some examples of the challenges. Figure 1 is a sample page with these challenges. As a result of these issues, most of the advanced OCR systems produce errors which is why researchers continue their efforts to find new methods to enhance the OCR engines to generate better output.
Initially, historical documents were painstakingly created by hand, leading to their restricted availability and limited distribution. However, the introduction of the printing press by Johannes Gutenberg in 1436 in Germany marked a significant milestone. The printing press, a mechanical device designed for printing high-volume publications, revolutionized the production of historical documents. This apparatus applies pressure on an inked surface, as depicted in Figure ??. The printing press is widely recognized as one of the most remarkable accomplishments in history, facilitating the widespread dissemination and preservation of knowledge (Qania, 2012).
As for the Kurdish press history, it is about one century old, and the devices used for printing were hugely different from what we have today. The devices underwent many changes and improvements until we reached what we have today.
Publications printed with the printing press have various issues. One of them is the lack of standard font for writing, the use of many Arabic styles, and on top of them, all the books need to be in better shape as they are very fragile and damaged and there are many noticeable marks on them.
A few OCR systems currently support the Kurdish language, for example, the one by Idrees and Hassani (2021). Still, they cannot recognize these old publications due to the abovementioned issues. As for the old publications, some works have been done for the other languages that we go through in the literature review chapter.
This study focuses on enhancing an existing OCR system for the Kurdish language so we can recognize and extract text from historical Kurdish documents, which makes the related documents ready for further processing.
This paper is