Table of Links
Abstract and 1. Introduction
1.1 Printing Press in Iraq and Iraqi Kurdistan
1.2 Challenges in Historical Documents
1.3 Kurdish Language
2. Related work and 2.1 Arabic/Persian
2.2 Chinese/Japanese and 2.3 Coptic
2.4 Greek
2.5 Latin
2.6 Tamizhi
3. Method and 3.1 Data Collection
3.2 Data Preparation and 3.3 Preprocessing
3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation
4. Experiments, Results, and Discussion and 4.1 Processed Data
4.2 Dataset and 4.3 Experiments
4.4 Results and Evaluation
4.5 Discussion
5. Conclusion
5.1 Challenges and Limitations
Online Resources, Acknowledgments, and References
1.2 Challenges in Historical Documents
It is crucial to understand clearly the defects and degradations that occur in historical documents. Figure 3 illustrates the defects commonly encountered in degraded documents.
1.2.1 Uneven Illumination
Uneven illumination in optical imaging degrades light microscopy images because the incident light diminishes along its path as particles in the medium scatter it. Background objects, fluorescence overlays, and light scattering all contribute to the effect. Uneven illumination is also visible in historical documents, where it complicates document image analysis and, in particular, character recognition with OCR: the typical OCR process converts a grayscale image to a binary one before extracting text, and uneven illumination hinders that conversion. Figure 4 shows an example of uneven illumination in historical documents.
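To make the effect on binarization concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It compares one fixed global threshold with a local-mean threshold on a tiny synthetic page whose background brightens from left to right. The function names, the window radius, the offset, and the synthetic page are all illustrative assumptions.

```python
def global_threshold(image, threshold=128):
    """Label a pixel as text (1) when it is darker than one fixed threshold."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def local_mean_threshold(image, radius=2, offset=20):
    """Label a pixel as text (1) when it is darker than the mean of its
    neighbourhood (up to 2*radius+1 pixels wide) by more than `offset`."""
    height, width = len(image), len(image[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in rows for i in cols]
            local_mean = sum(window) / len(window)
            if image[y][x] < local_mean - offset:
                binary[y][x] = 1
    return binary


# Synthetic 5x12 page: the background rises from 60 (left) to 225 (right).
page = [[60 + 15 * x for x in range(12)] for _ in range(5)]
# Two "ink" pixels, each 50 gray levels darker than the local background.
page[2][2] -= 50
page[2][9] -= 50

global_result = global_threshold(page)
local_result = local_mean_threshold(page)

print(sum(map(sum, global_result)))  # number of pixels the global rule calls text
print(sum(map(sum, local_result)))   # number of pixels the local rule calls text
```

On this synthetic page the global threshold labels the whole dark left margin as text and misses the ink pixel on the bright side, whereas the local rule recovers only the two ink pixels.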
1.2.2 Contrast Variation
Contrast refers to the variation in brightness within an image. It primarily represents the differences between high-intensity and low-intensity pixels or the disparities between object pixels and background pixels. Factors like noise, sunlight, illumination, and occlusion can cause non-linear and expressive variations in contrast. These variations pose challenges for document image analysis algorithms, particularly in applying traditional threshold-based methods to distinguish foreground text from the background in historical and handwritten documents. To address this issue, image enhancement techniques can be employed prior to image binarization. Figure 5 shows an example of contrast variation in handwritten historical documents.
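As an illustration of enhancement before binarization, the sketch below applies a simple linear contrast stretch and then a fixed threshold. It is one possible enhancement among many and is not taken from the paper; the function names, the threshold of 128, and the sample gray levels are assumptions made for the example.

```python
def stretch_contrast(pixels):
    """Linearly rescale gray levels so the darkest maps to 0 and the brightest to 255."""
    low, high = min(pixels), max(pixels)
    if high == low:
        return list(pixels)  # flat image: nothing to stretch
    return [round(255 * (p - low) / (high - low)) for p in pixels]


def binarize(pixels, threshold=128):
    """Label a pixel as text (1) when it is darker than the threshold."""
    return [1 if p < threshold else 0 for p in pixels]


# A low-contrast strip: ink at 100-110, paper at 120-130.
faded = [100, 110, 120, 130]

print(binarize(faded))                    # [1, 1, 1, 0]: paper mistaken for ink
print(stretch_contrast(faded))            # [0, 85, 170, 255]
print(binarize(stretch_contrast(faded)))  # [1, 1, 0, 0]: ink and paper separated
```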
1.2.3 Bleed-Through Degradation
Bleed-through, also known as ink bleeding, is a phenomenon where ink from one side of a paper document transfers to the other side, making the text illegible. This poses a significant challenge in document binarization, which aims to separate the foreground text from the background. Researchers addressing this issue have faced two major challenges: limited access to high-resolution degraded documents and the difficulty of quantitatively analyzing restoration outcomes without ground truth data. Solutions involve generating degraded images based on known ground truth or using known degraded images as references. Performance analysis can still be conducted by evaluating the impact of restoration on subsequent processes, such as OCR. Figure 6 shows an example of ink-bleed degradation.
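The idea of generating degraded images from known ground truth can be sketched as follows. This is a deliberately simple, hypothetical blending model and not a method from the cited literature: the verso page is mirrored and a fraction of its ink darkness is subtracted from the clean recto. The function name and the `strength` parameter are assumptions.

```python
def add_bleed_through(recto, verso, strength=0.4):
    """Return a synthetic degraded recto.

    recto, verso: grayscale pages (0 = ink, 255 = paper) given as
    equal-sized lists of rows.
    strength: assumed fraction of the verso ink that shows on the recto.
    """
    degraded = []
    for recto_row, verso_row in zip(recto, verso):
        mirrored = verso_row[::-1]  # the verso is seen from behind, so flip it
        degraded.append([
            max(0, round(r - strength * (255 - v)))
            for r, v in zip(recto_row, mirrored)
        ])
    return degraded


clean_recto = [[255, 0, 255]]   # ground truth: one ink pixel in the middle
clean_verso = [[0, 255, 255]]   # ink on the far left of the reverse side
print(add_bleed_through(clean_recto, clean_verso))  # [[255, 0, 153]]
```

Because the clean recto is known, a restoration method run on the degraded output can be scored against it pixel by pixel.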
1.2.4 Faded Ink or Faint Characters
There is a strong interest in digitizing official organizational papers for historical, public, and political purposes. However, typewritten documents present challenges for recognition. The intensity of each character can vary compared to the surrounding glyphs due to factors like the typewriter key’s striking head and the force applied during typing. Additionally, many typewritten documents exist only as carbon copies, resulting in blurry text due to the pressure required for clear imprints on both the original and carbon paper. Historical typewritten documents also face issues such as aging, tears, stains, rust, punch holes, disintegration, and discoloration. Figure 3 shows instances of scanned historical documents with faded ink degradation.
1.2.5 Smear or Show Through
After the digitization of documents, new challenges arise in the form of noise and low-resolution components that negatively affect the document’s visual appearance. Historical documents can suffer from various types of degradation, introduced over time and with different characteristics. One prominent issue is show-through, where ink impressions from one side of the paper appear on the other side, making the document difficult to read. Restoration techniques are necessary to make these documents easily readable. Removing show-through improves readability and reduces image compression time, allowing for faster downloading over the internet. Figure 8 shows an example of such degradation.
1.2.6 Blur
Two types of blur appear in document images: motion blur and out-of-focus blur. Motion blur is caused by the relative speed between the camera and the object or by a sudden, rapid camera movement. Out-of-focus blur, in contrast, occurs when the light fails to converge in the image. Recent research has turned towards tools that assess blur in document images in order to estimate OCR accuracy, so that the user can be given feedback and capture new images that yield better OCR outcomes. Figure 9 displays some instances of blur in degraded documents.
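One widely used heuristic for blur assessment is the variance of the Laplacian: sharp edges produce strong second-derivative responses, so a low variance suggests a blurred image. The sketch below is an illustration of that heuristic only and is not the assessment tool discussed in the literature above; the function names and the toy images are assumptions.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over the interior pixels.
    Higher values indicate sharper edges."""
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def box_blur_rows(image):
    """Blur each row with a 3-pixel mean filter, repeating the edge pixels."""
    blurred = []
    for row in image:
        padded = [row[0]] + row + [row[-1]]
        blurred.append(
            [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))]
        )
    return blurred


# A sharp vertical edge (dark stroke next to bright paper) and its blurred copy.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
blurred = box_blur_rows(sharp)

print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

In practice the score would be compared against a threshold chosen empirically for the capture device and document type, and the user would be asked to retake the image when the score falls below it.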
1.2.7 Thin or Weak Text
Historical documents often contain thin or weak text, typically written with ink or paint. Over time, the ink used in these documents may shrink and degrade, making the text difficult to read. Additionally, using low-quality ink and paper can contribute to the appearance of thin or weak text, posing challenges for accurate text extraction through binarization methods. Recent research in historical document image analysis has focused on addressing these challenges. Enhancement and binarization algorithms have been developed to improve the quality of thin or weak text in historical documents. Subsequent phases, such as skew detection, recognition, and page or line segmentation, have been created to process the binarized data. Figure 10 shows an example of thin or weak text.
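As one concrete example of a binarization algorithm, the sketch below implements Otsu's method, which picks the gray level that maximises the between-class variance of the histogram. It is shown for illustration and is not claimed to be the algorithm used in the works referred to above; the function name and the sample data are assumptions, and pixels are assumed to be integers in the range 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximises the between-class variance
    when pixels <= t form one class and pixels > t form the other."""
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        weight_bright = total - weight_dark
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        if weight_bright == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (total_sum - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Faint strokes: a few pixels at 150 on paper at 200.
faint_page = [150] * 5 + [200] * 45
t = otsu_threshold(faint_page)
print(sum(1 for p in faint_page if p <= t))  # 5 stroke pixels recovered
```

A fixed threshold of 128 would find no text at all on this sample, because even the ink is brighter than 128; a data-driven threshold adapts to the faint strokes.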
1.2.8 Deteriorated Documents
Original paper-based documents can encompass various media types (such as ink, graphite, and watercolor) and formats (such as rolled maps, spreadsheets, and record books). These documents hold significant importance due to their informational, evidential, associational, and intrinsic values. The evidential value of a document, particularly in historical, legal, or scientific contexts, relies on preserving the original condition of the media, substrate, format, and images without significant alterations or deterioration. However, documents can face deterioration, loss, and damage not only through actual use but also due to factors like poor storage, handling, environmental conditions, and inherent instability. Environmental factors, especially for inherently unstable documents, can cause severe damage and deterioration. Figure 11 shows an example of a deteriorated document.
1.3 Kurdish Language
Kurdish refers to a group of dialects spoken in a region that spans Iran, Iraq, Turkey, and Syria. Kurds have also resided in additional countries, including Armenia, Lebanon, Egypt, and others, for several centuries, and they have substantial diaspora communities in various European countries and North America (Hassani et al., 2016). The precise number of speakers remains uncertain, with varying reports suggesting a population between 19 million and 28 million (Hassani et al., 2016). Scholars often describe Kurdish as a dialect continuum, wherein intelligibility can vary across regions. Generally, Kurdish is recognized to encompass three primary dialects: Northern Kurdish (Kurmanji), Central Kurdish (Sorani), and Southern Kurdish (Ahmadi et al., 2022). Kurdish is written in four different scripts: modified Persian/Arabic, Latin, Yekgirtû (unified), and Cyrillic. The popularity and usage of these scripts vary depending on geographical and geopolitical factors (Hassani et al., 2016).
Sorani is commonly written using an adapted Persian/Arabic script with a cursive style, following a right-to-left (RTL) direction. See Figure 12 for the Arabic alphabet, Figure 13 for the Persian alphabet, and Figure 14 for the Kurdish alphabet in the Persian/Arabic script. Kurmanji, on the other hand, predominantly employs the Latin script, except in the Kurdistan Region of Iraq and the Kurdish areas of Syria, where the same script as Sorani is used (Idrees and Hassani, 2021).
Sorani primarily employs a modified Persian/Arabic script, while Zazaki mainly utilizes the Latin script. Gorani (Hawrami), on the other hand, is primarily written in a modified Persian/Arabic script. It is worth noting that the term “mainly” is used because there are significant exceptions in the usage of these scripts, particularly with regard to the Latin and modified Persian/Arabic scripts (Hassani et al., 2016).
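Because the same language may appear in several scripts, a processing pipeline may first need to know which script a given text uses. The following is a rough sketch that guesses the script from Unicode character names using Python's standard `unicodedata` module. The function name `dominant_script` and the sample strings are illustrative assumptions, and the approach identifies only the script family, not the language or dialect.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of `text` from the first word of each letter's Unicode name."""
    counts = Counter()
    for char in text:
        if char.isalpha():
            counts[unicodedata.name(char, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


print(dominant_script("کوردی"))  # ARABIC (Persian/Arabic-based script)
print(dominant_script("Kurdî"))  # LATIN
print(dominant_script("Курди"))  # CYRILLIC
```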
The rest of the paper is organized as follows. Section 2 reviews the literature on OCR for historical documents in different languages. Section 3 presents the method that the research follows. Section 4 provides the results and discusses the outcome. Finally, Section 5 concludes the study, summarizes the findings, and introduces opportunities for future work.
Authors:
(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq;
(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq.
This paper is