Table of Links
Abstract and 1. Introduction
1.1 Printing Press in Iraq and Iraqi Kurdistan
1.2 Challenges in Historical Documents
1.3 Kurdish Language
-
Related work and 2.1 Arabic/Persian
2.2 Chinese/Japanese and 2.3 Coptic
2.4 Greek
2.5 Latin
2.6 Tamizhi
-
Method and 3.1 Data Collection
3.2 Data Preparation and 3.3 Preprocessing
3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation
-
Experiments, Results, and Discussion and 4.1 Processed Data
4.2 Dataset and 4.3 Experiments
4.4 Results and Evaluation
4.5 Discussion
-
Conclusion
5.1 Challenges and Limitations
Online Resources, Acknowledgments, and References
This section reviews the literature, focusing on machine-typed historical documents. To the best of our knowledge, no OCR system currently exists that can accurately extract text from old Kurdish publications written in the Arabic-Persian script. Therefore, we concentrate on related work for other languages.
2.1 Arabic/Persian
According to Ozturk et al. (2000), implementing an Ottoman character recognition system is difficult, and studies in this field are insufficient. They therefore developed an artificial neural network model using 28 different machine-printed Ottoman documents, aiming at an OCR system that recognizes different fonts. Three Ottoman newspapers were used to prepare their data. For documents with a trained font, the accuracy was 95%, while for documents with an unknown font, it was 70%.
According to Ataer and Duygulu (2007), it may not be possible to obtain satisfactory results using character recognition-based systems due to the characteristics of Ottoman documents. Moreover, it is desirable to store the documents as images, since they may contain important drawings, especially signatures. For these reasons, the authors treated Ottoman words as images and proposed a matching technique. In their view, the bag-of-visterms approach had been shown to be successful in classifying objects and scenes, which is why they adopted the same approach for matching word images. Word images were represented as sets of visual terms obtained through vector quantization of Scale-Invariant Feature Transform (SIFT) descriptors, and similar words were then matched by comparing the distributions of these visual terms. The printed and handwritten documents used in the experiments contained over 10,000 words. The highest accuracy in the experiments was 91% and the lowest was 30%.
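As an illustration of this style of word-image matching, the sketch below quantizes SIFT descriptors into a visual-term vocabulary and compares the resulting distributions. The vocabulary size, libraries, and cosine similarity measure are our assumptions for the example, not details taken from Ataer and Duygulu (2007).

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    """Extract SIFT descriptors from a grayscale word image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_vocabulary(descriptor_sets, k=200):
    """Vector-quantize all descriptors into k visual terms (the 'visterm' vocabulary)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(descriptor_sets))

def visterm_distribution(desc, vocabulary):
    """Represent one word image by its normalized visual-term histogram."""
    hist = np.bincount(vocabulary.predict(desc), minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def match_score(h1, h2):
    """Score two word images by the cosine similarity of their visterm distributions."""
    return float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2) + 1e-12))
```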
Kilic et al. (2008) developed an OCR system specifically designed for Ottoman script segmentation, normalization, edge detection, and recognition. The Ottoman characters were categorized into four distinct forms based on their position within a word: beginning, middle, end, and isolated form. Images of printed papers containing Ottoman script were used for data acquisition. The process involved segmentation and normalization of the images, followed by edge detection using Cellular Neural Networks for feature extraction. Subsequently, a Support Vector Machine (SVM) was employed to accurately identify these multi-font Ottoman characters. The SVM training involved the utilization of Polynomial (linear and quadratic) and Gaussian Radial Basis Function kernels. The proposed recognition system achieved an impressive accuracy rate of 87.32 percent for character classification.
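For concreteness, the minimal scikit-learn sketch below mirrors the kernel comparison named above (linear and quadratic polynomial kernels and a Gaussian RBF kernel). It assumes pre-extracted edge-based feature vectors and hypothetical parameters; it is not the authors' implementation.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Kernels corresponding to those named above: linear and quadratic polynomials,
# plus a Gaussian radial basis function.
kernels = {
    "linear":    SVC(kernel="poly", degree=1),
    "quadratic": SVC(kernel="poly", degree=2),
    "gaussian":  SVC(kernel="rbf"),
}

# X: (n_samples, n_features) edge-based character features, y: character labels.
# for name, clf in kernels.items():
#     print(name, cross_val_score(clf, X, y, cv=5).mean())
```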
After reviewing the existing technology, Shafii (2014) proposed a new technique for two important preprocessing steps: skew detection and page segmentation. Instead of the usual practice of segmenting characters, they suggested segmenting subwords to avoid the segmentation challenges posed by the highly cursive nature of the Persian script. Features were extracted using a hybrid scheme that combines three commonly used methods and were then classified with a nonparametric method. In experimental tests on a library of 500 words, they recognized 97% of the words.
The Arabic heritage collection, which consists of early prints and manuscripts, poses challenges that make it difficult to extract text from its documents. To address these problems, Stahlberg and Vogel (2016) developed a system called QATIP (QCRI Qatar Computing Research Institute Arabic Text Image Processing) to OCR this kind of document. A sophisticated text image binarization technique was used in conjunction with Kaldi, which was originally designed for speech recognition. The paper contributed in two major areas: the creation of a graphical user interface for users as well as API endpoints for integration, and new approaches to language and ligature modeling. After testing the system, they found that the newly proposed language and ligature modeling techniques were highly successful. For early books, the system achieved a word error rate (WER) of 37.5% and a character error rate (CER) of 12.6%.
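Since WER and CER figures recur throughout this review, the following sketch shows how these metrics are conventionally computed from edit distance. It is a generic formulation, not QATIP's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)
```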
In order to recognize Ottoman-Turkish characters, Doğru (2016) used the Tesseract optical character recognition system. In addition, various transcription methods from Ottoman Turkish to Latin were developed. Since optical character recognition could not recognize certain Ottoman-Turkish characters, Ottoman-Turkish keyboards were developed to facilitate writing the unrecognized characters with Ottoman-Turkish alphabets. Dictionary tables were used for the transcription process, and the success rate of transcription increased when enrichment data was included in these tables; an application was therefore developed to enrich the dictionary tables with data. The recognition rates for the first two pages of an Ottoman book were between 75.88% and 77.38%. Based on the experimental results, the author concludes that recognition rates can vary with quality, style, and whether the documents or images are printed or handwritten. High-quality, printed images can be recognized with 100% accuracy, while handwritten and low-quality documents or images cannot be recognized by optical character recognition; such documents or images therefore need to be rewritten in Ottoman-Turkish.
Analytical approaches to cursive scripts such as Arabic can be very challenging, especially for segmentation, because of the frequent overlap between characters. For that reason, Nashwan et al. (2017) proposed a segmentation-free holistic approach to address this issue. Since the holistic approach deals with the entire word as a single unit, it improves the error rate for cursive scripts; on the other hand, it is computationally complex, especially when the application has a huge vocabulary. In their view, their holistic approach, based on Discrete Cosine Transform (DCT) and local block features, is computationally efficient. In addition, they developed a method for reducing the lexicon size by clustering words that have similar shapes. The proposed system was tested on a wide range of datasets and achieved a word recognition rate (WRR) of 47.8%, which increased to 65.7% when the top-10 hypotheses were considered.
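A minimal sketch of a DCT-based holistic word descriptor is given below for illustration; the image size, the number of retained coefficients, and the libraries are assumptions rather than the configuration used by Nashwan et al. (2017).

```python
import cv2
import numpy as np
from scipy.fft import dctn

def holistic_dct_descriptor(image_path, size=(64, 256), keep=(8, 32)):
    """Describe an entire word image by its low-frequency 2D DCT coefficients."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (size[1], size[0])).astype(np.float32) / 255.0
    coeffs = dctn(img, norm="ortho")
    # Keep only the top-left (low-frequency) block as a fixed-length word descriptor.
    return coeffs[: keep[0], : keep[1]].ravel()
```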
By employing deep convolutional neural networks, Küçükşahin (2019) devised an offline OCR system that demonstrates the ability to recognize Ottoman characters. The proposed methodology encompasses multiple stages, including image processing, image digitization, character segmentation, adaptation of inputs for the network, network training, recognition, and evaluation of outcomes. To create a character dataset, text images of varying lengths were segmented from diverse samples of Ottoman literature obtained from the Turkish National Library’s digital repository. Two convolutional neural networks of differing complexity were trained using the generated character dataset, and the correlation between recognition rates and network complexity was examined. The dataset’s features were extracted through the Histogram of Oriented Gradients and Principal Component Analysis techniques, while classification of Ottoman characters was achieved using the widely employed k-Nearest Neighbor Algorithm and Support Vector Machines. Results from the conducted analyses revealed that both networks exhibit recognition rates comparable to traditional classifiers; however, the more intricate deep neural network outperformed others in terms of accuracy and loss. After 100 epochs, the most accurate model achieved an impressive accuracy of 97.58 percent.
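The classical baseline described above (Histogram of Oriented Gradients features, PCA reduction, and a k-Nearest Neighbor classifier) can be approximated with off-the-shelf tools as in the following sketch; the parameter values are hypothetical and not taken from Küçükşahin (2019).

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def hog_features(images):
    """images: array-like of equally sized grayscale character crops of shape (H, W)."""
    return np.array([hog(im, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for im in images])

classifier = make_pipeline(
    FunctionTransformer(hog_features),    # Histogram of Oriented Gradients features
    PCA(n_components=50),                 # Principal Component Analysis reduction
    KNeighborsClassifier(n_neighbors=3),  # k-Nearest Neighbor character classifier
)
# classifier.fit(train_images, train_labels)
# predictions = classifier.predict(test_images)
```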
Dolek and Kurt (2021) presented an OCR tool for printed Ottoman documents in the Naskh font, built on a deep learning model trained with datasets containing both original and synthetic documents. The model was compared with free and open-source OCR engines on a test dataset comprising 21 pages of original documents. Their model outperformed the other tools with accuracy rates of 88.64% (raw), 95.92% (normalized), and 97.18% (joined). Additionally, their model achieved a word recognition accuracy of 58 percent, the only rate above 50 percent among the OCR tools compared.
Authors:
(1) Blnd Yaseen, University of Kurdistan Howler, Kurdistan Region – Iraq ([email protected]);
(2) Hossein Hassani, University of Kurdistan Howler, Kurdistan Region – Iraq ([email protected]).
This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-NODERIVS 4.0 INTERNATIONAL license.