Advances In OCR For Historical Chinese, Japanese, Coptic, And Greek Texts

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

Related work and 2.1 Arabic/Persian

2.2 Chinese/Japanese and 2.3 Coptic

2.4 Greek

2.5 Latin

2.6 Tamizhi

Method and 3.1 Data Collection

3.2 Data Preparation and 3.3 Preprocessing

3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

Experiments, Results, and Discussion and 4.1 Processed Data

4.2 Dataset and 4.3 Experiments

4.4 Results and Evaluation

4.5 Discussion

Conclusion

5.1 Challenges and Limitations

Online Resources, Acknowledgments, and References

2.2 Chinese/Japanese

Recognizing historical Chinese characters has long been one of the hardest problems in pattern recognition, owing to the large character set and the variety of writing styles. To address this, Li et al. (2014) proposed recognizing historical Chinese characters by incorporating Style Transfer Mapping (STM) into a Modified Quadratic Discriminant Function (MQDF) classifier. The experiments used historical documents from Dunhuang and traditional Chinese fonts, and the parameters were chosen after testing many different settings. Two sets of experiments were conducted: the first used printed traditional Chinese characters, and the second used samples taken from historical Chinese documents. The system was also tested with a variety of features and classifiers. The results suggest that supervised STM may improve the generalization of classifiers: the error rate fell considerably, and for one of the tested documents it could be reduced by 60% by labeling only 10% of the samples. The method may be improved further by introducing nonlinear transfer and integrating it with other approaches.

The scarcity of labeled training samples makes recognition of historical Chinese characters very challenging. Feng et al. (2015) therefore proposed GP-STM, a nonlinear Style Transfer Mapping (STM) model based on Gaussian Processes that extends the traditional linear STM model. With GP-STM, existing printed samples of Chinese characters are used to recognize historical Chinese characters. To build the GP-STM framework, the researchers compared several feature extraction methods, trained a Modified Quadratic Discriminant Function (MQDF) classifier on printed Chinese character samples, and then applied the model to historical documents from Dunhuang. They measured the impact of different kernels and parameters as well as of the number of training samples. In the experiments, GP-STM achieved an accuracy of 57.5%, an improvement of more than 8% over STM.
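For orientation, the linear STM that both studies build on can be sketched as a regularized affine map. The block below is a minimal illustration, not the implementation of Li et al. or Feng et al.: it assumes one common formulation in which features of a few labeled historical samples (`source`) are mapped towards the corresponding class prototypes of the classifier trained on printed characters (`target`), with the transform pulled towards the identity. The function names and the `beta` and `gamma` regularization weights are hypothetical.

```python
import numpy as np


def fit_linear_stm(source, target, beta=1.0, gamma=1.0):
    """Fit an affine map x -> A @ x + b (illustrative linear STM).

    Minimizes sum_i ||A s_i + b - t_i||^2 + beta * ||A - I||_F^2 + gamma * ||b||^2.
    source, target: arrays of shape (n_samples, n_features).
    beta, gamma: hypothetical regularization weights that keep the map
    close to "no style change" when few labeled samples are available.
    """
    n, d = source.shape
    X = np.hstack([source, np.ones((n, 1))])        # augment with a bias column
    R = np.diag([beta] * d + [gamma])                # per-row regularization
    W0 = np.vstack([np.eye(d), np.zeros((1, d))])    # prior: A = I, b = 0
    # Normal equations of the regularized least-squares problem.
    W = np.linalg.solve(X.T @ X + R, X.T @ target + R @ W0)
    A = W[:d].T
    b = W[d]
    return A, b


def apply_stm(features, A, b):
    """Map features of shape (n_samples, n_features) into the classifier's style."""
    return features @ A.T + b
```

After fitting, the mapped features would be passed to the unchanged classifier. With large `beta` and `gamma` the map stays near the identity, which is the safe behavior when only a handful of historical samples are labeled.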

Classical methods struggle to recognize Chinese characters in historical documents directly, because the characters fall into more than 8,000 categories. Deep learning based methods, in turn, are hampered by the lack of well-labeled data. Yang et al. (2018) presented a historical Chinese text recognition algorithm trained on data labeled only at the page level, without aligning each line of text. To reduce the influence of misalignment between text line images and labels, they proposed the Adaptive Gradient Gate (AGG), which reduced the error rate of the proposed text recognizer by over 35 percent. They also found that establishing an implicit language model using Convolutional Neural Networks (CNNs) and Connectionist Temporal Classification (CTC) is one of the key factors in achieving high recognition performance. With an accuracy of 94.64%, the proposed system outperformed other optical character recognition systems.
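To make the role of CTC concrete, the sketch below shows standard best-path (greedy) CTC decoding: take the most likely class per frame, collapse consecutive repeats, then drop the blank symbol. It does not reproduce AGG or any other detail of Yang et al.; the function name, the toy alphabet, and the convention that index 0 is the blank are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding.

    frame_scores: one list of per-class scores for each frame of the text line.
    alphabet: list mapping class index to character; alphabet[blank] is unused.
    """
    # 1. Most likely class for every frame.
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # 2. Collapse runs of the same class into a single symbol.
    collapsed = [k for k, _ in groupby(best)]
    # 3. Remove blanks; a blank between two equal symbols keeps them distinct.
    return "".join(alphabet[k] for k in collapsed if k != blank)


scores = [
    [0.1, 0.8, 0.1],    # class 1
    [0.1, 0.8, 0.1],    # class 1 again (merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # class 1 after a blank: a second, separate character
    [0.1, 0.2, 0.7],    # class 2
    [0.1, 0.1, 0.8],    # class 2 again (merged)
    [0.8, 0.1, 0.1],    # blank
]
decoded = ctc_greedy_decode(scores, ["<blank>", "天", "下"])
```

Because CTC sums over all frame-level alignments during training, it does not need character positions within a line, which fits the weakly aligned, page-level labels described above.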

Deep reinforcement learning has been applied successfully in various fields. Sihang et al. (2020) presented an approach based on deep reinforcement learning to improve the F-measure of Chinese character detection in historical documents. Their method introduced a fully convolutional network with position-sensitive Region-of-Interest (RoI) pooling (FCPN). Rather than requiring fixed-size character patches, this network accommodates patches of varying sizes and incorporates positional information into the action features. They also proposed a Dense Reward Function (DRF) that rewards different actions according to environmental conditions, thereby improving the decision-making capability of the agent. The method is designed to be applied to the output of character-level or word-level text detectors to make it more precise. Its effectiveness was demonstrated on the Tripitaka Koreana in Han (TKH) and Multiple Tripitaka in Han (MTH) datasets, where a notable improvement was observed, achieving an Intersection over Union (IoU) of 0.8.
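For readers unfamiliar with the metric, IoU compares a predicted character box with the ground-truth box. The sketch below uses the standard definition for axis-aligned boxes; the `(x1, y1, x2, y2)` box layout and the function name are assumptions, not taken from Sihang et al.

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted box that covers 80% of a 10x10 ground-truth box and nothing else.
example = box_iou((0, 0, 10, 10), (0, 0, 10, 8))
```

An IoU of 0.8 is a strict criterion: in this example the predicted box may miss only a fifth of the character's area, which is why refining detector output matters for tightly packed historical text.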

Ly et al. (2020) introduced ARCED, an attention-based row-column encoder-decoder model for recognizing multiple text lines in images without explicit line segmentation. The recognition system comprises three key components: a feature extractor, a row-column encoder, and a decoder. By adopting an attention-based seq2seq approach, the model achieves significantly lower error rates than previous state-of-the-art methods for both single and multiple text line recognition. The encoder uses a row-column Bidirectional Long Short-Term Memory (BLSTM) network, which captures sequential order information in both the horizontal and vertical dimensions and further reduces error rates within the attention-based model. In the decoder, a residual LSTM network uses all prior attention vectors to generate predictive distributions, which improves accuracy. The entire system is trained with a cross-entropy loss function, using only document images and ground-truth text. ARCED was evaluated on the Kana-PRMU dataset of Japanese historical documents, where it outperformed existing recognition methods: on the level 2 and level 3 subsets, it achieved character error rates of 4.15% and 12.69% respectively. Future work aims to extend ARCED to recognizing entire Japanese document pages, and incorporating a language model is expected to improve its performance further.
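The character error rates quoted above are usually computed as the edit distance between the recognized text and the ground truth, divided by the ground-truth length. The sketch below follows that common definition; whether Ly et al. normalize in exactly this way is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete r from the reference
                cur[j - 1] + 1,          # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of characters in the reference text."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For instance, `character_error_rate("abcd", "abed")` gives 0.25, since one of four reference characters is wrong. Note that CER can exceed 1.0 when the recognizer inserts many spurious characters.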

2.3 Coptic

According to Bulert et al. (2017), OCR results on historical texts may be unsatisfactory because of non-standard fonts and varying paper and font quality. Moreover, historical texts are rarely transmitted in their entirety and instead contain gaps and fragments, which makes automatic post-correction more difficult than for modern texts. To recognize printed Coptic texts, two tools were used to create recognition patterns (or models) specific to different languages and documents. Coptic is the last stage in the development of the pre-Arabic language indigenous to Egypt. It produced a rich and unique body of literature, including monastic texts, Gnostic texts, Manichaean texts, magical texts, and translations of the Bible and patristic texts. According to the researchers, Coptic texts are excellent candidates for computer-based reading: the number of characters is limited and most handwritten texts exhibit highly consistent forms, so the characters can be identified easily.

2.4 Greek

Simistira et al. (2015) investigated the performance of Long Short-Term Memory (LSTM) networks for OCR of Greek polytonic script. Although many Greek polytonic manuscripts exist, such documents have not been widely digitized, and very little work has been done on recognizing these scripts. The authors collected many diverse Greek polytonic script pages into a new database, Polyton-DB, containing 15,689 text lines of synthetic and authentic printed scripts, and conducted baseline experiments with LSTM networks. The LSTM networks reached error rates between 5.51 and 14.68 percent, depending on the document, and performed better than Tesseract and ABBYY FineReader, two well-known OCR engines.

Traditional character recognition techniques cannot recognize Greek characters in early printed Greek books, because the way the same or consecutive words are written does not permit character or word segmentation. To address this issue, Poulos et al. (2010) developed an OCR system combining image preprocessing with computational geometry. Their objective was to perform OCR on a large collection of digitized Greek early printed books dating from the late 15th century to the mid-18th century. The method comprises image binarization and enhancement, the construction of a convex polygon that represents the extracted features of each font, and training and identification procedures based on algorithms for intersecting convex polygons. A major advantage of the method is that it can verify, with a high degree of reliability, whether an image of a published document is authentic or has been partially modified. The proposed system thus uses geometric techniques to classify a candidate letter. According to the experimental results, the proposed method yields positive and negative verification scores that are more than 92% correct.
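The geometric core of such an approach, intersecting two convex polygons, can be sketched with the classic Sutherland-Hodgman clipping algorithm. This is an illustration of the general technique only: the source does not say which intersection algorithm Poulos et al. used, and the `overlap_score` rule at the end is a hypothetical similarity measure introduced here. Polygons are assumed to be convex and given as counter-clockwise lists of `(x, y)` vertices.

```python
def polygon_area(poly):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def _inside(p, a, b):
    # True if p lies on or to the left of the directed edge a -> b.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0


def _line_intersection(p, q, a, b):
    # Point where segment p-q crosses the infinite line through a and b.
    dx1, dy1 = q[0] - p[0], q[1] - p[1]
    dx2, dy2 = b[0] - a[0], b[1] - a[1]
    denom = dx1 * dy2 - dy1 * dx2
    t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
    return (p[0] + t * dx1, p[1] + t * dy1)


def intersect_convex(subject, clip):
    """Sutherland-Hodgman: clip `subject` against every edge of convex `clip`."""
    output = list(subject)
    for a, b in zip(clip, clip[1:] + clip[:1]):
        if not output:
            break  # nothing left: the polygons do not overlap
        current, output = output, []
        prev = current[-1]
        for cur in current:
            if _inside(cur, a, b):
                if not _inside(prev, a, b):
                    output.append(_line_intersection(prev, cur, a, b))
                output.append(cur)
            elif _inside(prev, a, b):
                output.append(_line_intersection(prev, cur, a, b))
            prev = cur
    return output


def overlap_score(candidate, template):
    """Hypothetical similarity: shared area relative to the smaller polygon."""
    shared = intersect_convex(candidate, template)
    if len(shared) < 3:
        return 0.0
    return polygon_area(shared) / min(polygon_area(candidate), polygon_area(template))
```

In a classifier built on this idea, a candidate letter's polygon would be compared against one template polygon per trained glyph, and the best-scoring template would be reported.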

Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]).

This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license.
