
Can AI Finally Crack Ottoman Text Recognition? | HackerNoon

News Room
Last updated: 2025/08/18 at 10:00 PM
News Room Published 18 August 2025

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related Work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

This section reviews the literature on machine-typed historical documents. To the best of our knowledge, no OCR system can currently extract text accurately from old Kurdish publications written in Arabic-Persian script. We therefore concentrate on related work for other languages.

2.1 Arabic/Persian

According to Ozturk et al. (2000), implementing an Ottoman character recognition system is difficult, and studies in this field are insufficient. They therefore trained an artificial neural network on 28 different Ottoman machine-printed documents, with the aim of building an OCR system that recognizes different fonts. Their data were prepared from three Ottoman newspapers. Accuracy was 95% for documents in a font seen during training and 70% for documents in an unknown font.

According to Ataer and Duygulu (2007), character recognition-based systems may not give satisfactory results because of the characteristics of Ottoman documents. Moreover, it is desirable to store the documents as images, since they may contain important drawings, especially signatures. For these reasons, the authors treated Ottoman words as images and proposed a matching technique. Because the bag-of-visterms approach had been shown to be successful in classifying objects and scenes, they adopted the same approach for matching word images. Word images were represented by sets of visual terms obtained through vector quantization of Scale-Invariant Feature Transform (SIFT) descriptors, and similar words were then matched by comparing the distributions of their visual terms. The experiments used printed and handwritten documents containing over 10,000 words. The highest accuracy was 91% and the lowest was 30%.
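The bag-of-visterms matching described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D toy descriptors stand in for real 128-dimensional SIFT descriptors, the three-entry codebook is invented, and the L1 distance is an assumed choice for comparing distributions. All function names are hypothetical.

```python
import math
from collections import Counter


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook vector closest to the descriptor."""
    return min(
        range(len(codebook)),
        key=lambda i: math.dist(descriptor, codebook[i]),
    )


def visterm_histogram(descriptors, codebook):
    """Vector-quantize descriptors and return a normalized visual-term histogram."""
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    total = len(descriptors)
    return [counts[i] / total for i in range(len(codebook))]


def histogram_distance(h1, h2):
    """L1 distance between two histograms (0 means identical distributions)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Toy codebook and descriptors (assumptions for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
word_a = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.0, 0.9)]
word_b = [(0.0, 0.1), (1.1, 1.0), (0.9, 0.9), (0.1, 1.0)]
word_c = [(0.0, 0.0), (0.1, 0.1), (0.0, 0.2), (0.1, 0.9)]

hist_a = visterm_histogram(word_a, codebook)
hist_b = visterm_histogram(word_b, codebook)
hist_c = visterm_histogram(word_c, codebook)

# word_a and word_b share the same visual-term distribution; word_c does not.
print(histogram_distance(hist_a, hist_b), histogram_distance(hist_a, hist_c))
```

In a real system the codebook would typically be learned by clustering descriptors from many word images, and a query word would be matched to the stored word images with the smallest histogram distance.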

Kilic et al. (2008) developed an OCR system for Ottoman script that covers segmentation, normalization, edge detection, and recognition. Ottoman characters were categorized into four forms according to their position within a word: beginning, middle, end, and isolated. Images of printed pages containing Ottoman script were used for data acquisition. The images were segmented and normalized, and edges were then detected with Cellular Neural Networks for feature extraction. A Support Vector Machine (SVM) was then used to identify the multi-font Ottoman characters. The SVM was trained with polynomial (linear and quadratic) and Gaussian Radial Basis Function kernels. The proposed system achieved an accuracy of 87.32% for character classification.
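The two kernel families named above have standard textbook forms, sketched below. The parameter values (`coef0`, `gamma`) are assumptions chosen for illustration; the source does not report the settings used by Kilic et al.

```python
import math


def polynomial_kernel(x, y, degree, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree.

    degree=1 gives a linear kernel, degree=2 a quadratic one.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two toy feature vectors (hypothetical edge features of two characters).
u = [1.0, 2.0]
v = [3.0, 4.0]

linear_value = polynomial_kernel(u, v, degree=1, coef0=0.0)
quadratic_value = polynomial_kernel(u, v, degree=2, coef0=1.0)
rbf_value = rbf_kernel(u, v)
print(linear_value, quadratic_value, rbf_value)
```

The RBF kernel equals 1 for identical vectors and decays toward 0 as they move apart, which is why it can separate character classes that are not linearly separable in the original feature space.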

After reviewing existing techniques, Shafii (2014) proposed new techniques for two important preprocessing steps, skew detection and page segmentation. Instead of the usual practice of segmenting characters, the author suggested segmenting subwords, which avoids the segmentation difficulties caused by the highly cursive nature of Persian script. Features were extracted with a hybrid scheme that combines three commonly used methods and were then classified with a nonparametric method. In experiments on a library of 500 words, the system recognized 97% of the words.

Extracting text from the Arabic heritage collection, which consists of early prints and manuscripts, is difficult. To address this problem, Stahlberg and Vogel (2016) developed QATIP (QCRI Arabic Text Image Processing, where QCRI is the Qatar Computing Research Institute), a system for running OCR on these kinds of documents. A sophisticated text-to-image binarization technique was used in conjunction with Kaldi, a toolkit originally designed for speech recognition. The paper made two main contributions: a graphical user interface for users together with API endpoints for integration, and new approaches to language modeling and ligature modeling. In testing, the newly proposed language and ligature modeling techniques proved highly successful. For early books, the system achieved a word error rate (WER) of 37.5% and a character error rate (CER) of 12.6%.
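WER and CER are error rates, so lower values are better. The sketch below shows how such rates are commonly computed, assuming the standard definition of edit distance divided by reference length. The source does not describe QATIP's exact scoring procedure, and the sample strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "the old book"
ocr_output = "the odd book"
print(f"CER = {cer(reference, ocr_output):.3f}, WER = {wer(reference, ocr_output):.3f}")
```

A single wrong character costs one character edit but a whole word, which is why WER is usually much higher than CER on the same output.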

To recognize Ottoman-Turkish characters, Doğru (2016) used the Tesseract optical character recognition system. In addition, various methods for transcription from Ottoman Turkish to Latin script were developed. Because OCR could not recognize certain Ottoman-Turkish characters, Ottoman-Turkish keyboards were developed so that the unrecognized characters could be typed using the Ottoman-Turkish alphabet. Dictionary tables were used for the transcription process, and the success rate of transcription increased when enrichment data was added to these tables. An application was therefore developed to enrich the dictionary tables with data. Recognition rates for the first two pages of an Ottoman book ranged from 75.88% to 77.38%. Based on these experiments, the author concludes that recognition rates vary with image quality, style, and whether the document is printed or handwritten. High-quality printed images can be recognized with 100% accuracy, while handwritten and low-quality documents or images cannot be recognized by OCR. Such documents therefore have to be rewritten in Ottoman-Turkish.

Analytical approaches can be very challenging for cursive scripts such as Arabic, especially at the segmentation stage, because characters frequently overlap. Nashwan et al. (2017) therefore proposed a segmentation-based holistic approach. Because a holistic approach treats the entire word as a single unit, it reduces the error rate for cursive scripts. On the other hand, it increases computational complexity, especially when the application has a large vocabulary. The authors argue that their holistic approach, based on Discrete Cosine Transforms (DCTs) and local block features, is computationally efficient. In addition, they developed a method for shortening the lexicon by clustering words that have similar shapes. The proposed system was tested on a wide range of datasets and achieved a word recognition rate (WRR) of 47.8%, which rose to 65.7% when the top-10 hypotheses were considered.
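The gap between the top-1 and top-10 figures follows from how a top-k word recognition rate is counted: a word is scored as correct if the reference appears anywhere among the first k ranked hypotheses. A minimal sketch, assuming that definition and using invented transliterated words and hypothesis lists:

```python
def top_k_wrr(references, ranked_hypotheses, k):
    """Fraction of words whose reference is among the top-k ranked hypotheses."""
    hits = sum(
        1
        for ref, hyps in zip(references, ranked_hypotheses)
        if ref in hyps[:k]
    )
    return hits / len(references)


# Hypothetical recognizer output: each list is ranked best-first.
references = ["kitab", "qalam", "bab", "dar"]
ranked_hypotheses = [
    ["kitab", "katib"],
    ["alam", "qalam"],
    ["nab", "tab"],
    ["dar", "zar"],
]

top1 = top_k_wrr(references, ranked_hypotheses, k=1)
top2 = top_k_wrr(references, ranked_hypotheses, k=2)
print(top1, top2)
```

Increasing k can only keep the rate the same or raise it, so a top-10 rate is always at least as high as the top-1 rate on the same data.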

Küçükşahin (2019) built an offline OCR system based on deep convolutional neural networks that can recognize Ottoman characters. The methodology comprises several stages: image processing, image digitization, character segmentation, adaptation of inputs for the network, network training, recognition, and evaluation of the results. To create a character dataset, text images of varying lengths were segmented from diverse samples of Ottoman literature obtained from the Turkish National Library's digital repository. Two convolutional neural networks of differing complexity were trained on this dataset, and the relationship between recognition rate and network complexity was examined. For comparison, features were extracted with Histogram of Oriented Gradients and Principal Component Analysis, and Ottoman characters were classified with the widely used k-Nearest Neighbor algorithm and Support Vector Machines. Both networks achieved recognition rates comparable to these traditional classifiers, and the more complex network outperformed the others in terms of accuracy and loss. After 100 epochs, the most accurate model reached an accuracy of 97.58%.
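Of the traditional baselines mentioned above, k-Nearest Neighbor is the simplest to illustrate: a character is assigned the majority label among its k closest training examples in feature space. The sketch below assumes Euclidean distance and uses invented 2-D feature vectors and labels; in the study, the features would be HOG descriptors reduced with PCA.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (assumption): two character classes in a 2-D feature space.
train_features = [
    (0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
    (1.0, 1.0), (0.9, 1.1), (1.1, 0.9),
]
train_labels = ["alef", "alef", "alef", "beh", "beh", "beh"]

print(knn_predict(train_features, train_labels, (0.1, 0.1)))
print(knn_predict(train_features, train_labels, (1.0, 0.9)))
```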

Dolek and Kurt (2021) presented an OCR tool for printed Ottoman documents in Naskh font. The tool uses a deep learning model trained on datasets containing both original and synthetic documents. The model was compared with free and open-source OCR engines on a test dataset of 21 pages of original documents. It outperformed the other tools, with accuracy rates of 88.64% (raw), 95.92% (normalized), and 97.18% (joined). Its word recognition accuracy was 58%, the only rate above 50% among the OCR tools compared.

Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]).


This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license.
