World of Software

Can AI Save Centuries of Kurdish History? | HackerNoon

News Room
Last updated: 2025/08/18 at 11:13 PM
News Room Published 18 August 2025

Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]).

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

Abstract

Kurdish libraries hold many historical publications that were printed in the early days after printing devices were brought to Kurdistan. A good Optical Character Recognition (OCR) system that can process these publications would contribute to the Kurdish language’s resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents are damaged, very fragile, covered in marks, and often written in non-standard fonts, among other issues. This is a major obstacle, as processing these documents currently requires manual typing, which is very time-consuming. In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text for various languages. Since no public dataset currently exists, we developed our own by collecting historical documents printed before 1950 from the Zheen Center for Documentation and Research, which resulted in a dataset of 1233 line images, each with its transcription. We then used the Arabic model as our base model and trained it on this dataset. We evaluated the model with different methods: Tesseract’s built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%, and Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end-users, allowing them to interact with the model by inputting an image of a page and extracting the text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy; because no public datasets are available for historical Kurdish documents, this posed a significant challenge in our work. The unaligned spaces between characters and words proved to be another challenge.
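The two metrics quoted in the abstract, Character Error Rate and character accuracy, are commonly defined through the edit distance between a reference transcription and the OCR output. The following is a minimal Python sketch of that textbook definition only. It is not the implementation used by lstmeval or Ocreval, whose exact counting rules may differ, and the function names are hypothetical.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # character missing from the OCR output
                curr[j - 1] + 1,     # extra character in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy under the assumption accuracy = 1 - CER."""
    return 1.0 - character_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the paper's dataset.
    truth = "kurdistan"
    ocr_output = "kurdstan"
    print(f"CER: {character_error_rate(truth, ocr_output):.2%}")
    print(f"Accuracy: {character_accuracy(truth, ocr_output):.2%}")
```

Both functions work on Unicode code points, so they apply to Kurdish text in Arabic script as well as to Latin transliterations. In practice the values are usually computed over a whole evaluation set rather than a single line.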

1 Introduction

Over the course of centuries, human experience has produced invaluable treasures in the form of historical documents. Due to the large amount of work required for manual annotation and transcription, many archives of historical documents remain inaccessible (Ataer and Duygulu, 2007). Through digitization, these documents can be understood and protected efficiently and effectively. In this process, physical documents are systematically converted into digital records based on the precise recognition of characters in the original document (Yang et al., 2018). Because of the demand for maintaining historical documents and making them available for research without damaging the physical copies, many languages and regions have started practicing and studying the digitization and preservation of digital reproductions of historical documents (Nguyen et al., 2017). According to Poncelas et al. (2020), building an Optical Character Recognition (OCR) system that recognizes and extracts text from historical documents is a challenging task, and a unique set of issues can affect the result of the model. Typeface inconsistency and poor-quality images are two examples of these challenges. Figure 1 shows a sample page with these challenges. As a result of these issues, most advanced OCR systems produce errors, which is why researchers continue their efforts to find new methods that enhance OCR engines and generate better output.

Figure 1: A sample page from the book titled ‘Deste Gullî Lawane’, published in 1939 (Zheen Center for Documentation and Research).

Initially, historical documents were painstakingly created by hand, leading to their restricted availability and limited distribution. However, the introduction of the printing press by Johannes Gutenberg in 1436 in Germany marked a significant milestone. The printing press, a mechanical device designed for printing high-volume publications, revolutionized the production of historical documents. This apparatus applies pressure on an inked surface, as depicted in Figure ??. The printing press is widely recognized as one of the most remarkable accomplishments in history, facilitating the widespread dissemination and preservation of knowledge (Qania, 2012).

The history of the Kurdish press is about one century old, and the devices used for printing were very different from those we have today. The devices underwent many changes and improvements before reaching their current form.

Publications printed with the printing press have various issues: there is no standard font, many Arabic styles are used, and, on top of that, the books are in poor condition, as they are very fragile and damaged and carry many noticeable marks.

A few OCR systems currently support the Kurdish language, for example, the one by Idrees and Hassani (2021). Still, they cannot recognize these old publications due to the issues mentioned above. For old publications, some work has been done for other languages, which we go through in the related work section.

This study focuses on enhancing an existing OCR system for the Kurdish language so that it can recognize and extract text from historical Kurdish documents, making those documents ready for further processing.

This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.
