Copyright © All Rights Reserved. World of Software.

Building OCR Systems for Tamizhi and Kurdish Historical Documents | HackerNoon

News Room
Last updated: 2025/08/20 at 12:28 AM
News Room Published 20 August 2025

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related Work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

2.6 Tamizhi

According to Munivel and Enigo (2022), digitizing ancient documents typically involves OCR. OCR for Tamizhi documents is difficult, however, because many characters closely resemble one another in shape and structure and differ only in subtle ways. The Tamizhi script, also known as Tamil-Brahmi, is recognized as one of the oldest scripts in India and is the precursor to numerous modern Indian scripts. The abundance of combined characters adds to the difficulty: a character can consist of a single vowel, a single consonant, or a combination of both. The authors describe an OCR system designed specifically for printed Tamizhi documents. The system aims to perform well despite poor document quality, noise, and diverse input formats. The authors report an accuracy of 91.12 percent on printed text, a promising result for recognizing Tamizhi characters.

To summarize, at the time of publishing this research, the literature does not report any effort to develop OCR specifically for historical Kurdish documents. In addition, no accessible dataset is currently available for training OCR systems designed to extract text from historical Kurdish documents. This significantly restricts our options in selecting the most suitable approach for our study.

To develop OCR systems tailored to historical documents, researchers have employed techniques such as SVM, LSTM, and CNN. The reported results vary, reaching a maximum of 99.7% CLA. This variability can be attributed to several factors: the quality of the dataset used, the methodology employed in developing the OCR system, and the intrinsic complexity of the documents being processed.
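Accuracy figures such as those above are only comparable when they are computed the same way. As a minimal sketch, assuming accuracy is defined as one minus the character-level edit distance divided by the length of the ground-truth text (the reviewed studies may define their metrics differently), a character accuracy score could be computed as follows. The function names `edit_distance` and `character_accuracy` are illustrative and do not come from any of the cited works.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

Under this assumed definition, an OCR output with one wrong character in a four-character ground truth scores 0.75. Scores from studies that count errors at the word or line level would not be directly comparable.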

The studies reviewed in this chapter used both proprietary datasets created by the researchers themselves and publicly available datasets. These datasets include TWDB, HWDB, GT4HistOCR, Stockholm Archive, Dunhuang data, Tripitaka, TKH, MTH, and Kana-PRMU. The literature also shows ongoing efforts to improve OCR techniques for different kinds of historical documents.

Based on our review, we identified LSTM as a widely adopted approach for developing OCR systems with acceptable accuracy. We therefore used the latest version of Tesseract, which integrates LSTM functionality, with the aim of achieving the best possible performance in our project. We also found pre-trained models that can be fine-tuned on our dataset. Given the similarities between the Kurdish and Arabic scripts, we chose an Arabic pre-trained model as our base model.
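To make the fine-tuning idea concrete, the sketch below assembles the two command lines that Tesseract's training tools generally use for this kind of workflow: `combine_tessdata -e` to extract the LSTM weights from a base model, and `lstmtraining --continue_from` to resume training from them. This is an illustration under assumptions, not the authors' actual procedure. The helper name `build_finetune_commands`, the file paths, the output prefix `kurdish_hist`, and the iteration count are all hypothetical. The function only builds the argument lists and does not run anything.

```python
from pathlib import Path
from typing import List


def build_finetune_commands(base_model: Path, work_dir: Path,
                            train_list: Path,
                            max_iterations: int = 1000) -> List[List[str]]:
    """Build (but do not run) commands for fine-tuning a Tesseract LSTM model.

    All paths and the 'kurdish_hist' prefix are hypothetical placeholders.
    """
    lstm_weights = work_dir / (base_model.stem + ".lstm")
    checkpoint_prefix = work_dir / "kurdish_hist"

    # Step 1: extract the LSTM component from the base .traineddata file.
    extract = ["combine_tessdata", "-e", str(base_model), str(lstm_weights)]

    # Step 2: continue training from the extracted weights on new data.
    train = [
        "lstmtraining",
        "--continue_from", str(lstm_weights),
        "--traineddata", str(base_model),
        "--train_listfile", str(train_list),
        "--model_output", str(checkpoint_prefix),
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine with the Tesseract training tools installed. Note that this simple form assumes the training text uses only characters already in the base model's character set. Kurdish adds letters that Arabic lacks, and Tesseract's documentation describes an additional `--old_traineddata` option for that case, which is not shown here.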

Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]).


This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International (CC BY-NC-ND 4.0) license.
