Training Tesseract OCR on Kurdish Historical Documents | HackerNoon

News Room
Last updated: 2025/08/19 at 11:16 PM
News Room Published 19 August 2025

Table of Links

Abstract and 1. Introduction

1.1 Printing Press in Iraq and Iraqi Kurdistan

1.2 Challenges in Historical Documents

1.3 Kurdish Language

  2. Related work and 2.1 Arabic/Persian

    2.2 Chinese/Japanese and 2.3 Coptic

    2.4 Greek

    2.5 Latin

    2.6 Tamizhi

  3. Method and 3.1 Data Collection

    3.2 Data Preparation and 3.3 Preprocessing

    3.4 Environment Setup, 3.5 Dataset Preparation, and 3.6 Evaluation

  4. Experiments, Results, and Discussion and 4.1 Processed Data

    4.2 Dataset and 4.3 Experiments

    4.4 Results and Evaluation

    4.5 Discussion

  5. Conclusion

    5.1 Challenges and Limitations

    Online Resources, Acknowledgments, and References

4 Experiments, Results, and Discussion

Initially, we collected some historical publications from the Zaytoon Public Library in Erbil. However, the documents were in fragile condition, which made them difficult to digitize. We then found, through an online search, the Zheen Center for Documentation and Research in Sulaymaniyah (https://zheen.org), a facility that specializes in scanning and archiving historical documents with equipment designed for that purpose. After we visited the center and explained our project, its staff agreed to provide us with digital copies of the earliest Kurdish publications in their collection.

4.1 Processed Data

To handle image processing, we used a freely available batch-processing tool. With it, we loaded the images, de-skewed them, cropped them automatically, converted them to binary format, and saved the results in the specified destination directory.
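The binarization and automatic cropping steps can be illustrated with a minimal Python sketch. It is not the batch tool used here: it assumes a page is already loaded as a list of rows of 8-bit grey values, uses Otsu's method as a stand-in thresholding technique, and leaves de-skewing out. The function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(pixels, threshold):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


def crop_to_ink(binary):
    """Crop a binary image to the bounding box of its ink pixels."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return binary  # blank page: nothing to crop
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]
```

In practice an image library would do the loading and saving; the sketch only shows the logic of thresholding and trimming the margins.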

4.2 Dataset

After receiving the historical documents from the Zheen Center for Documentation and Research in digital format, we split the pages into single-line images, each paired with its respective transcription. We used an image-processing application to crop the lines and saved them in TIFF format.

After converting the pages into line images (see Figure 16), we created a transcription file for each line image by manually typing its content in a text editor.

Figure 15: Sample page in the book titled ’Awat’ published in 1938 (Zheen Center for Documentation and Research)

We gave each transcription file the same base name as its line image, with the .gt.txt suffix (see Figure 17).

This produced the dataset for training Tesseract, which consists of 1233 files: roughly half are line images, and the rest are their transcription files (see Table 1).
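Because each line image must have a matching transcription, it helps to check the pairing before training. The following Python sketch is one way to do that. The `.tif` extension and the function name are assumptions for illustration; only the `.gt.txt` suffix comes from the naming scheme described above.

```python
from pathlib import Path

GT_SUFFIX = ".gt.txt"


def pair_ground_truth(directory, image_suffix=".tif"):
    """Return (paired, images_without_text, texts_without_image) base names."""
    directory = Path(directory)
    images = {
        path.name[: -len(image_suffix)]
        for path in directory.glob(f"*{image_suffix}")
    }
    texts = {
        path.name[: -len(GT_SUFFIX)]
        for path in directory.glob(f"*{GT_SUFFIX}")
    }
    paired = sorted(images & texts)
    images_without_text = sorted(images - texts)
    texts_without_image = sorted(texts - images)
    return paired, images_without_text, texts_without_image
```

An empty second and third list means every line image has a transcription and vice versa.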

4.3 Experiments

In this section, we provide details of the steps taken to prepare our environment, the training process of the model, and other relevant aspects.

4.3.1 Environment Setup

For the training environment, we used Ubuntu 22.04.2 LTS (Jammy Jellyfish). We cloned the tesstrain repository from https://github.com/tesseract-ocr/tesstrain and trained the model on our prepared dataset.
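tesstrain is driven by a Makefile, so a training run amounts to calling `make training` with a few variables. The sketch below builds such a command in Python. `MODEL_NAME`, `GROUND_TRUTH_DIR`, `START_MODEL`, `TESSDATA`, and `MAX_ITERATIONS` are tesstrain Makefile variables, but the helper function and every value passed to it are hypothetical, not the settings used in this work.

```python
def tesstrain_command(model_name, ground_truth_dir, start_model=None,
                      tessdata=None, max_iterations=None):
    """Build the argument list for a tesstrain `make training` run."""
    command = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"GROUND_TRUTH_DIR={ground_truth_dir}",
    ]
    # Optional variables are added only when the caller supplies them.
    if start_model is not None:
        command.append(f"START_MODEL={start_model}")
    if tessdata is not None:
        command.append(f"TESSDATA={tessdata}")
    if max_iterations is not None:
        command.append(f"MAX_ITERATIONS={max_iterations}")
    return command


# Example with placeholder values (assumptions, not the paper's settings):
example = tesstrain_command("kurdish_hist", "data/kurdish_hist-ground-truth",
                            max_iterations=10000)
# To launch it from the cloned repository:
# import subprocess; subprocess.run(example, cwd="tesstrain", check=True)
```

Returning a list rather than a shell string avoids quoting problems when paths contain spaces.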

Authors:

(1) Blnd Yaseen, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]);

(2) Hossein Hassani, University of Kurdistan Hewlêr, Kurdistan Region – Iraq ([email protected]).


This paper is available on arXiv under the Attribution-NonCommercial-NoDerivs 4.0 International license.
