Among text and document detection tasks, newspaper article detection is rightfully considered one of the most difficult. The lack of standardization, irregular article layouts, and the abundance of headers, subheaders, and images make accurate article extraction challenging.
Having worked on multiple projects digitizing both Western and Asian newspapers, I can confidently say my team and I have explored at least a dozen approaches to extracting newspaper articles.
In this article, I explore three approaches to newspaper article extraction, from the simplest and most cost-effective to the most accurate. Here we go.
Approach #1: GPT-4o
The easiest (and least expensive) approach to digitizing virtually any document is to use GPT-4o to analyze newspaper pages.
First things first, we need to detect article coordinates and extract text. There are plenty of ways to extract text from a page, but the important part here is to extract both the text and its coordinates, i.e., its position on the page. This information lets us reconstruct the text's structure, using tabs and line breaks, after extraction.
Using OCR via Azure Document Intelligence, we extract the text and its coordinates, then restore the structure of the text, formatting it in a way that best reflects the page layout. This step prepares the data for analysis by the LLM.
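A minimal sketch of this step, assuming the azure-ai-formrecognizer Python SDK and its prebuilt-read model (the endpoint, key, and file name are placeholders):

```python
# Sketch: extract text lines with their polygon coordinates using
# Azure Document Intelligence (prebuilt-read model).
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key; substitute your own resource values.
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

with open("newspaper_page.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)
result = poller.result()

lines_with_coords = []
for page in result.pages:
    for line in page.lines:
        # line.polygon is a list of Point(x, y) outlining the line on the page
        xs = [p.x for p in line.polygon]
        ys = [p.y for p in line.polygon]
        lines_with_coords.append({"text": line.content, "x": min(xs), "y": min(ys)})

# Naive top-to-bottom, left-to-right ordering; a real implementation
# would first group lines into columns before flattening the layout.
lines_with_coords.sort(key=lambda l: (round(l["y"], 1), l["x"]))
```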
Next, we send the formatted text to GPT-4o and ask it to split it into separate articles. The model looks for large semantic structures and uses them to determine where one article ends and another begins.
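A rough illustration of that request with the official OpenAI Python SDK; the prompt wording and the JSON schema it asks for are illustrative assumptions, not a production prompt:

```python
# Sketch: ask GPT-4o to split the layout-formatted page text into articles.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

page_text = "\n".join(l["text"] for l in lines_with_coords)  # from the OCR step

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "You receive the OCR text of a newspaper page whose layout is "
                "preserved with tabs and line breaks. Split it into separate "
                "articles and return a JSON object with an 'articles' array, "
                "each entry holding 'headline' and 'body'."
            ),
        },
        {"role": "user", "content": page_text},
    ],
)
articles_json = response.choices[0].message.content
```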
Lastly, we map the model's splits back onto the saved coordinates to splice the page into separate articles.
This approach, while relatively cheap and fast to develop, is not the most accurate: we’ve noticed 10-15% of articles were not detected accurately, with borders that are either too big or too small.
This level of accuracy may be good enough in certain situations, for example, for startups looking to attract investments: an MVP with 90% accuracy is more often than not enough for investor presentations.
However, when looking to improve the results, we need to change the approach. General-purpose LLMs have their place in document processing, but highly accurate article recognition calls for the big guns.
Approach #2: Segmentation Model
In this approach, we begin by applying a segmentation model to identify distinct blocks of text and images on a newspaper page.
Each block will be enclosed in a segmentation mask with precise coordinates; images on the page will likewise receive bounding boxes as part of the segmentation process.
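A minimal sketch of the inference step with the Ultralytics API; newspaper_blocks.pt stands for a hypothetical checkpoint fine-tuned on newspaper layouts, since stock COCO weights don't know these block classes:

```python
# Sketch: run a YOLOv8 segmentation model over a rendered page image.
from ultralytics import YOLO

model = YOLO("newspaper_blocks.pt")  # hypothetical fine-tuned weights
results = model("newspaper_page.png")

blocks = []
for r in results:
    # r.masks holds the per-block segmentation masks if polygons are needed
    for box, cls in zip(r.boxes.xyxy.tolist(), r.boxes.cls.tolist()):
        x1, y1, x2, y2 = box
        blocks.append({"bbox": (x1, y1, x2, y2), "class": r.names[int(cls)]})
```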
After segmentation, we will use the Azure DI Read Model to extract the text contained within each identified block.
We will then calculate embedding vectors for both the text and the images using appropriate pre-trained models (e.g., BERT for text and ResNet for image embeddings).
These embeddings will capture the semantic meaning of the content within each block.
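As a sketch, assuming BAAI/bge-m3 loads through sentence-transformers and a torchvision ResNet-50 with its classifier head removed serves as the image encoder:

```python
# Sketch: compute text and image embeddings for each block.
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from torchvision import models, transforms

text_model = SentenceTransformer("BAAI/bge-m3")
text_embeddings = text_model.encode(["OCR text of block 1", "OCR text of block 2"])

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("image_block.png").convert("RGB")).unsqueeze(0)
    image_embedding = resnet(img).squeeze(0)  # 2048-d vector
```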
Next, we will employ clustering algorithms to group the blocks and images into coherent articles. The clustering will be based not only on the embedding vectors but also on additional parameters, such as the relative spatial positions of the blocks on the page.
This multi-dimensional clustering will allow us to treat adjacent or semantically related blocks as part of the same article. If the clustering results are inaccurate, we will apply post-processing algorithms to refine the results by considering additional contextual cues.
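A compact sketch of such multi-dimensional clustering with scikit-learn's DBSCAN; the spatial weight and eps values are assumptions that need tuning per corpus:

```python
# Sketch: group blocks into articles by clustering joint
# semantic + spatial features.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# text_embeddings: (n_blocks, d) from the embedding step;
# centers: (n_blocks, 2) block centroids from the segmentation step.
emb = normalize(text_embeddings)      # unit-length semantic vectors
xy = centers / centers.max(axis=0)    # scale coordinates to [0, 1]
spatial_weight = 2.0                  # assumption: how strongly layout matters
features = np.hstack([emb, spatial_weight * xy])

labels = DBSCAN(eps=0.8, min_samples=1).fit_predict(features)
# Blocks that share a label are treated as one article.
```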
Once the articles are identified, we will crop the page according to the block boundaries of the clustered articles. For each article, we will save the cropped region as a PNG image, the extracted text, the embedding vector of the article, and a generated summary.
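A possible shape for that output step, assuming each clustered article carries a union bounding box, merged OCR text, and a NumPy embedding vector:

```python
# Sketch: crop each clustered article from a page render and save its outputs.
import json
from PIL import Image

page_image = Image.open("newspaper_page.png")

for i, art in enumerate(articles):
    left, top, right, bottom = art["bbox"]
    page_image.crop((left, top, right, bottom)).save(f"article_{i}.png")
    with open(f"article_{i}.json", "w", encoding="utf-8") as f:
        json.dump(
            {"text": art["text"], "embedding": art["embedding"].tolist()},
            f,
            ensure_ascii=False,
        )
```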
Workflow
- PDF Input: The process begins by ingesting a PDF of a newspaper.
- Segmentation: Apply a segmentation model (e.g., YOLOv8-seg) to detect and mark individual blocks (text and images) on the page, providing bounding boxes for each.
- OCR and Embedding Extraction:
  - For each text block, Azure’s OCR extracts the text.
  - Text embeddings are computed using models like BAAI/bge-m3 to represent the semantic content.
  - Image embeddings are calculated using models like ResNet.
- Clustering: A clustering algorithm such as DBSCAN, Agglomerative Clustering, or K-means groups the text and image blocks into articles, using both the embedding vectors and the spatial relationships (e.g., relative coordinates) of the blocks.
- Post-Processing: If clustering results are imprecise, a post-processing step involving contextual rules or heuristic-based methods is applied to adjust the groupings.
- Article Cropping and Output:
  - Once articles are identified, the system crops the page based on the bounding boxes of each clustered article.
  - For each article, the following outputs are generated:
    - PNG: a cropped image of the article.
    - Text: the full text extracted via OCR.
    - Embedding: the combined embedding vector for the text (this will be useful later for retrieving relevant articles from the database on request).
Approach #3: Segmentation Model For Instant Article Detection
In this approach, instead of segmenting the page into individual blocks of text and images, we directly detect complete articles on the newspaper page.
This method relies on a trained segmentation model specifically designed to identify articles as coherent units, which include both text and images.
It’s important to note this approach requires a larger marked-up dataset.
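For reference, fine-tuning on such a dataset with Ultralytics could look like this; the dataset config name is hypothetical:

```python
# Sketch: fine-tune a YOLOv8 segmentation model on the marked-up article
# dataset. "articles.yaml" is a hypothetical Ultralytics dataset config
# pointing at page images with polygon labels for whole articles.
from ultralytics import YOLO

model = YOLO("yolov8m-seg.pt")  # start from pretrained segmentation weights
model.train(data="articles.yaml", epochs=100, imgsz=1280)
```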
Once the articles are detected, we use OCR (Azure’s OCR service from the Azure DI Read Model) to extract the full text within each detected article. As in the previous approach, text embeddings are computed.
Following text extraction and embedding calculation, each article is processed for output, which includes saving a cropped image of the article, the extracted text, the embedding vector representing the article, and an auto-generated summary.
Workflow:
- Upload the input PDF of the newspaper.
- Apply a segmentation model (e.g., YOLO or DETR) to detect entire articles on the page, providing a segmentation mask around each article. This is a much more complex task than the block-level segmentation from the previous approach, so it is vital to prepare a large dataset with correctly marked-up articles.
- OCR and Embedding Logic:
  - For each detected article, Azure’s OCR is used to extract the full text.
  - Text embeddings are computed using models like BAAI/bge-m3.
- Post-Processing: If detection results are imprecise, a post-processing step involving contextual rules or heuristic-based methods is applied to adjust the groupings.
- Article Cropping and Output:
  - For each detected article, the system crops the page based on the bounding box.
  - The following outputs are generated for each article:
    - PNG: a cropped image of the entire article.
    - Text: the full text extracted via OCR.
    - Embedding: the combined embedding vector for the article’s text and images.
Summing Up
- For a quick prototype or an MVP: combine the power of OCR algorithms and an LLM like GPT-4o to detect articles with relatively high accuracy.
- For high-accuracy article detection with minimal manual correction of the results: use a segmentation model with embedding-based clustering.
- For the highest precision: prepare a marked-up dataset and fine-tune a segmentation model to detect entire articles directly, with 99-100% accuracy.