By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
World of SoftwareWorld of SoftwareWorld of Software
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Search
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
Reading: How Researchers Are Preprocessing Gigabyte-Sized WSIs for Deep Learning | HackerNoon
Share
Sign In
Notification Show More
Font ResizerAa
World of SoftwareWorld of Software
Font ResizerAa
  • Software
  • Mobile
  • Computing
  • Gadget
  • Gaming
  • Videos
Search
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Have an existing account? Sign In
Follow US
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
World of Software > Computing > How Researchers Are Preprocessing Gigabyte-Sized WSIs for Deep Learning | HackerNoon
Computing

How Researchers Are Preprocessing Gigabyte-Sized WSIs for Deep Learning | HackerNoon

News Room
Last updated: 2025/07/16 at 6:34 PM
News Room Published 16 July 2025
Share
SHARE

Table of Links

Abstract and I. Introduction

  1. Materials and Methods

    2.1. Multiple Instance Learning

    2.2. Model Architectures

  2. Results

    3.1. Training Methods

    3.2. Datasets

    3.3. WSI Preprocessing Pipeline

    3.4. Classification and RoI Detection Results

  3. Discussion

    4.1. Tumor Detection Task

    4.2. Gene Mutation Detection Task

  4. Conclusions

  5. Acknowledgements

  6. Author Declaration and References

3.3. WSI Preprocessing Pipeline

WSIs are often too large to be directly processed by Deep Learning models. Slides can occupy over 6 GB of memory, so having a dataset of raw slides on disk is not feasible in our case. To store our dataset locally, we decided to focus on one magnification at a time. All our models were trained in separate magnification levels, which guaranteed that this would not be an issue. For some magnifications, the total of patches might exceed the storage space available. To overcome this issue, we developed a pipeline that fetches WSI tiles, processes, filters, efficiently encodes them, and saves the resulting embeddings and relevant metadata on HDF5 files. This pipeline is explained in the following sections.

3.3.1. WSI Metadata Extraction

We start by extracting information about each slide: id, labels type of slide, and microns per pixel (mpp) at which the slide was originally scanned. In the end, all the metadata extracted is saved to a CSV (comma-separated values) file for further processing.

3.3.2. Patch Fetching and Pre-processing

Whole slide images can have a lot of patches consisting exclusively of background or artifacts that make them unusable for training our models. Fetching all these unnecessary patches and checking them one by one becomes inefficient and time-consuming. To avoid this we take advantage of WSI’s multi-scale property to avoid fetching patches that will definitely contain only background.

The tile at the thumbnail level is initially fetched. Otsu’s thresholding [16] is applied to it, as well as a close morphological operation to filter noise. We then extract the resulting black pixel coordinates, obtaining a set of pixels 𝑃 , the pixels that correspond to tissue.

In the acquisition of a WSI, there are often unintentional artifacts due to manual tissue preparation, staining, and scanning hardware, as well as pathologists’ annotations. To mitigate the number of tiles containing artifacts as well as remove some background pixels that might have passed through the previous filter, the color of each pixel 𝑝 ∈ 𝑃 was compared to the average color of 𝑃 , by calculating their Euclidean distance and comparing it with a pre-defined threshold. The coordinates of the pixels that fulfilled this condition were then stored and used to calculate the corresponding coordinates of the patches at the desired magnification, using the hierarchical properties of WSI and the metadata extracted from the step in 3.3.1.

Because these filters were applied at the thumbnail level, some of the tiles were still not suitable for the final dataset. After fetching each tile, we checked if its size in image coordinates was 512 × 512 px. If it was smaller, it was padded accordingly with the average background color. The percentage of tissue present in the tile was then calculated and compared with a threshold. If it does not contain enough tissue, the image is discarded.

For Gene Mutation Detection, due to the size of the FFPE slides and the multiple magnification levels used, we applied random sampling to the tiles:

• At 5x magnification, we sampled 60% of the filtered tiles, when their number was greater than a certain limit; otherwise, we used all the tiles.

• In the case of 10x and 20x magnification, we applied clustering on the tiles from the previous magnification and sampled a chosen number n of tiles per cluster. We then proceeded to use the hierarchical properties of WSI to extract the corresponding tiles at the desired magnifications. For instance, we performed K-means clustering on the tiles at 5x magnification, sampled a maximum of 20 tiles from each cluster, and proceeded to extract, for each tile, the corresponding tiles at 10x magnification (Figure 2).

igure 2: Sampling method for 10x and 20x magnifications. K-means clustering is applied to the set of tiles chosen from the previous magnification 𝑚. N tiles from each cluster are selected and the corresponding tiles at the magnification desired (𝑚 + 1) are then fetched.igure 2: Sampling method for 10x and 20x magnifications. K-means clustering is applied to the set of tiles chosen from the previous magnification 𝑚. N tiles from each cluster are selected and the corresponding tiles at the magnification desired (𝑚 + 1) are then fetched.

3.3.3. Feature Extraction

As mentioned previously, embeddings were created from the tiles. For this, we chose KimiaNet[18], a Densenet 121 pretrained for WSI tumor subtype classification, that produces embedding vectors with a length of 1024. This model was trained exclusively on FFPE slides.

We also performed two data augmentations per tile, composed of random HED stain perturbation, Gaussian noise addition, rotations, and horizontal and vertical flips. Embeddings were generated for these augmented tiles as well. The embeddings are then saved to an HDF5 file, along with the corresponding relevant metadata, such as the coordinates of the patches and labels. This file can then be quickly read to memory during the training process of the models.

Due to the size of the dataset, immediately converting each 512 × 512 pixels patch to an embedding of length 1024 saves storage space (with a decrease in size close to a thousand-fold) and time for training the model and allows us to have the whole dataset locally, instead of having to fetch the data each time we needed to train or fine-tune the models. Furthermore, by having the slides represented in feature space immediately, as opposed to pixel space, we were able to fit all patches in a slide into GPU memory concurrently, which is especially useful for multiple instance learning approaches.

Authors:

(1) Martim Afonso, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, Lisbon, 1049-001, Portugal;

(2) Praphulla M.S. Bhawsar, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, 20850, Maryland, USA;

(3) Monjoy Saha, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, 20850, Maryland, USA;

(4) Jonas S. Almeida, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Bethesda, 20850, Maryland, USA;

(5) Arlindo L. Oliveira, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, Lisbon, 1049-001, Portugal and INESC-ID, R. Alves Redol 9, Lisbon, 1000-029, Portugal.


This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.
By signing up, you agree to our Terms of Use and acknowledge the data practices in our Privacy Policy. You may unsubscribe at any time.
Share This Article
Facebook Twitter Email Print
Share
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article That’s Not a Human: Google Search Can Now Make Phone Calls To Businesses
Next Article How Payment Links Helped a Busy Mompreneur Manage Her Growing Business
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Stay Connected

248.1k Like
69.1k Follow
134k Pin
54.3k Follow

Latest News

Squarespace Promo Codes: 50% Off July 2025
Gadget
Samsung Galaxy F36 5G: Everything you need to know ahead of launch
Software
Key prosecutor on Jeffrey Epstein and Diddy cases Maurene Comey sacked by DOJ
News
Indonesia to establish AI center of excellence with support from Nvidia, Cisco and Indosat – News
News

You Might also Like

Computing

Huawei unveils new Pura 70 series smartphones, expected to be on sale from April 18 · TechNode

1 Min Read
Computing

TikTok explores local services potential in Southeast Asia · TechNode

1 Min Read
Computing

General Motors’ China joint venture launches Tesla-style autopilot tech ahead of rivals · TechNode

1 Min Read
Computing

MediaTek develops Arm-based server chips with TSMC’s 3nm process

1 Min Read
//

World of Software is your one-stop website for the latest tech news and updates, follow us now to get the news that matters to you.

Quick Link

  • Privacy Policy
  • Terms of use
  • Advertise
  • Contact

Topics

  • Computing
  • Software
  • Press Release
  • Trending

Sign Up for Our Newsletter

Subscribe to our newsletter to get our newest articles instantly!

World of SoftwareWorld of Software
Follow US
Copyright © All Rights Reserved. World of Software.
Welcome Back!

Sign in to your account

Lost your password?