Building a Petabyte-Scale Web Archive | HackerNoon

News Room · Published 10 December 2025

In an engineer’s ideal world, architecture is always beautiful. In the real world of high-scale systems, you have to make compromises. One of the fundamental problems an engineer must think about at the start is the vicious trade-off between Write Speed and Read Speed.

Usually, you sacrifice one for the other. But in our case, working with petabytes of data in AWS, this compromise didn’t hit our speed; it hit our wallet.

We built a system that wrote data perfectly, but every time it read from the archive, it burned through the budget in the most painful way imaginable. After all, reading petabytes from AWS costs money for data transfer, request counts, and storage class retrievals… A lot of money!

This is the story of how we optimized it to make it more efficient and cost-effective!

Part 0: How We Ended Up Spending $100,000 in AWS Fees!

True story: a few months back, one of our solution architects wanted to pull a sample export from a rare, low-traffic website to demonstrate the product to a potential client. Due to a bug in the API, the safety limit on file count wasn’t applied.

Because the data for this “rare” site was scattered across millions of archives alongside high-traffic sites, the system tried to restore nearly half of our entire historical storage to find those few pages.

That honest mistake ended up costing us nearly $100,000 in AWS fees!

Now, I fixed the API bug immediately (and added strict limits), but the architectural vulnerability remained. It was a ticking time bomb…

Let me tell you the story of the Bright Data Web Archive architecture: how I drove the system into the trap of “cheap” storage and how I climbed out using a Rearrange Pipeline.

Part 1: The “Write-First” Legacy

When I started working on the Web Archive, the system was already ingesting a massive data stream: millions of requests per minute, tens of terabytes per day. The foundational architecture was built with a primary goal: capture everything without data loss.

It relied on the most durable strategy for high-throughput systems: an append-only log.

  1. Data (HTML, JSON) is buffered.
  2. Once the buffer hits ~300 MB, it is “sealed” into a TAR archive.
  3. The archive flies off to S3.
  4. After 3 days, files move to S3 Glacier Deep Archive.

For the ingestion phase, this design was flawless. Storing data in Deep Archive costs pennies, and the write throughput is virtually unlimited.
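
To make the write path concrete, here is a minimal sketch of the seal-and-ship step, assuming Node.js with the tar-stream package and AWS SDK v3 (the bucket name, key layout, and in-memory buffering are simplified illustrations, not our production code):

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { randomUUID } from "node:crypto";
import * as tar from "tar-stream";

const s3 = new S3Client({});
const SEAL_THRESHOLD = 300 * 1024 * 1024; // seal the buffer at ~300 MB

interface Page { url: string; body: Buffer; }

let buffer: Page[] = [];
let bufferedBytes = 0;

// Called for every captured page; seals the buffer into a TAR once it is full.
export async function ingest(page: Page): Promise<void> {
  buffer.push(page);
  bufferedBytes += page.body.length;
  if (bufferedBytes >= SEAL_THRESHOLD) {
    const sealed = buffer;
    buffer = [];
    bufferedBytes = 0;
    await sealAndUpload(sealed);
  }
}

// Packs the buffered pages into one TAR archive and writes it under a date prefix,
// e.g. 2024/05/05/<uuid>_<ts>.tar.
async function sealAndUpload(pages: Page[]): Promise<void> {
  const pack = tar.pack();
  for (const page of pages) {
    pack.entry({ name: encodeURIComponent(page.url) }, page.body);
  }
  pack.finalize();

  const chunks: Buffer[] = [];
  for await (const chunk of pack) chunks.push(chunk as Buffer);

  const day = new Date().toISOString().slice(0, 10).replace(/-/g, "/");
  await s3.send(new PutObjectCommand({
    Bucket: "web-archive-raw", // hypothetical bucket name
    Key: `${day}/${randomUUID()}_${Date.now()}.tar`,
    Body: Buffer.concat(chunks),
  }));
}
```

The 3-day transition to Glacier Deep Archive would typically be handled by an S3 lifecycle rule on the bucket rather than by the ingestion code itself.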

The Problem: That Pricing Nuance

The architecture worked perfectly for writing… until clients came asking for historical data. That’s when I faced a fundamental contradiction:

  • The System Writes by Time: An archive from 12:00 PM contains a mix of cnn.com, google.com, and shop.xyz.
  • The System Reads by Domain: The client asks: “Give me all pages from cnn.com for the last year.”

Here lies the mistake that inspired this article. Like many engineers, I’m used to thinking about latency, IOPS, and throughput. But I overlooked the AWS Glacier billing model.

I thought: “Well, retrieving a few thousand archives is slow (48 hours), but it’s not that expensive.”

The Reality: AWS charges not just for the API call, but for the volume of data restored ($ per GB retrieved).

The “Golden Byte” Effect

Imagine a client requests 1,000 pages from a single domain. Because the writing logic was chronological, these pages can be spread across 1,000 different TAR archives.

To give the client those ~50 MB of useful data, a disaster unfolds:

  1. The system has to trigger a Restore for 1,000 archives.
  2. It lifts 300 GB of data out of the “freezer” (1,000 archives × 300 MB).
  3. AWS bills us for restoring 300 GB.
  4. I extract the 50 MB required and throw away the other 299.95 GB 🤯.

We were paying to restore terabytes of trash just to extract grains of gold. It was a classic Data Locality problem that turned into a financial black hole.
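
To put numbers on it, here is a back-of-the-envelope cost model in the same spirit; the per-GB and per-request prices below are illustrative placeholders, not current AWS rates:

```typescript
// Rough cost model for one "golden byte" request. Prices are placeholders.
const ARCHIVE_SIZE_GB = 0.3;         // each sealed TAR is ~300 MB
const RETRIEVAL_PRICE_PER_GB = 0.02; // illustrative Deep Archive retrieval price, $/GB
const REQUEST_PRICE = 0.0001;        // illustrative per-restore-request price, $

function restoreCost(archivesTouched: number, usefulMb: number) {
  const restoredGb = archivesTouched * ARCHIVE_SIZE_GB;
  const dollars = restoredGb * RETRIEVAL_PRICE_PER_GB + archivesTouched * REQUEST_PRICE;
  const wasteRatio = 1 - usefulMb / (restoredGb * 1024);
  return { restoredGb, dollars, wasteRatio };
}

// 1,000 pages (~50 MB useful) scattered across 1,000 chronological archives:
console.log(restoreCost(1000, 50));
// → 300 GB restored, a few dollars per request at these placeholder prices,
//   and ~99.98% of the restored bytes thrown away. Multiply by thousands of
//   requests, or one runaway export, and the bill explodes.
```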

Part 2: Fixing the Mistake: The Rearrange Pipeline

I couldn’t quickly change the ingestion method: the incoming stream is too parallel and massive to sort “on the fly” (though I am working on that), and I needed a solution that worked for already-archived data, too.

So, I designed the Rearrange Pipeline, a background process that “defragments” the archive.

This is an asynchronous ETL (Extract, Transform, Load) process with several core stages:

  1. Selection: It makes no sense to sort data that clients aren’t asking for. Thus, I direct all new data into the pipeline, as well as data that clients have specifically asked to restore. We overpay for the retrieval the first time, but it never happens a second time.

  2. Shuffling (Grouping): Multiple workers download unsorted files in parallel and organize buffers by domain (see the sketches after this list). Since the system is asynchronous, I don’t worry about the incoming stream overloading memory; the workers handle the load at their own pace.

  3. Rewriting: I write the sorted files back to S3 under a new prefix (to distinguish sorted files from raw ones).

  • Before: 2024/05/05/random_id_ts.tar → [cnn, google, zara, cnn]
  • After: 2024/05/05/cnn/random_id_ts.tar → [cnn, cnn, cnn...]
  4. Metadata Swap: In Snowflake, the metadata table is append-only, so doing MERGE INTO or UPDATE is prohibitively expensive (also sketched after this list).
  • The Solution: I found it was far more efficient to take all records for a specific day, write them to a separate table using a JOIN, delete the original day’s records, and insert the entire day back with the modified records. I processed 300+ days, the equivalent of 160+ billion row updates, in just a few hours on a 4X-Large Snowflake warehouse.
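
A simplified sketch of the shuffle-and-rewrite stages, again assuming Node.js with tar-stream and AWS SDK v3; the bucket name and the extractDomain helper are hypothetical, and the real workers, queueing, and retry logic are omitted:

```typescript
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import { Readable } from "node:stream";
import * as tar from "tar-stream";

const s3 = new S3Client({});
const BUCKET = "web-archive-raw"; // hypothetical bucket name

type Entry = { name: string; body: Buffer };

// Hypothetical helper: recover the domain from the stored entry name (a URL-encoded URL).
const extractDomain = (entryName: string): string =>
  new URL(decodeURIComponent(entryName)).hostname;

// Shuffling: download one raw chronological TAR and group its entries by domain.
async function groupByDomain(key: string): Promise<Map<string, Entry[]>> {
  const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
  const extract = tar.extract();
  const groups = new Map<string, Entry[]>();

  extract.on("entry", (header, stream, next) => {
    const chunks: Buffer[] = [];
    stream.on("data", (c: Buffer) => chunks.push(c));
    stream.on("end", () => {
      const domain = extractDomain(header.name);
      const list = groups.get(domain) ?? [];
      list.push({ name: header.name, body: Buffer.concat(chunks) });
      groups.set(domain, list);
      next();
    });
  });

  await new Promise<void>((resolve, reject) => {
    extract.on("finish", resolve);
    extract.on("error", reject);
    (obj.Body as Readable).pipe(extract);
  });
  return groups;
}

// Rewriting: each per-domain group becomes its own TAR under a domain prefix,
// e.g. 2024/05/05/cnn/<ts>.tar instead of 2024/05/05/<id>_<ts>.tar.
async function rewriteByDomain(day: string, groups: Map<string, Entry[]>): Promise<void> {
  for (const [domain, entries] of groups) {
    const pack = tar.pack();
    for (const e of entries) pack.entry({ name: e.name }, e.body);
    pack.finalize();

    const chunks: Buffer[] = [];
    for await (const c of pack) chunks.push(c as Buffer);

    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: `${day}/${domain}/${Date.now()}.tar`,
      Body: Buffer.concat(chunks),
    }));
  }
}
```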
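
The metadata swap can be sketched the same way; the table and column names are hypothetical, and the point is the rebuild, delete, and re-insert pattern instead of per-row UPDATEs:

```typescript
// Metadata swap: rebuild a full day of records and swap it in, instead of
// running MERGE INTO / UPDATE against the append-only metadata table.
// runSnowflake is a hypothetical helper wrapping snowflake-sdk's execute();
// table and column names are illustrative.
async function swapDayMetadata(
  day: string,
  runSnowflake: (sql: string) => Promise<void>,
): Promise<void> {
  // 1. Build the corrected records for the day in a scratch table, joining the
  //    original metadata with the new per-domain archive locations.
  await runSnowflake(`
    CREATE OR REPLACE TABLE metadata_rebuilt AS
    SELECT m.page_id, m.url, m.captured_day, r.new_archive_key AS archive_key
    FROM page_metadata m
    JOIN rearranged_locations r ON m.page_id = r.page_id
    WHERE m.captured_day = '${day}'
  `);

  // 2. Delete the original day's records...
  await runSnowflake(`DELETE FROM page_metadata WHERE captured_day = '${day}'`);

  // 3. ...and insert the whole day back with the updated archive locations.
  await runSnowflake(`INSERT INTO page_metadata (page_id, url, captured_day, archive_key)
    SELECT page_id, url, captured_day, archive_key FROM metadata_rebuilt`);
}
```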

The Result

This change radically altered the product’s economics:

  • Pinpoint Accuracy: Now, when a client asks for cnn.com, the system restores only the data where cnn.com lives (sketched below).
  • Efficiency: Depending on the granularity of the request (entire domain vs. specific URLs via regex), I achieved a 10% to 80% reduction in “garbage data” retrieval, and cost scales directly with the volume restored.
  • New Capabilities: Beyond just saving money on dumps, this unlocked entirely new business use cases. Because retrieving historical data is no longer agonizingly expensive, we can now afford to extract massive datasets for training AI models, conducting long-term market research, and building knowledge bases for agentic AI systems to reason over (think specialized search engines). What was previously a financial suicide mission is now a standard operation.
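
For illustration, the read path after the rearrange looks roughly like this; the bucket name is hypothetical, and the real service adds limits, batching, and restore-status polling:

```typescript
import { S3Client, ListObjectsV2Command, RestoreObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const SORTED_BUCKET = "web-archive-sorted"; // hypothetical bucket for rearranged data

// Restores only the archives that actually contain the requested domain for a given
// day, instead of every chronological TAR the domain might be buried in.
async function restoreDomain(day: string, domain: string): Promise<number> {
  let restored = 0;
  let token: string | undefined;
  do {
    const page = await s3.send(new ListObjectsV2Command({
      Bucket: SORTED_BUCKET,
      Prefix: `${day}/${domain}/`,
      ContinuationToken: token,
    }));
    for (const obj of page.Contents ?? []) {
      await s3.send(new RestoreObjectCommand({
        Bucket: SORTED_BUCKET,
        Key: obj.Key!,
        RestoreRequest: { Days: 7, GlacierJobParameters: { Tier: "Bulk" } },
      }));
      restored++;
    }
    token = page.NextContinuationToken;
  } while (token);
  return restored;
}
```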

We Are Hiring

Bright Data is scaling the Web Archive even further. If you enjoy:

  • High‑throughput distributed systems,
  • Data engineering at massive scale,
  • Building reliable pipelines under real‑world load,
  • Pushing Node.js to its absolute limits,
  • Solving problems that don’t appear in textbooks…

Then I’d love to talk.

We’re hiring strong Node.js engineers to help build the next generation of the Web Archive. Data engineering and ETL experience is a big plus. Feel free to send your CV to [email protected].

More updates coming as I continue scaling the archive—and as I keep finding new and creative ways to break it!
