World of Software

350PB, Millions of Events, One System: Inside Uber’s Cross-Region Data Lake and Disaster Recovery

News Room
Last updated: 2026/01/16 at 1:26 PM
News Room Published 16 January 2026

Uber has built HiveSync, a sharded batch replication system that keeps Hive and HDFS data synchronized across multiple regions and handles millions of Hive events daily. HiveSync ensures cross-region data consistency and underpins Uber’s disaster recovery strategy. It also removes the inefficiency of an idle secondary region, which previously incurred hardware costs equal to the primary’s, while still maintaining high availability.

Built initially on the open-source Airbnb ReAir project, HiveSync has been extended with sharding, DAG-based orchestration, and a separation of control and data planes. ETL jobs now execute exclusively in the primary data center, while HiveSync handles cross-region replication with near real-time consistency, preserving disaster readiness and analytics access. Sharding allows tables and partitions to be divided into independent units for parallel replication and fine-grained fault tolerance.
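To make the sharding idea concrete, here is a minimal sketch of static, hash-based shard assignment. It is only an illustration: the shard count, the function names, and the choice of hashing on the table name are assumptions, not details from Uber's description of HiveSync, which also mentions dynamic sharding that this sketch does not cover.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical shard count, not a figure from Uber


def shard_for(table: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a fully qualified table name to a stable shard id."""
    digest = hashlib.sha256(table.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_shards


def group_by_shard(tables, num_shards: int = NUM_SHARDS):
    """Group table names into shards that can be replicated independently."""
    shards = defaultdict(list)
    for table in tables:
        shards[shard_for(table, num_shards)].append(table)
    return dict(shards)
```

Because every table always lands in the same shard, each shard can be handed to a separate worker, and a failure in one shard need not block the others.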

HiveSync separates the control plane, which orchestrates jobs and manages state in a relational metadata store, from the data plane, which performs HDFS and Hive file operations. A Hive Metastore event listener captures DDL and DML changes, logging them to MySQL and triggering replication workflows. Jobs are represented as finite-state machines, supporting restartability and robust failure recovery.

HiveSync architecture: control plane and data plane separation (Source: Uber Blog Post)
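The finite-state-machine job model described above can be sketched as follows. The state names, the transition table, and the `ReplicationJob` class are hypothetical, and a plain dictionary stands in for the relational metadata store; the point is only that persisting every transition lets a restarted worker resume a job where it left off.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    COPYING = "COPYING"
    COMMITTING = "COMMITTING"
    DONE = "DONE"
    FAILED = "FAILED"


# Allowed transitions (hypothetical; the real state set is not given in the article).
TRANSITIONS = {
    JobState.PENDING: {JobState.COPYING, JobState.FAILED},
    JobState.COPYING: {JobState.COMMITTING, JobState.FAILED},
    JobState.COMMITTING: {JobState.DONE, JobState.FAILED},
    JobState.FAILED: {JobState.PENDING},  # a failed job may be retried
    JobState.DONE: set(),
}


class ReplicationJob:
    def __init__(self, job_id: str, store: dict):
        self.job_id = job_id
        self.store = store  # stand-in for the metadata store
        # Resume from the persisted state if one exists, else start as PENDING.
        self.state = JobState(store.get(job_id, JobState.PENDING.value))
        store[job_id] = self.state.value

    def advance(self, new_state: JobState) -> None:
        """Move to new_state and persist it, rejecting illegal transitions."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        self.store[self.job_id] = new_state.value
```

Constructing a second `ReplicationJob` with the same id and store picks up the last persisted state, which is the restartability property the article attributes to this design.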

HiveSync has two main components: the HiveSync Replication Service and the Data Reparo Service. The Replication Service uses a Hive Metastore Event Listener to capture table and partition changes in real time, logging them asynchronously in MySQL. These audit entries are converted into asynchronous replication jobs executed as finite-state machines, with states persisted for reliability. Uber uses a hybrid copy strategy: smaller jobs use RPC for efficiency, while larger jobs leverage DistCp on YARN. A DAG manager enforces shard-level ordering and locks, while static and dynamic sharding enable horizontal scaling. Together, these keep replication consistent and conflict-free.

HiveSync replication service (Source: Uber Blog Post)
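The hybrid copy strategy and shard-level ordering can be illustrated with a small planning function. This is a sketch under stated assumptions: the size threshold, the `CopyJob` fields, and the strategy labels are invented for illustration, and no real RPC or DistCp call is made.

```python
from dataclasses import dataclass

# Hypothetical cutoff between "small" and "large" jobs; the article gives no number.
SIZE_THRESHOLD_BYTES = 256 * 1024 * 1024


@dataclass(frozen=True)
class CopyJob:
    event_id: int    # position of the originating event in the audit log
    shard: int       # shard that owns the table
    table: str
    size_bytes: int  # amount of data the job has to copy


def choose_strategy(job: CopyJob, threshold: int = SIZE_THRESHOLD_BYTES) -> str:
    """Pick a direct RPC copy for small jobs and DistCp for large ones."""
    return "rpc" if job.size_bytes < threshold else "distcp"


def plan(jobs):
    """Return, per shard, the jobs in event order with their chosen strategy.

    Jobs within a shard keep their event order; different shards are independent.
    """
    by_shard = {}
    for job in sorted(jobs, key=lambda j: j.event_id):
        by_shard.setdefault(job.shard, []).append((job.event_id, choose_strategy(job)))
    return by_shard
```

Ordering is enforced only inside a shard, so two shards can make progress in parallel while changes to the same table are still applied in the order they occurred.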

Data Reparo is a reconciliation service that continuously scans datacenter1 (DC1) and datacenter2 (DC2) for anomalies, such as missing or extra partitions and out-of-band HDFS updates, and fixes any mismatches to restore parity between the two, targeting over 99.99% accuracy. HiveSync maintains a four-hour replication SLA with a 99th percentile lag of around 20 minutes. It also supports a one-time replication service for bootstrapping historical datasets into new regions or clusters before switching to incremental replication.

Data Reparo analyzes and resolves inconsistencies across data centers (Source: Uber Blog Post)
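The partition-level part of this reconciliation amounts to a set difference. The sketch below assumes that DC1 is the source of truth, which the article implies but does not state outright; the function names and action labels are hypothetical. Detecting out-of-band HDFS updates would need file-level checks that are not shown here.

```python
def diff_partitions(primary: set, secondary: set):
    """Return (missing, extra) partitions of the secondary relative to the primary."""
    missing = sorted(primary - secondary)  # present in DC1, absent in DC2
    extra = sorted(secondary - primary)    # present in DC2, absent in DC1
    return missing, extra


def repair_actions(table: str, primary: set, secondary: set):
    """Plan the actions that would bring the secondary back in line with the primary."""
    missing, extra = diff_partitions(primary, secondary)
    actions = [("copy", table, partition) for partition in missing]
    actions += [("drop", table, partition) for partition in extra]
    return actions
```

For example, if DC1 holds `ds=2026-01-01` and `ds=2026-01-02` while DC2 holds `ds=2026-01-02` and a stale `ds=2025-12-31`, the plan is to copy the first partition and drop the stale one.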

HiveSync operates at a massive scale, managing 800,000 Hive tables totaling approximately 300 petabytes of data, with individual tables ranging from a few gigabytes to tens of petabytes. Partitions per table vary from a few hundred to over a million. Each day, HiveSync processes over 5 million Hive DDL and DML events, replicating about 8 petabytes of data across regions.
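For a sense of what these daily totals mean as sustained rates, the figures above can be converted to per-second averages. This is back-of-the-envelope arithmetic only: it assumes decimal units (1 PB = 10^15 bytes) and a perfectly even load across the day, neither of which the article states, and it treats "over 5 million" as exactly 5 million.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 5_000_000          # "over 5 million" events, taken as a lower bound
bytes_per_day = 8 * 10**15          # about 8 PB replicated per day
total_bytes = 300 * 10**15          # approximately 300 PB under management
table_count = 800_000

events_per_second = events_per_day / SECONDS_PER_DAY
gigabytes_per_second = bytes_per_day / 1e9 / SECONDS_PER_DAY
average_table_gigabytes = total_bytes / table_count / 1e9

print(f"{events_per_second:.1f} events/s")         # 57.9 events/s
print(f"{gigabytes_per_second:.1f} GB/s")          # 92.6 GB/s
print(f"{average_table_gigabytes:.0f} GB/table")   # 375 GB/table
```

The average table size hides a wide spread, since the article notes that individual tables range from a few gigabytes to tens of petabytes.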

Looking ahead, Uber plans to extend HiveSync for cloud replication use cases as batch analytics and ML pipelines migrate to Google Cloud, further leveraging sharding, orchestration, and reconciliation to maintain petabyte-scale data integrity efficiently.
