Uber has built HiveSync, a sharded batch replication system that keeps Hive and HDFS data synchronized across multiple regions while handling millions of Hive events daily. HiveSync ensures cross-region data consistency and underpins Uber’s disaster recovery strategy, and it eliminates the inefficiency of a secondary region that previously sat idle while incurring the same hardware costs as the primary, all without sacrificing high availability.
Initially built on Airbnb’s open-source ReAir project, HiveSync has been extended with sharding, DAG-based orchestration, and a separation of control and data planes. ETL jobs now execute exclusively in the primary data center, while HiveSync handles cross-region replication with near real-time consistency, preserving disaster readiness and analytics access. Sharding divides tables and partitions into independent units that can be replicated in parallel and recovered with fine-grained fault tolerance, as sketched below.
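To illustrate the sharding idea, here is a minimal sketch (not Uber’s code) of assigning replication units to shards so they can be processed in parallel and fail independently. The class names, the shard count, and the choice of hashing by table name are illustrative assumptions.

```java
import java.util.List;

public class ShardAssigner {
    private final int shardCount;

    public ShardAssigner(int shardCount) {
        this.shardCount = shardCount;
    }

    /** A replication unit: one table, or one partition of a table. */
    public record ReplicationUnit(String database, String table, String partitionSpec) {
        String key() {
            // Sharding by table keeps all partitions of a table on the same shard,
            // which preserves per-table ordering of replicated DDL/DML events.
            return database + "." + table;
        }
    }

    /** Static sharding: a stable hash of the table name picks the shard. */
    public int shardFor(ReplicationUnit unit) {
        return Math.floorMod(unit.key().hashCode(), shardCount);
    }

    public static void main(String[] args) {
        ShardAssigner assigner = new ShardAssigner(16);
        List<ReplicationUnit> units = List.of(
                new ReplicationUnit("rides", "trips", "datestr=2024-05-01"),
                new ReplicationUnit("rides", "trips", "datestr=2024-05-02"),
                new ReplicationUnit("eats", "orders", null));
        for (ReplicationUnit u : units) {
            System.out.printf("%s -> shard %d%n", u.key(), assigner.shardFor(u));
        }
    }
}
```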
HiveSync separates the control plane, which orchestrates jobs and manages state in a relational metadata store, from the data plane, which performs HDFS and Hive file operations. A Hive Metastore event listener captures DDL and DML changes, logging them to MySQL and triggering replication workflows. Jobs are represented as finite-state machines, supporting restartability and robust failure recovery.
HiveSync architecture: control plane and data plane separation (Source: Uber Blog Post)
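As a rough sketch of how such change capture can look, the snippet below extends Hive’s MetaStoreEventListener API to record DDL/DML changes that a downstream service could turn into replication jobs. This is not Uber’s implementation: the listener class name is made up, and the audit sink is a stub standing in for the asynchronous MySQL write described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;

public class ReplicationAuditListener extends MetaStoreEventListener {

    public ReplicationAuditListener(Configuration config) {
        super(config);
    }

    @Override
    public void onCreateTable(CreateTableEvent event) throws MetaException {
        audit("CREATE_TABLE",
              event.getTable().getDbName() + "." + event.getTable().getTableName());
    }

    @Override
    public void onAddPartition(AddPartitionEvent event) throws MetaException {
        audit("ADD_PARTITION",
              event.getTable().getDbName() + "." + event.getTable().getTableName());
    }

    private void audit(String eventType, String qualifiedName) {
        // Placeholder sink: a real listener would enqueue this record (e.g., into a MySQL
        // audit table) for the replication service to pick up asynchronously.
        System.out.printf("audit event=%s object=%s%n", eventType, qualifiedName);
    }
}
```

Such a listener is typically registered on the metastore via the hive.metastore.event.listeners configuration property.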
HiveSync has two main components: the HiveSync Replication Service and the Data Reparo Service. The Replication Service uses a Hive Metastore event listener to capture table and partition changes in real time, logging them asynchronously to MySQL. These audit entries are converted into asynchronous replication jobs executed as finite-state machines, with states persisted for reliability. Uber uses a hybrid copy strategy: smaller jobs use RPC for efficiency, while larger jobs leverage DistCp on YARN. A DAG manager enforces shard-level ordering and locking, while static and dynamic sharding enable horizontal scaling, ensuring consistent, conflict-free replication.
HiveSync replication service (Source: Uber Blog Post)
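The following sketch (again, not Uber’s code) illustrates the dispatch idea: each replication job behaves as a small state machine whose transitions are persisted, and the copy mechanism is chosen by data size. The state names, the 1 GiB threshold, and the console-based persistence are illustrative assumptions.

```java
public class ReplicationJob {

    enum State { PENDING, COPYING, COMMITTING, DONE, FAILED }

    // Hypothetical cutoff: small jobs copy via lightweight RPC, large jobs via DistCp on YARN.
    private static final long DISTCP_THRESHOLD_BYTES = 1L << 30; // 1 GiB

    private final String tablePartition;
    private final long bytesToCopy;
    private State state = State.PENDING;

    ReplicationJob(String tablePartition, long bytesToCopy) {
        this.tablePartition = tablePartition;
        this.bytesToCopy = bytesToCopy;
    }

    void run() {
        transition(State.COPYING);
        if (bytesToCopy < DISTCP_THRESHOLD_BYTES) {
            System.out.println("copying " + tablePartition + " via RPC file copy");
        } else {
            System.out.println("copying " + tablePartition + " via DistCp job on YARN");
        }
        transition(State.COMMITTING);   // apply Hive metadata changes in the target region
        transition(State.DONE);
    }

    private void transition(State next) {
        // Persisting every transition (e.g., to MySQL) is what makes the job restartable:
        // after a crash it resumes from the last recorded state instead of starting over.
        state = next;
        System.out.println(tablePartition + " -> " + state);
    }

    public static void main(String[] args) {
        new ReplicationJob("rides.trips/datestr=2024-05-01", 200L << 20).run();
        new ReplicationJob("eats.orders/datestr=2024-05-01", 5L << 30).run();
    }
}
```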
Data Reparo is a reconciliation service that continuously scans datacenter1 (DC1) and datacenter2 (DC2) for anomalies, such as missing or extra partitions and out-of-band HDFS updates, and fixes any mismatches to restore parity and maintain cross-region correctness, targeting over 99.99% accuracy. HiveSync maintains a four-hour replication SLA with a 99th-percentile lag of around 20 minutes, and it offers a one-time replication service for bootstrapping historical datasets into new regions or clusters before switching to incremental replication.
Data Reparo analyzes and resolves inconsistencies across data centers (Source: Uber Blog Post)
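The core of such reconciliation is a set comparison between the two regions. The sketch below (not Uber’s code) compares the partition listings of a table in DC1 and DC2 and reports what would need to be re-replicated or cleaned up; the in-memory sets stand in for metastore and HDFS listings.

```java
import java.util.Set;
import java.util.TreeSet;

public class PartitionReconciler {

    record Mismatch(Set<String> missingInDc2, Set<String> extraInDc2) {}

    static Mismatch reconcile(Set<String> dc1Partitions, Set<String> dc2Partitions) {
        Set<String> missing = new TreeSet<>(dc1Partitions);
        missing.removeAll(dc2Partitions);   // present in DC1 but absent in DC2
        Set<String> extra = new TreeSet<>(dc2Partitions);
        extra.removeAll(dc1Partitions);     // present in DC2 but not in DC1
        return new Mismatch(missing, extra);
    }

    public static void main(String[] args) {
        Set<String> dc1 = Set.of("datestr=2024-05-01", "datestr=2024-05-02", "datestr=2024-05-03");
        Set<String> dc2 = Set.of("datestr=2024-05-01", "datestr=2024-05-04");
        Mismatch m = reconcile(dc1, dc2);
        System.out.println("re-replicate to DC2: " + m.missingInDc2());
        System.out.println("investigate or remove in DC2: " + m.extraInDc2());
    }
}
```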
HiveSync operates at a massive scale, managing 800,000 Hive tables totaling approximately 300 petabytes of data, with individual tables ranging from a few gigabytes to tens of petabytes. Partitions per table vary from a few hundred to over a million. Each day, HiveSync processes over 5 million Hive DDL and DML events, replicating about 8 petabytes of data across regions.
Looking ahead, Uber plans to extend HiveSync for cloud replication use cases as batch analytics and ML pipelines migrate to Google Cloud, further leveraging sharding, orchestration, and reconciliation to maintain petabyte-scale data integrity efficiently.
