The Apache Software Foundation has recently announced the general availability of Apache Hudi 1.0, the transactional data lake platform with support for near real-time analytics. Initially introduced in 2017, Apache Hudi provides an open table format optimized for efficient writes in incremental data pipelines and fast query performance.
Originally developed at Uber as an incremental processing framework on Apache Hadoop and submitted to the Apache Software Foundation in 2019, Hudi is designed to bring database-like functionality to open data lakehouse architectures. Hudi’s main strength lies in its ability to support both near real-time and batch queries simultaneously.
The latest release introduces new features aimed at transforming data lakehouses into what the project community considers a fully-fledged “Data Lakehouse Management System” (DLMS). Vinoth Chandar, creator of the Hudi Project at Uber and CEO at Onehouse, writes:
Hudi shines by providing a high-performance open table format as well as a comprehensive open-source software stack that can ingest, store, optimize and effectively self-manage a data lakehouse. This distinction between open formats and open software is often lost in translation inside the large vendor ecosystem in which Hudi operates. Still, it has been and remains a key consideration for Hudi’s users to avoid compute-lockin to any given data vendor.
Released under the Apache License 2.0, Hudi 1.0 introduces a new secondary indexing system designed to enhance query performance and reduce data scanning costs. Users can now create SQL-based indexes on secondary columns, significantly speeding up query execution. The release also includes expression-based indexing, similar to PostgreSQL’s expression indexes, which can replace traditional partitioning strategies and enable more flexible and efficient data organization. When the preview was announced last year, Boris Litvak, principal software engineer at Snyk, wrote:
Among the big 3 ACID storage formats on Object Storage, Apache Hudi 1.0 (beta) is the first one introducing “functional indexes” over the data. We usually call it “secondary indexes” in SQL DB jargon. When will Delta.io and Apache Iceberg follow?
Source: Apache Hudi Blog
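For a sense of how this surfaces to users, the PySpark sketch below creates a secondary index and an expression index through Spark SQL, following the syntax described in the release notes. The table, column, and index names are placeholders, and option names such as expr and format are assumptions to check against the Hudi 1.0 documentation.

```python
from pyspark.sql import SparkSession

# Minimal session; assumes the Hudi 1.0 Spark bundle is already on the classpath.
spark = (
    SparkSession.builder
    .appName("hudi-index-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.hudi.spark.sql.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Secondary index on a non-key column: filters and point lookups on `city`
# can prune files instead of scanning the whole table.
spark.sql("CREATE INDEX idx_city ON hudi_table USING secondary_index(city)")

# Expression index: index an expression over a column (here, a date string
# derived from an epoch timestamp), so time-range filters can skip data
# without relying on a matching physical partitioning scheme.
spark.sql("""
    CREATE INDEX idx_datestr ON hudi_table
    USING column_stats(ts)
    OPTIONS(expr='from_unixtime', format='yyyy-MM-dd')
""")

# Queries filtering on the indexed column or expression can now skip irrelevant files.
spark.sql("SELECT * FROM hudi_table WHERE city = 'sunnyvale'").show()
```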
The release introduces support for partial updates, which improves storage and compute efficiency by allowing updates to specific fields instead of entire rows. Additionally, non-blocking concurrency control enables multiple streaming jobs to write to the same dataset without causing bottlenecks or failures. Discussing the database architecture, Chandar adds:
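A rough sketch of how both capabilities might look from a writer’s perspective is shown below: a MERGE INTO that sets only some columns acts as a partial update, and the concurrency mode is selected through writer configuration. Table and column names are placeholders, and the concurrency-mode key is an assumption to verify against the Hudi 1.0 write configuration reference.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.hudi.spark.sql.HoodieSparkSessionExtension")
    .getOrCreate()
)

# Partial update: only the columns named in SET are written for matched rows,
# rather than rewriting every field of the record.
spark.sql("""
    MERGE INTO hudi_table AS t
    USING fare_updates AS u
    ON t.uuid = u.uuid
    WHEN MATCHED THEN UPDATE SET t.fare = u.fare, t.ts = u.ts
""")

# Non-blocking concurrency control (config keys are illustrative): allows
# several writers, e.g. two streaming jobs, to append to the same
# merge-on-read table concurrently, deferring conflict resolution to compaction.
nbcc_opts = {
    "hoodie.table.name": "hudi_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.write.concurrency.mode": "NON_BLOCKING_CONCURRENCY_CONTROL",
}
(
    spark.table("fare_updates")
    .write.format("hudi")
    .options(**nbcc_opts)
    .mode("append")
    .save("/tmp/warehouse/hudi_table")
)
```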
Regarding full-fledged DLMS functionality, the closest experience Hudi 1.0 offers is through Apache Spark. Users can deploy a Spark server (or Spark Connect) with Hudi 1.0 installed, submit SQL/jobs, orchestrate table services via SQL commands, and enjoy new secondary index functionality to speed up queries like a DBMS.
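As an illustration of that workflow, the sketch below configures a quickstart-style Spark session with the Hudi 1.0 bundle, runs a query, and triggers table services through SQL call procedures. The bundle coordinates and procedure arguments are assumptions rather than prescribed settings.

```python
from pyspark.sql import SparkSession

# A long-running Spark session (or Spark Connect server) with Hudi 1.0 installed,
# configured along the lines of the quickstart; package coordinates are illustrative.
spark = (
    SparkSession.builder
    .appName("hudi-dlms-sketch")
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions",
            "org.apache.hudi.spark.sql.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)

# Ordinary SQL queries are served by the same session and can use the new indexes.
spark.sql("SELECT count(*) FROM hudi_table WHERE city = 'sunnyvale'").show()

# Table services orchestrated through SQL call procedures
# (procedure names and arguments are illustrative; check the SQL procedures docs).
spark.sql("CALL run_compaction(op => 'schedule', table => 'hudi_table')").show()
spark.sql("CALL run_clustering(table => 'hudi_table')").show()
```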
Hudi 1.0 introduces enhancements to the storage engine, including the adoption of a log-structured merge (LSM) tree for efficient timeline management. This supports long-term data retention and ensures high-performance query planning, even for datasets containing billions of records. Bhavani Sudha Saktheeswaran, software engineer at Onehouse and Apache Hudi PMC member, comments:
Whether you’re building an open data platform, streaming into the data lakehouse, moving away from data warehouses, or optimizing for high-performance queries, Hudi 1.0.0 makes it easier than ever to work with lakehouses.
Saktheeswaran and Saketh Chintapalli, software engineer at Uber, presented a session on incremental data processing with Apache Hudi at QCon San Francisco. The session recording is available on InfoQ.