At the Databricks Data+AI Summit, held in San Francisco, USA, from June 10 to 12, Databricks announced that it is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project, where it will be called Spark Declarative Pipelines. This move will make it easier for Spark users to develop and maintain streaming pipelines, and furthers Databricks' commitment to open source.
The new feature will allow developers to define streaming data pipelines without having to write the usual imperative Spark commands. While the changes simplify the task of writing and maintaining pipeline code, users will still need to understand the runtime behavior of Spark and be able to troubleshoot performance and correctness issues.
In a blog post describing the new feature, Databricks wrote that pipelines can be defined using SQL syntax or via a simple Python SDK that declares the streaming data sources, tables, and their relationships, rather than by writing imperative Spark commands. The company claims this will reduce the need for orchestrators such as Apache Airflow to manage pipelines.
Behind the scenes, the framework interprets the queries, then builds a dependency graph and an optimized execution plan.
Declarative Pipelines supports streaming tables, fed from streaming data sources such as Apache Kafka topics, and materialized views for storing aggregates and results. The materialized views are updated automatically as new data arrives in the streaming tables.
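As an illustration, a streaming table fed by a Kafka topic might be declared as follows. This sketch uses the read_kafka table-valued function from Databricks SQL, since the announcement does not show a Kafka example; the broker address, topic, and table name are hypothetical, and the connector syntax in the open source release may differ.
-- Illustrative only: ingest a Kafka topic into a streaming table
-- (broker address, topic, and table name are hypothetical)
CREATE OR REFRESH STREAMING TABLE clickstream_raw
AS SELECT *
FROM STREAM read_kafka(
  bootstrapServers => 'broker-1:9092',
  subscribe => 'clickstream'
);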
Databricks provides an overview of the SQL syntax in its documentation; an excerpt is shown here. The example is based on the New York City TLC Trip Record Data set.
-- Bronze layer: Raw data ingestion
CREATE OR REFRESH STREAMING TABLE taxi_raw_records
(CONSTRAINT valid_distance EXPECT (trip_distance > 0.0) ON VIOLATION DROP ROW)
AS SELECT *
FROM STREAM(samples.nyctaxi.trips);
-- Silver layer 1: Flagged rides
CREATE OR REFRESH STREAMING TABLE flagged_rides
AS SELECT
  date_trunc("week", tpep_pickup_datetime) as week,
  pickup_zip as zip,
  fare_amount, trip_distance
FROM
  STREAM(LIVE.taxi_raw_records)
WHERE ((pickup_zip = dropoff_zip AND fare_amount > 50) OR
       (trip_distance < 5 AND fare_amount > 50));
The example shows how a pipeline can be built by defining streams with the CREATE STREAMING TABLE command and then consuming them with a FROM clause in subsequent queries. Of note in the example is the ability to include data quality checks in the pipeline with the CONSTRAINT … EXPECT … ON VIOLATION syntax.
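Continuing the excerpt, a gold-layer materialized view could maintain aggregates over the flagged rides; as noted above, it is refreshed automatically as new data lands in the streaming table. The table name and aggregate columns below are illustrative rather than taken from the Databricks documentation.
-- Gold layer (illustrative): weekly aggregates over the flagged rides
CREATE OR REFRESH MATERIALIZED VIEW weekly_fare_stats
AS SELECT
  week,
  zip,
  COUNT(*) AS flagged_trips,
  AVG(fare_amount) AS avg_fare
FROM LIVE.flagged_rides
GROUP BY week, zip;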
While the Apache Spark changes are not yet released, many articles already describe the experience of engineers using Databricks DLT. In a Medium article titled “Why I Liked Delta Live Tables in Databricks,” Mariusz Kujawski describes the features of DLT and how best to use them: “With DLT, you can build an ingestion pipeline in just a few hours, compared to the days required to develop a custom framework. Additionally, built-in data quality enforcement provides an extra layer of reliability.”
In addition to a declarative syntax for defining pipelines, Spark Declarative Pipelines also supports change data capture (CDC), batch and streaming logic, built-in retry logic, and observability hooks.
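For change data capture, DLT's existing SQL syntax merges a change feed into a streaming table with APPLY CHANGES INTO; a minimal sketch is shown below, with hypothetical table and column names. Whether Spark Declarative Pipelines adopts this exact syntax is not confirmed by the announcement.
-- Illustrative CDC flow using DLT's APPLY CHANGES INTO syntax
-- (table and column names are hypothetical)
CREATE OR REFRESH STREAMING TABLE customers;
APPLY CHANGES INTO LIVE.customers
FROM STREAM(LIVE.customers_cdc_feed)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequence_num
STORED AS SCD TYPE 1;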
Spark Declarative Pipelines is in the process of being merged into the Apache Spark project. The feature is planned for the next Spark release, 4.1.0, which is expected in January 2026. Progress can be followed in the Apache Spark Jira project under ticket SPARK-51727.