Data contracts define the interface between data providers and consumers, specifying things like data models, quality guarantees, and ownership. According to Jochen Christ, they are essential for distributed data ownership in data mesh, ensuring data is discoverable, interoperable, and governed. Data contracts improve communication between teams and enhance the reliability and quality of data products.
Jochen Christ spoke about data contracts at the OOP conference.
Data contracts are for data products what APIs are for software systems, Christ said. They are an interface specification between a data provider and their data consumers. Data contracts specify the provided data model with its syntax, format, and semantics, but also contain data quality guarantees, service-level objectives, and terms and conditions for using the data, Christ mentioned. They also define the owner of the provided data product, who is responsible if there are any questions or issues, he added.
Data mesh is an important driver for data contracts, as data mesh introduces distributed ownership of data products, Christ said. Before that, we usually had just one central team that was responsible for all data and BI activities, with no need to specify interfaces with other teams.
With a data mesh, we have multiple teams that exchange their data products over a shared infrastructure. This shift requires clear, standardized interfaces between teams to ensure data is discoverable, interoperable, and governed effectively, Christ explained:
Data contracts provide a way to formalize these interfaces, enabling teams to independently develop, maintain, and consume data products while adhering to platform-wide standards.
Christ mentioned that the main challenge teams face when exchanging data sets is to understand domain semantics. He gave some examples:
If there is a field called “order_timestamp”, is it the timestamp when the customer clicked on “buy now”, is it the payment succeeded event, or is it the order confirmation email?
Another example is enumerations, such as a “status” field, which highly depends on the implemented business process and exception-handling routines.
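Such enumeration semantics can be documented directly in the contract. As a sketch, a hypothetical “status” field might be specified as follows (the field name, values, and comments are illustrative, not from Christ's talk):

```yaml
fields:
  status:
    type: text
    description: Lifecycle state of the order in the checkout process.
    enum:
      - PENDING    # order placed, payment not yet confirmed
      - PAID       # payment succeeded
      - SHIPPED    # handed over to the carrier
      - CANCELLED  # cancelled by the customer or by exception handling
```

Spelling out each allowed value and its business meaning removes the guesswork that otherwise depends on the implemented business process.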
Data contracts are written in YAML, so they are machine-readable, Christ said. Tools like Data Contract CLI can extract syntax, format, and quality checks from the data contract, connect to the data product, and test that the data product complies with the data contract specification. When these checks are included in a CI/CD deployment pipeline or data pipeline, data engineers can ensure that their data products are valid, Christ mentioned.
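As a sketch of how such a check can run in a pipeline, a CI step using the Data Contract CLI might look like the following (the GitHub Actions step and connection setup are illustrative assumptions; `datacontract test` is the CLI's command for verifying a data product against its contract):

```yaml
# Illustrative CI step: fail the pipeline if the data product
# no longer complies with its data contract.
- name: Test data contract
  run: |
    pip install datacontract-cli
    datacontract test datacontract.yaml
```

If any schema or quality check fails, the command exits with a non-zero status, which fails the pipeline and blocks the deployment.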
Data users can rely on data contracts when consuming data from other teams, especially when data contracts are automatically tested and enforced, Christ said. This is a significant improvement compared to earlier practices, where data engineers had to manually trace a field's entire lineage to determine whether it was appropriate and trustworthy for their use case, he explained:
By formalizing and automating these guarantees, data contracts make data consumption more efficient and reliable.
Data providers benefit by gaining visibility into which consumers are accessing their data. Permissions can be automated accordingly, and when changes need to be implemented in a data product, a new version of the data contract can be introduced and communicated with the consumers, Christ said.
With data contracts, we have very high-quality metadata, Christ said. This metadata can be further leveraged to optimize governance processes or build an enterprise data marketplace, enabling better discoverability, transparency, and automated access management across the organization to make data available for more teams.
Data contracts are transforming the way data teams collaborate, Christ explained:
For example, we can use data contracts as a tool for requirements engineering. A data consumer team can propose a draft data contract specifying the information they need for a particular use case. This draft serves as a basis for discussions with the data providers about whether the information is available in the required semantics or what alternatives might be feasible.
Christ called this contract-first development. In this way, data contracts foster better communication between teams, he concluded.
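A contract-first draft from a consumer team could be as small as the following sketch (all titles and field names are hypothetical placeholders for the consumer's requirements, not part of Christ's example):

```yaml
dataContractSpecification: 1.1.0
info:
  title: Customer Churn Features (draft)
  owner: tbd  # to be negotiated with the providing team
models:
  churn_features:
    type: table
    fields:
      customer_id:
        type: text
        format: uuid
      last_order_date:
        type: date
        description: Date of the most recent completed order.
```

The provider team can then respond to the draft, confirming which fields are available with the requested semantics and proposing alternatives for the rest.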
InfoQ interviewed Jochen Christ about data contracts.
InfoQ: What do data contracts look like?
Jochen Christ: Data contracts are usually expressed as YAML documents, similar to OpenAPI specifications.
```yaml
dataContractSpecification: 1.1.0
info:
  title: Orders Latest
  owner: Checkout Team
terms:
  usage: Data can be used for AI use cases.
models:
  orders:
    type: table
    description: All webshop orders since 2020
    fields:
      order_id:
        type: text
        format: uuid
      order_total:
        description: Total amount in cents.
        type: long
        required: true
        examples:
          - 9999
```
InfoQ: How do data contracts support exchanging data sets between teams?
Christ: With data contracts, we have a technology-neutral way to express the semantics, and we can define data quality checks in the contract to test these guarantees and expectations.
Here is a quick example:
```yaml
order_total:
  description: |
    Total amount in the smallest monetary unit (e.g., cents).
    The amount includes all discounts and shipping costs.
    The amount can be zero, but never negative.
  type: long
  required: true
  minimum: 0
  examples:
    - 9999
  classification: restricted
  quality:
    - type: sql
      description: 95% of all values are expected to be between 10 and 499 EUR.
      query: |
        SELECT quantile_cont(order_total, 0.95) AS percentile_95
        FROM orders
      mustBeBetween: [1000, 49900]
```
This is the metadata specification of a field “order_total”, which defines not only the technical type (long) but also the business semantics that help consumers interpret the values; for example, it is important to understand that the amount is not in EUR, but in cents. There is a security classification defined (“restricted”), and the quality attribute defines business expectations that we can use to validate whether a dataset is valid or likely corrupt.
InfoQ: How can we use data contracts to generate code and automated tests?
Christ: In the previous “order_total” example, the data quality SQL query can be used by data quality tools (such as the Data Contract CLI) to execute data quality checks in deployment pipelines.
In the same way, the CLI can generate code, such as SQL DDL statements, language-specific data models, or HTML exports from the data model in the data contract.
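For the “order_total” model above, the generated SQL DDL might look roughly like the following (the exact output depends on the target SQL dialect and the generator's conventions; this is an illustrative sketch, not verbatim CLI output):

```sql
-- Illustrative DDL a generator could derive from the contract's model
CREATE TABLE orders (
  order_id    TEXT,                -- format: uuid
  order_total BIGINT NOT NULL      -- total amount in cents, never negative
);
```

Generating such artifacts from the contract keeps the physical schema, the language-specific data models, and the documentation in sync with a single source of truth.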