Meta recently open-sourced OpenZL, a new data compression framework for highly structured data that explicitly models schemas to achieve a better compression ratio and faster speeds than general-purpose tools like Zstandard (Zstd). The framework maintains operational simplicity via a universal decompressor that executes an embedded Plan, removing the need for external metadata and enabling fleet-wide updates from a single binary.
When Meta introduced Zstd nearly a decade ago, it became a cornerstone of large-scale data infrastructure thanks to its speed and efficiency. However, as workloads evolved, particularly those involving structured formats such as Protocol Buffers, database tables, and ML tensors, Meta engineers found that generic compression methods were leaving significant gains untapped. Traditional compressors treat data as raw byte streams, failing to leverage the inherent structure and patterns in modern datasets.
OpenZL takes a different approach by explicitly modeling data structures such as columnar layouts, enumerations, and repetitive patterns, rather than treating everything as an undifferentiated “byte soup.” This structured compression allows OpenZL to outperform general-purpose tools like standard Zstd on both compression ratio and speed for relevant datasets. Instead of guessing optimal techniques, OpenZL applies a configurable sequence of reversible transforms to expose latent order in the data before the final entropy-coding stage.
(Source: Meta blog post)
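The transform-then-entropy-code pipeline described above can be sketched in a few lines of Python. This is an illustrative toy, not OpenZL's API: zlib stands in for the final entropy-coding stage, and delta encoding serves as one example of a reversible transform in the plan.

```python
import zlib

def delta_encode(values: list[int]) -> list[int]:
    # Replace each value with its difference from the predecessor,
    # exposing runs of small numbers in slowly varying data.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# A "plan" here is an ordered list of (forward, inverse) transform pairs.
PLAN = [(delta_encode, delta_decode)]

def compress(values: list[int]) -> bytes:
    for fwd, _ in PLAN:
        values = fwd(values)
    # Final entropy-coding stage (zlib stands in for a real codec).
    raw = b"".join(v.to_bytes(4, "little", signed=True) for v in values)
    return zlib.compress(raw)

def decompress(blob: bytes) -> list[int]:
    raw = zlib.decompress(blob)
    values = [int.from_bytes(raw[i:i + 4], "little", signed=True)
              for i in range(0, len(raw), 4)]
    # Invert the transforms in reverse order to recover the input.
    for _, inv in reversed(PLAN):
        values = inv(values)
    return values
```

On slowly varying data such as a monotonic sequence, the delta pass turns the stream into long runs of small values, which the entropy stage then compresses far better than the raw bytes.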
A key operational advantage of OpenZL is its universal decompressor. Compression Plans are generated offline by a component called the trainer, which analyzes the provided data schema and produces an optimized Plan. During encoding, this plan is converted into a concrete decode recipe and embedded directly within the compressed frame.
This model means that:
- Every OpenZL file, regardless of its custom transform sequence, can be decompressed using the same binary.
- The decoder requires no external metadata—it executes the embedded recipe.
- Retraining or re-optimizing compression plans can improve performance without altering the universal decoder, ensuring backward compatibility.
Meta engineers emphasize that this operational simplicity is critical for data center deployments: one audited decompression surface, fleet-wide updates from a single binary, and clear version control across large infrastructures.
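The self-describing-frame idea can be sketched as follows. The frame layout, transform IDs, and registry below are invented for illustration and bear no relation to OpenZL's actual wire format; the point is only that the decoder needs no metadata beyond the frame itself.

```python
import zlib

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# The "universal decoder" knows only a fixed registry of inverse transforms.
INVERSE_REGISTRY = {1: delta_decode}  # hypothetical id 1 = delta

def encode_frame(values, recipe):
    # recipe: list of transform ids the encoder applies in order.
    for tid in recipe:
        if tid == 1:
            values = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    payload = zlib.compress(
        b"".join(v.to_bytes(4, "little", signed=True) for v in values))
    # Frame = recipe length, recipe ids, payload: self-describing.
    return bytes([len(recipe), *recipe]) + payload

def decode_frame(frame):
    n = frame[0]
    recipe, payload = list(frame[1:1 + n]), frame[1 + n:]
    raw = zlib.decompress(payload)
    values = [int.from_bytes(raw[i:i + 4], "little", signed=True)
              for i in range(0, len(raw), 4)]
    # Execute the embedded recipe in reverse -- no external metadata needed.
    for tid in reversed(recipe):
        values = INVERSE_REGISTRY[tid](values)
    return values
```

Because the recipe travels inside the frame, the same `decode_frame` handles every file, whatever transform sequence produced it.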
In internal benchmarks on structured datasets (e.g., the sao star-catalog file from the Silesia compression corpus), OpenZL showed substantial gains. By parsing the structured records, splitting them into per-field streams (a structure-of-arrays layout), and applying domain-aware transforms such as delta encoding, it achieved:
| Compressor | Compressed Size | Compression Ratio | Compression Speed | Decompression Speed |
|---|---|---|---|---|
| zstd -3 | 5,531,935 B | ×1.31 | 220 MB/s | 850 MB/s |
| xz -9 | 3,516,649 B | ×2.06 | 3.5 MB/s | 45 MB/s |
| OpenZL | 4,414,351 B | ×1.64 | 340 MB/s | 1200 MB/s |
Crucially, OpenZL delivered a better compression ratio than zstd -3 while also improving both compression and decompression speed; only the far slower xz -9 achieved a smaller output.
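The field-splitting and delta steps described above can be illustrated with a toy example. The record layout here is hypothetical, and zlib again stands in for the entropy stage; the effect, however, is the same one OpenZL exploits: per-field streams are far more regular than interleaved row bytes.

```python
import struct
import zlib

# Hypothetical fixed-width records: (sensor_id: u16, timestamp: u32, reading: u16).
records = [(i % 4, 1_700_000_000 + i, 500 + (i % 7)) for i in range(5000)]

# Row-major layout: fields interleaved, as a generic compressor sees them.
row_major = b"".join(struct.pack("<HIH", *r) for r in records)

# Structure-of-arrays layout: each field in its own contiguous stream.
ids = b"".join(struct.pack("<H", r[0]) for r in records)
readings = b"".join(struct.pack("<H", r[2]) for r in records)

# Delta-encode the monotonically increasing timestamps before packing.
ts = [r[1] for r in records]
ts_delta = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
ts_bytes = b"".join(struct.pack("<I", d) for d in ts_delta)

baseline = len(zlib.compress(row_major))
structured = len(zlib.compress(ids + ts_bytes + readings))
# structured < baseline: field splitting plus delta exposes the regularity.
```

The timestamp deltas collapse to a run of ones and the id/reading streams become short repeating patterns, so the structured layout compresses substantially better than the interleaved rows.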
Users can describe their data structure using the Simple Data Description Language (SDDL) or a custom parser function. The offline trainer then uses a budgeted search over transform choices to generate an optimal compression Plan.
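A budgeted search of this kind can be sketched as follows. The transform set, the budget accounting, and the compressed-size scoring are illustrative assumptions, not the trainer's actual algorithm: the sketch simply tries candidate transform sequences on a training sample and keeps the one that compresses smallest.

```python
import itertools
import zlib

def identity(vals):
    return vals

def delta(vals):
    # Difference consecutive values to expose slowly varying structure.
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

# Hypothetical candidate transforms the trainer may choose from.
TRANSFORMS = {"identity": identity, "delta": delta}

def serialize(vals):
    return b"".join(v.to_bytes(8, "little", signed=True) for v in vals)

def train_plan(sample, max_len=2, budget=10):
    # Budgeted search: evaluate candidate transform sequences on a
    # training sample, scoring each by its compressed size.
    best_plan = []
    best_size = len(zlib.compress(serialize(sample)))
    evals = 0
    for n in range(1, max_len + 1):
        for names in itertools.product(TRANSFORMS, repeat=n):
            if evals >= budget:
                return best_plan
            vals = sample
            for name in names:
                vals = TRANSFORMS[name](vals)
            size = len(zlib.compress(serialize(vals)))
            evals += 1
            if size < best_size:
                best_plan, best_size = list(names), size
    return best_plan
```

On a monotonic sample such as a range of integers, the search settles on a plan that includes the delta transform, since differencing makes the entropy stage's job trivial.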
Unlike some experimental formats that embed general-purpose code (such as WebAssembly) for decompression, OpenZL limits execution to a deterministic graph. This ensures reproducible decoding, a key property for long-term data archival. As one Hacker News commenter noted, while sandboxing WebAssembly is easy:
The real problem is determinism—function calls made to those WebAssembly modules may still be nondeterministic!
By contrast, OpenZL’s fixed execution graph guarantees deterministic decompression behavior.
OpenZL performs best on structured data, such as time-series datasets, ML tensors, and database tables. Where structure is minimal (e.g., pure text), OpenZL intelligently falls back to using Zstd. Abelardo Fukasawa, a researcher at Quantls Infinity, reinforced this point, stating:
Instead of treating every format the same (as gzip or standard compressors do), it adapts its compression to the specific structure of the data using SDDL—often yielding better ratios and throughput on structured workloads.
The framework is publicly available on GitHub for developers to experiment with and contribute to.
