AWS has recently announced that Amazon S3 now supports sort and z-order compaction for Apache Iceberg tables. The new features reduce scan times and engine costs, and are available for both S3 Tables and general-purpose S3 buckets using AWS Glue Data Catalog optimization.
Sort compaction minimizes the number of data files scanned by query engines, while z-order compaction provides additional performance benefits through efficient file pruning when querying across multiple columns simultaneously. Sébastien Stormacq, principal developer advocate at AWS, explains:
When working with high-ingest or frequently updated datasets, data lakes can accumulate many small files that impact the cost and performance of your queries. (…) Although the default binpack strategy with managed compaction provides notable performance improvements, introducing sort and z-order compaction options for both S3 and S3 Tables delivers even greater gains for queries filtering across one or more dimensions.
Sort compaction organizes files based on a user-defined column order. When a table has a defined sort order, S3 Tables compaction now uses that order to cluster similar values together while rewriting files.
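As a rough illustration, a sort order can be declared on an Iceberg table through Spark SQL's WRITE ORDERED BY extension. The following is a minimal sketch, assuming a Spark session configured with the Iceberg SQL extensions and a catalog named s3tablesbucket pointing at an S3 Tables table bucket; table and column names are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.extensions includes IcebergSparkSessionExtensions and a
# catalog named "s3tablesbucket" is configured for the S3 Tables table bucket.
spark = SparkSession.builder.appName("define-sort-order").getOrCreate()

# Declare a hierarchical sort order; managed compaction can then cluster rows
# by customer_id first, then order_date, when it rewrites data files.
spark.sql("""
    ALTER TABLE s3tablesbucket.analytics.orders
    WRITE ORDERED BY customer_id, order_date
""")
```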
In Apache Iceberg, compaction can be used to combine small files into larger ones (bin packing), merge delete files with data files, sort data according to query patterns, or cluster data using space-filling curves to optimize for distinct query patterns (z-order sorting).
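Outside of managed compaction, these strategies can also be triggered manually with Iceberg's rewrite_data_files Spark procedure. The sketch below uses placeholder catalog, table, and column names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

# Bin packing: combine small files into larger ones without reordering rows.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.orders',
        strategy => 'binpack'
    )
""")

# Sort compaction: cluster rows by the given columns while rewriting files.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.orders',
        strategy => 'sort',
        sort_order => 'customer_id ASC NULLS LAST, order_date DESC NULLS LAST'
    )
""")

# Z-order compaction: interleave several columns on a space-filling curve so
# filters on any of them can prune files effectively.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.orders',
        strategy => 'sort',
        sort_order => 'zorder(customer_id, order_date)'
    )
""")
```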
S3 Tables provides a managed experience with automatic hierarchical sorting during compaction, based on the sort order defined in the table metadata. For equal prioritization of multiple query predicates, z-order compaction can be enabled via the maintenance API. For Iceberg tables in general-purpose S3 buckets, the compaction method can be configured in the Glue Data Catalog console. Stormacq adds:
In my experience, depending on my data layout and query patterns, I observed performance improvements of threefold or more when switching from binpack to sort or z-order.
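For S3 Tables, switching the compaction strategy through the maintenance API might look like the following boto3 sketch. The table bucket ARN, namespace, and table name are placeholders, and the strategy setting shown for the new option is an assumption rather than a confirmed field name.

```python
import boto3

s3tables = boto3.client("s3tables")

# Enable Iceberg compaction maintenance on a table and select a strategy.
s3tables.put_table_maintenance_configuration(
    tableBucketARN="arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket",
    namespace="analytics",
    name="orders",
    type="icebergCompaction",
    value={
        "status": "enabled",
        "settings": {
            "icebergCompaction": {
                "targetFileSizeMB": 512,
                "strategy": "z-order",  # assumed name/value for the new strategy setting
            }
        },
    },
)
```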
Ruben Simon, product manager at BMW, comments:
At BMW’s largest big data analytics platform, using thousands of S3 buckets and Iceberg tables, we saw major query performance gains with Z-ordering. (…) Bloom filters next would make it even more powerful.
In the article “S3 Managed Tables, Unmanaged Costs: The 20x Surprise with AWS S3 Tables”, Vinish Reddy Pannala, software engineer at Onehouse.ai, and Kyle Weller, VP of Product at Onehouse.ai, question the lack of configurable options for compaction:
Roughly 3 hours after the table was created, S3 Tables finally triggered compaction executing 10 replace operations and compacting approximately 100 GB of data over the course of 1 hour. (…) This exposes a deeper flaw in the S3Tables approach, where it does not recognize that ideal compaction configurations are specific to different types of readers and writers.
Existing compacted files remain unchanged, and only new data written after enabling sort or z-order is affected, unless the customer explicitly rewrites the data using standard Iceberg tools (for example, a full rewrite with rewrite_data_files, sketched further below) or increases the target file size in the table maintenance settings. Yonatan Dolan, principal analytics specialist at AWS, warns:
Everyone talks about Sort, Z-order, and BinPack compaction when tuning query performance in Apache Iceberg – and yes, sorting helps (when done right), and Z-order can outperform bin-packing on the right queries. But in my benchmarks using TPC-H SF100 lineitem (~600M rows / 17GB compressed), I found something even more influential: The starting size of your files before compaction can massively impact cost.
Source: Yonatan Dolan’s post
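To apply a new layout to data that was already compacted, a full rewrite can be forced with Iceberg's rewrite_data_files procedure and its rewrite-all option; a minimal sketch with placeholder catalog, table, and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("full-rewrite").getOrCreate()

# 'rewrite-all' asks the procedure to rewrite every data file, not only the
# undersized ones, so the z-order layout also applies to existing data.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.orders',
        strategy => 'sort',
        sort_order => 'zorder(customer_id, order_date)',
        options => map('rewrite-all', 'true')
    )
""")
```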
The new compaction options are available in all regions where S3 Tables are supported and, for general-purpose S3 buckets, where optimization with the AWS Glue Data Catalog is available. There are no additional charges for the new features.