At the latest re:Invent conference in Las Vegas, Amazon announced the general availability of AWS Glue 5.0, designed to accelerate ETL jobs powered by Apache Spark. The latest release of the serverless data integration service introduces upgraded runtimes, including Spark 3.5.2, Python 3.11, and Java 17, along with enhancements in performance and security.
AWS Glue is a serverless data integration service that helps teams develop, run, and scale data integration workloads and get insights faster, and that simplifies preparing and integrating data from multiple sources. The 5.0 release supports advanced features for open table formats, including Apache Iceberg, Delta Lake, and Apache Hudi. It also promises faster job start times, automatic partition pruning, and native access to Amazon S3.
Spark 3.5.2 brings significant improvements to Glue 5.0, including support for Arrow-optimized Python UDFs, Python user-defined table functions (UDTFs), and the RocksDB state store provider as a built-in state store implementation, along with numerous improvements to Spark Structured Streaming. Additionally, AWS Glue 5.0 updates its open table format libraries to Apache Hudi 0.15.0, Apache Iceberg 1.6.1, and Delta Lake 3.2.1.
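To make the Python-side features more concrete, here is a minimal sketch of an Arrow-optimized Python UDF and a Python user-defined table function as PySpark 3.5 exposes them. The names `normalize_sku`, `SplitTags`, and `build_spark_functions` are hypothetical and do not come from the announcement, and the sketch assumes an environment with PySpark 3.5 and PyArrow available, such as a Glue 5.0 job.

```python
def normalize_sku(raw: str) -> str:
    """Plain Python logic; Spark calls it once per input row."""
    return raw.strip().upper()


class SplitTags:
    """UDTF handler: emits one output row per comma-separated tag."""

    def eval(self, tags: str):
        for tag in tags.split(","):
            # Each yielded tuple becomes one row of the result table.
            yield (tag.strip(),)


def build_spark_functions():
    """Wrap the plain Python code as Spark functions (needs PySpark 3.5+)."""
    # Imported lazily so the logic above stays usable without Spark installed.
    from pyspark.sql.functions import udf, udtf

    # useArrow=True opts this Python UDF into Arrow-based serialization.
    sku_udf = udf(normalize_sku, returnType="string", useArrow=True)
    # A UDTF returns a table; its schema is given here as a DDL string.
    split_tags = udtf(SplitTags, returnType="tag: string")
    return sku_udf, split_tags
```

Inside a job, `sku_udf` can be applied to a column like any other UDF, for example `df.select(sku_udf("sku"))`, while `split_tags(lit("a, b"))` returns a DataFrame with one row per tag.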
According to the team behind the project, the performance improvements will help reduce costs for data integration workloads:
“AWS Glue 5.0 improves the price-performance of your AWS Glue jobs. (…) The TPC-DS dataset is located in an S3 bucket in Parquet format, and we used 30 G.2X workers in AWS Glue. We observed that our AWS Glue 5.0 TPC-DS tests on Amazon S3 were 58% faster than that on AWS Glue 4.0 while reducing cost by 36%.”
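The two figures in the quote are related, and a short calculation shows how. This is only a sketch under two assumptions that the quote does not state: that "58% faster" means the old runtime divided by the new runtime equals 1.58, and that cost scales linearly with runtime because both runs use the same workers at the same hourly rate. The function name `runtime_reduction` is illustrative.

```python
def runtime_reduction(percent_faster: float) -> float:
    """Fraction of runtime saved when a job runs `percent_faster` % faster.

    Assumes "X% faster" means old_runtime / new_runtime == 1 + X / 100.
    """
    speedup = 1 + percent_faster / 100
    return 1 - 1 / speedup


saved = runtime_reduction(58)
# With identical workers and pricing, cost falls by the same fraction.
print(f"Runtime and cost drop by about {saved:.1%}")  # about 36.7%
```

Under these assumptions the saving comes out at roughly 36.7%, which is in line with the quoted 36% cost reduction; the small gap may come from rounding or billing details that the quote does not spell out.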
Within the AWS ecosystem, Glue 5.0 supports native integration with SageMaker Lakehouse, enabling unified access across Amazon Redshift data warehouses and S3 data lakes. Additionally, SageMaker Unified Studio supports Glue 5.0 as the compute runtime for unified notebooks and the visual ETL flow editor. The team has also published an article explaining how to enforce fine-grained access control (FGAC) on data lake tables using Glue 5.0 integrated with Lake Formation. They write:
“FGAC enables you to granularly control access to your data lake resources at the table, column, and row levels. (…) Using AWS Glue 5.0 with Lake Formation lets you enforce a layer of permissions on each Spark job to apply Lake Formation permissions control when AWS Glue runs jobs (…) This feature can save you effort and encourage portability while migrating Spark scripts to different serverless environments such as AWS Glue and Amazon EMR.”
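As an illustration of what a column-level permission looks like, the sketch below builds a Lake Formation grant that allows `SELECT` on selected columns only, using the request shape of boto3's `grant_permissions` call. The helper names, the principal ARN, and the database, table, and column names are all hypothetical; row-level filtering is configured separately and is not shown.

```python
def column_level_select_grant(principal_arn, database, table, columns):
    """Build a request granting SELECT on specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become readable for the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request):
    """Send the request to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical example: an analyst role may read two columns of sales.orders.
grant = column_level_select_grant(
    "arn:aws:iam::123456789012:role/analyst",
    "sales",
    "orders",
    ["order_id", "order_date"],
)
```

Keeping the request builder separate from the API call makes the permission easy to review or unit test before anything is applied to an account.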
Adriano Nicolucci, principal consultant at Slalom, published a video about Glue 5.0 and comments:
“If you’re running ETL workflows, these enhancements will boost performance, cut costs, and streamline operations.”
Glue 5.0 is now generally available in all AWS regions where Glue is supported.
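For readers who want to try the release, a job opts in through its Glue version setting. The following is a minimal sketch based on boto3's `create_job` parameters; the helper names, job name, role ARN, script location, and worker count are placeholders rather than values from the announcement.

```python
def glue5_job_definition(name, role_arn, script_location, workers=10):
    """Build create_job arguments for a Spark ETL job on Glue 5.0."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.0",  # selects the Glue 5.0 runtime
        "WorkerType": "G.2X",
        "NumberOfWorkers": workers,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
    }


def create_job(definition):
    """Create the job in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("glue").create_job(**definition)


# Hypothetical example values.
job = glue5_job_definition(
    "nightly-orders-etl",
    "arn:aws:iam::123456789012:role/glue-job-role",
    "s3://example-bucket/scripts/orders_etl.py",
)
```

An existing job would instead have its Glue version changed through an update, and scripts should be checked against the newer Spark, Python, and Java versions before switching.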