Key Takeaways
- Time travel in Apache Iceberg lets you pinpoint exactly which data snapshot produced your best results instead of playing detective with production logs.
- Smart partitioning can slash query times from hours to minutes just by partitioning on the same columns you are already filtering on anyway.
- With schema evolution, you can actually add new features without that sinking feeling of potentially breaking six months of perfectly good ML pipelines.
- ACID transactions eliminate those mystifying training failures that happen when someone else is writing to your table.
- The open source route gives you enterprise-grade reliability without the enterprise price tag or vendor lock-in, plus you get to customize things when your ML team inevitably has special requirements that nobody anticipated.
If you’ve spent any time building ML systems in production, you know the pain. Your model crushes it in development, passes all your offline tests, then mysteriously starts performing like garbage in production. Sound familiar?
Nine times out of ten, it’s a data problem. And not just any data problem: it’s the reproducibility nightmare that keeps data engineers up at night. We’re talking about recreating training datasets months later, tracking down why features suddenly look different, or figuring out which version of your data actually produced that miracle model that everyone’s asking you to “just rebuild really quickly”.
Traditional data lakes? They’re great for storing massive amounts of stuff, but they’re terrible at the transactional guarantees and versioning that ML workloads desperately need. It’s like trying to do precision surgery with a sledgehammer.
That’s where Apache Iceberg comes in. Combined with SparkSQL, it brings actual database-like reliability to your data lake. Time travel, schema evolution, and ACID transactions are the things that should have been obvious requirements from day one but somehow got lost in the shuffle toward “big data at scale”.
The ML Data Reproducibility Problem
The Usual Suspects
Let’s be honest about what’s really happening in most ML shops. Data drift creeps in silently: Your feature distributions shift over time, but nobody notices until your model starts making predictions that make no sense. Feature pipelines are supposed to be deterministic, but they’re not; run the same pipeline twice and you’ll get subtly different outputs thanks to timestamp logic or just plain old race conditions.
Then there’s the version control situation. We’ve gotten pretty good at versioning our code (though let’s not talk about that “temporary fix” from three months ago that’s still in prod). But data versioning? It’s still mostly manual processes, spreadsheets, and prayers.
Here’s a scenario that’ll sound depressingly familiar: Your teammate runs feature engineering on Monday, you run it on Tuesday, and suddenly you’re getting different results from what should be identical source data. Why? Because the underlying tables changed between Monday and Tuesday, and your “point-in-time” logic isn’t as point-in-time as you thought.
The really insidious problems show up during model training. Multiple people access the same datasets, schema changes break pipelines without warning, and concurrent writes create race conditions that corrupt your carefully engineered features. It’s like trying to build a house while someone keeps changing the foundation.
Why Traditional Data Lakes Fall Short
Data lakes were designed for a world where analytics required running batch reports and maybe some ETL jobs. The emphasis was on storage scalability, not transactional integrity. That worked fine when your biggest concern was generating quarterly reports.
But ML is different. It’s iterative, experimental, and demands consistency in ways that traditional analytics never did. When your model training job reads partially written data (because someone else is updating the same table), you don’t just get a wrong report, you get a model that learned from garbage data and will make garbage predictions.
Schema flexibility sounds great in theory, but in practice it often creates schema chaos. Without proper evolution controls, well-meaning data scientists accidentally break downstream systems when they add that “just one more feature” to an existing table. Good luck figuring out what broke and when.
The metadata situation is even worse. Traditional data lakes track files, not logical datasets. So when you need to understand feature lineage or implement data quality checks, you’re basically flying blind.
The Hidden Costs
Poor data foundations create costs that don’t show up in any budget line item. Your data scientists spend most of their time wrestling with data instead of improving models. I’ve seen studies suggesting sixty to eighty percent of their time goes to data wrangling. That’s… not optimal.
When something goes wrong in production – and it will – debugging becomes an archaeology expedition. Which data version was the model trained on? What changed between then and now? Was there a schema modification that nobody documented? These questions can take weeks to answer, assuming you can answer them at all.
Let’s talk about regulatory compliance for a minute. Try explaining to an auditor why you can’t reproduce the exact dataset used to train a model that’s making loan decisions. That’s a conversation nobody wants to have.
Iceberg Fundamentals for ML
Time Travel That Actually Works
Iceberg’s time travel is built on a snapshot-based architecture that maintains complete table metadata for every write operation. Each snapshot represents a consistent view of your table at a specific point in time: schema, partition layout, and the exact set of data files.
For ML folks, this is game-changing. You can query historical table states with simple SQL:
-- This is how you actually solve the reproducibility problem
SELECT * FROM ml_features
FOR SYSTEM_TIME AS OF '2024-01-15 10:30:00'
-- Or if you know the specific snapshot ID
SELECT * FROM ml_features
FOR SYSTEM_VERSION AS OF 1234567890
No more guessing which version of your data produced good results. No more “well, it worked last week” conversations. You can compare feature distributions across time periods, analyze model performance degradation by examining historical data states, and build A/B testing frameworks that actually give you consistent results.
The metadata includes file-level statistics, so query optimizers can make smart decisions about scan efficiency. This isn’t just about correctness, it’s about performance too.
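If you want to see what’s available before you time travel, Iceberg exposes this metadata as queryable tables. Here’s a minimal sketch, assuming the ml_features table from above and an active spark session with an Iceberg catalog configured:
# Every committed write shows up here with its snapshot_id and commit timestamp
spark.sql("""
    SELECT snapshot_id, committed_at, operation, summary
    FROM ml_features.snapshots
    ORDER BY committed_at DESC
""").show(truncate=False)
# File-level statistics the query planner uses for pruning
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes
    FROM ml_features.files
""").show(truncate=False)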
Schema Evolution Without the Drama
Here’s something that should be basic but somehow isn’t: Adding a new column to your feature table shouldn’t require a team meeting and a migration plan. Iceberg’s schema evolution lets you adapt tables to changing requirements without breaking existing readers or writers.
You can add columns, rename them, reorder them, and promote data types, all while maintaining backward and forward compatibility. For ML pipelines, data scientists can safely add new features without coordinating complex migration processes across multiple teams.
The system tracks column identity through unique field IDs, so renaming a column doesn’t break existing queries. Type promotion follows SQL standards (integer to long, float to double), so you don’t have to worry about data loss.
-- Adding a feature is actually this simple
ALTER TABLE customer_features
ADD COLUMN lifetime_value DOUBLE
-- Renaming doesn't break anything downstream
ALTER TABLE customer_features
RENAME COLUMN purchase_frequency TO avg_purchase_frequency
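Type promotion is just as painless. A quick sketch, assuming an active spark session; the purchase_count column here is hypothetical, and Iceberg only allows safe widenings like INT to BIGINT or FLOAT to DOUBLE:
# Widen a hypothetical INT column to BIGINT; existing data files are not rewritten
spark.sql("ALTER TABLE customer_features ALTER COLUMN purchase_count TYPE BIGINT")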
ACID Transactions (Finally!)
ACID support allows ML workloads to safely operate on shared datasets without corrupting data or creating inconsistent reads. Iceberg uses optimistic concurrency control: multiple writers can work simultaneously, conflicts are detected at commit time, and the losing writer retries automatically where possible.
Isolation levels prevent readers from seeing partial writes. So when your training job kicks off, it’s guaranteed to see a consistent snapshot of the data, even if someone else is updating features in real-time.
Transaction boundaries align with logical operations rather than individual file writes. You can implement complex feature engineering workflows that touch many files and partitions, with each table’s changes committing atomically, while maintaining consistency guarantees. No more “oops, only half my update succeeded” situations.
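To make that concrete, here’s a minimal sketch of an atomic feature upsert with MERGE INTO through SparkSQL. The table names are placeholders; the point is that the whole statement lands as a single snapshot or not at all:
# Upsert staged features; the MERGE commits as one snapshot, never partially
spark.sql("""
    MERGE INTO customer_features AS target
    USING staged_feature_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")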
Building Reproducible Feature Pipelines
Partitioning That Actually Makes Sense
Let’s talk partitioning strategy, because this is where a lot of teams shoot themselves in the foot. The secret isn’t complex: Partition on dimensions that align with how you actually query the data.
Most ML workloads follow temporal patterns, training on historical data and predicting on recent data. So temporal partitioning using date or datetime columns is usually your best bet. The granularity depends on your data volume. For high-volume systems, use daily partitions. For smaller datasets, use weekly or monthly partitions.
CREATE TABLE customer_features (
customer_id BIGINT,
feature_timestamp TIMESTAMP,
demographic_features MAP<STRING, DOUBLE>,
behavioral_features MAP<STRING, DOUBLE>,
target_label DOUBLE
) USING ICEBERG
PARTITIONED BY (days(feature_timestamp))
Multi-dimensional partitioning can work well if you have natural business groupings. Customer segments, product categories, geographic regions, etc., reflect how your models actually slice the data.
Iceberg’s hidden partitioning is particularly nice because it maintains partition structures automatically without requiring explicit partition columns in your queries. Write simpler SQL, get the same performance benefits.
But don’t go crazy with partitioning. I’ve seen teams create thousands of tiny partitions thinking it will improve performance, only to discover that metadata overhead kills query planning. Keep partitions reasonably sized (think hundreds of megabytes to gigabytes) and monitor your partition statistics.
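Iceberg’s partitions metadata table makes that monitoring cheap. A minimal sketch, assuming the customer_features table defined above and an active spark session:
# Spot over-partitioning: lots of partitions with tiny record counts is a red flag
spark.sql("""
    SELECT partition, record_count, file_count
    FROM customer_features.partitions
    ORDER BY file_count DESC
""").show(truncate=False)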
Data Versioning for Experiments
Reproducible experiments require tight coupling between data versions and model artifacts. This is where Iceberg’s snapshots really shine. They provide the foundation for implementing robust experiment tracking that actually links model performance to specific data states.
Integration with MLflow or similar tracking systems creates auditable connections between model runs and data versions. Each training job records the snapshot ID of input datasets, precisely reproducing the experimental conditions.
import mlflow
from pyspark.sql import SparkSession

def train_model_with_versioning(spark, snapshot_id):
    # Load data from a specific Iceberg snapshot
    df = (
        spark.read
        .option("snapshot-id", snapshot_id)
        .table("ml_features.customer_features")
    )
    # Log the data version alongside the run in MLflow
    mlflow.log_param("data_snapshot_id", snapshot_id)
    mlflow.log_param("data_row_count", df.count())
    # Continue with model training...
Branching and tagging capabilities enable more sophisticated workflows. Create stable data branches for production model training while continuing development on experimental branches. Use tags to mark significant milestones, such as quarterly feature releases, regulatory checkpoints, etc.
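If your stack supports the branching DDL (roughly Iceberg 1.2+ with the Spark SQL extensions enabled), this is a one-liner per branch or tag. A sketch under those assumptions, with placeholder names:
# Pin a quarterly feature release so training jobs can reference it by name
spark.sql("ALTER TABLE ml_features.customer_features CREATE TAG `q1-feature-release`")
# Branch off for experimental feature work without touching the main history
spark.sql("ALTER TABLE ml_features.customer_features CREATE BRANCH `churn-experiment`")
# Training jobs can then read from the branch explicitly
df = spark.read.option("branch", "churn-experiment").table("ml_features.customer_features")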
Feature Store Integration
Modern ML platforms demand seamless integration between data infrastructure and experiment management. Iceberg tables play nicely with feature stores, leveraging time travel capabilities to serve both historical features for training and point-in-time features for inference.
This combination provides consistent feature definitions across training and serving while maintaining the performance characteristics needed for real-time inference. No more training-serving skew because your batch and streaming feature logic diverged.
from feast import FeatureStore
import pandas as pd

def get_training_features(entity_df, feature_refs, timestamp_col):
    # entity_df must include the entity keys plus a timestamp column (timestamp_col)
    # so Feast can perform the point-in-time join correctly
    try:
        fs = FeatureStore(repo_path=".")
        # Point-in-time join against the Iceberg-backed offline store
        training_df = fs.get_historical_features(
            entity_df=entity_df,
            features=feature_refs,
            full_feature_names=True,
        ).to_df()
        return training_df
    except Exception as e:
        print(f"Error accessing feature store: {e}")
        raise
Production Implementation
Real-World Example: Customer Churn Prediction
Let me walk you through a customer churn prediction system that actually works in production. This system processes millions of customer interactions daily while maintaining full reproducibility.
The data architecture uses multiple Iceberg tables organized by freshness and access patterns. Raw events flow into staging tables, are validated and cleaned, and then aggregated into feature tables optimized for ML access patterns.
-- Staging table for raw events
CREATE TABLE customer_events_staging (
event_id STRING,
customer_id BIGINT,
event_type STRING,
event_timestamp TIMESTAMP,
event_properties MAP<STRING, STRING>,
ingestion_timestamp TIMESTAMP
) USING ICEBERG
PARTITIONED BY (days(event_timestamp))
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'snappy'
)
-- Feature table with optimized layout
CREATE TABLE customer_features (
customer_id BIGINT,
feature_date DATE,
recency_days INT,
frequency_30d INT,
monetary_value_30d DOUBLE,
support_tickets_30d INT,
churn_probability DOUBLE,
feature_version STRING
) USING ICEBERG
PARTITIONED BY (feature_date, bucket(16, customer_id))
Feature engineering pipelines implement incremental processing using Iceberg’s dynamic partition overwrite and MERGE capabilities. This incremental approach minimizes recomputation while maintaining data consistency across different processing schedules.
from pyspark.sql import functions as F

def incremental_feature_engineering(spark, processing_date):
    # Read new events since the last processing run
    new_events = (
        spark.read.table("customer_events_staging")
        .filter(f"event_timestamp >= '{processing_date}'")
    )
    # Compute incremental features (implementation depends on business logic)
    new_features = compute_customer_features(new_events, processing_date)
    # Write back with dynamic partition overwrite so re-runs stay idempotent
    (
        new_features.writeTo("customer_features")
        .option("merge-schema", "true")
        .overwritePartitions()
    )

def compute_customer_features(events_df, processing_date):
    """
    Compute customer features from raw events.
    A real implementation would include business-specific feature engineering logic.
    """
    return (
        events_df.groupBy("customer_id")
        .agg(
            F.count("event_id").alias("event_count"),
            F.max("event_timestamp").alias("last_activity"),
        )
        .withColumn("feature_date", F.lit(processing_date))
    )
Performance Optimization
Query performance in Iceberg benefits from several complementary techniques. File sizing matters: aim for 128MB to 1GB files depending on your access patterns, with smaller files for highly selective queries and larger files for full-table scans.
Parquet provides natural benefits for ML workloads since you typically select column subsets. Compression choice depends on your priorities. Use Snappy for frequently accessed data (faster decompression) and Gzip for better compression ratios on archival data.
Data layout optimization using clustering or Z-ordering can dramatically improve performance for multi-dimensional access patterns. These techniques colocate related data within files, reducing scan overhead for typical ML queries.
-- Optimize table for typical access patterns
ALTER TABLE customer_features
SET TBLPROPERTIES (
'write.target-file-size-bytes' = '134217728', -- 128MB
'write.parquet.bloom-filter-enabled.column.customer_id' = 'true',
'write.parquet.bloom-filter-enabled.column.feature_date' = 'true'
)
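Compaction and Z-ordering themselves run as an Iceberg procedure rather than a table property. A minimal sketch, assuming a catalog named my_catalog and a db.table qualified name; adjust both to your setup:
# Compact small files and cluster data on the columns most queries filter by
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'ml.customer_features',
        strategy => 'sort',
        sort_order => 'zorder(customer_id, feature_date)'
    )
""")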
Metadata caching significantly improves query planning performance, especially for tables with many partitions. Iceberg supports in-memory manifest caching, so repeated query planning doesn’t have to re-read manifest files from object storage every time.
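If your Iceberg version exposes manifest caching (1.1 and later), it’s enabled through catalog properties when you build the Spark session. The property names below are an assumption based on recent releases, and my_catalog is a placeholder:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-ml")
    # Cache manifest files in memory so repeated planning skips object-store reads
    # (assumed property names; check your Iceberg version's docs)
    .config("spark.sql.catalog.my_catalog.io.manifest.cache-enabled", "true")
    .config("spark.sql.catalog.my_catalog.io.manifest.cache.max-total-bytes", str(256 * 1024 * 1024))
    .getOrCreate()
)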
Monitoring and Operations
Production ML systems need monitoring that goes beyond traditional infrastructure metrics. Iceberg’s rich metadata enables sophisticated monitoring approaches that actually help you understand what’s happening with your data.
Data quality monitoring leverages metadata to detect anomalies in volume, schema changes, and statistical distributions. Integration with frameworks like Great Expectations provides automated validation workflows that can halt processing when quality thresholds are violated.
import great_expectations as ge
from great_expectations.dataset import SparkDFDataset

def validate_feature_quality(spark, table_name):
    df = spark.read.table(table_name)
    ge_df = SparkDFDataset(df)
    # Define expectations
    ge_df.expect_table_row_count_to_be_between(min_value=1000)
    ge_df.expect_column_values_to_not_be_null("customer_id")
    ge_df.expect_column_values_to_be_between("recency_days", 0, 365)
    # Validate and return results
    validation_result = ge_df.validate()
    return validation_result.success
Performance monitoring tracks query execution metrics, file scan efficiency, and resource utilization patterns. Iceberg’s query planning metrics provide insights into partition pruning effectiveness and file-level scan statistics.
Don’t forget operational maintenance (e.g., file compaction, expired snapshot cleanup, and metadata optimization). These procedures maintain query performance and control storage costs over time. Set up automated jobs for this stuff – trust me, you don’t want to do it manually.
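All of these are exposed as SparkSQL procedures, so scheduling them is straightforward. A minimal sketch, again with my_catalog and the retention values as placeholders:
# Drop old snapshots (keeps metadata lean and storage costs bounded)
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'ml.customer_features',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 30
    )
""")
# Remove files no longer referenced by any snapshot
spark.sql("CALL my_catalog.system.remove_orphan_files(table => 'ml.customer_features')")
# Rewrite manifests to speed up query planning
spark.sql("CALL my_catalog.system.rewrite_manifests('ml.customer_features')")
One caveat: expiring a snapshot also removes your ability to time travel to it, so align retention windows with your reproducibility and audit requirements.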
Best Practices and Lessons Learned
Choosing Your Table Format
When should you actually use Iceberg instead of other options? It’s not always obvious, and the marketing materials won’t give you straight answers.
Iceberg excels when you need strong consistency guarantees, complex schema evolution, and time travel capabilities. ML workloads particularly benefit from these features because of their experimental nature and reproducibility requirements.
Delta Lake provides similar capabilities with tighter integration into the Databricks ecosystem. If you’re primarily operating within Databricks or need advanced features like liquid clustering, Delta might be your better bet.
Apache Hudi optimizes for streaming use cases with sophisticated indexing. Consider it for ML systems with heavy streaming requirements or complex upsert patterns.
And you know what? Sometimes plain old Parquet tables are fine. If you have simple, append-only workloads with stable schemas, the operational overhead of table formats might not be worth it. Don’t over-engineer solutions to problems you don’t actually have.
Common Pitfalls
Over-partitioning is probably the most common mistake I see. Creating partitions with less than 100MB of data or more than ten thousand files per partition will hurt query planning performance. Monitor your partition statistics and adjust strategies based on actual usage patterns rather than theoretical ideals.
Schema evolution mistakes can break downstream consumers even with Iceberg’s safety features. Implement schema validation in CI/CD pipelines to catch incompatible changes before deployment. Use column mapping features to decouple logical column names from physical storage. It will save you headaches later.
Query anti-patterns often emerge when teams don’t leverage Iceberg’s optimization capabilities. Include partition predicates in WHERE clauses to avoid unnecessary scans. Use column pruning by selecting only required columns instead of SELECT * (yes, I know it’s convenient, but your query performance will thank you).
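Here’s a minimal before-and-after sketch using the customer_features table from earlier, assuming an active spark session:
from pyspark.sql import functions as F

# Anti-pattern: no partition predicate and every column, so far more files are scan candidates
all_of_it = spark.table("customer_features").filter(F.col("customer_id") == 42)

# Better: a feature_date predicate lets Iceberg prune partitions,
# and an explicit column list lets Parquet skip the rest
recent = (
    spark.table("customer_features")
    .filter((F.col("feature_date") >= "2024-01-01") & (F.col("customer_id") == 42))
    .select("customer_id", "feature_date", "recency_days", "monetary_value_30d")
)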
Migration Strategies
Migrating from legacy data lakes requires careful planning. You can’t just flip a switch and expect everything to work. Implement parallel systems during transition periods so you can validate Iceberg-based pipelines against existing systems.
Prioritize critical ML datasets first, focusing on tables that benefit most from Iceberg’s capabilities. Use import functionality to migrate existing Parquet datasets without rewriting data files.
-- Migrate existing table to Iceberg format
CALL system.migrate('legacy_db.customer_features_parquet')
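If you’d rather validate side by side first (the parallel-systems approach above), the snapshot procedure creates an Iceberg table over the existing Parquet files without touching the source. A sketch, assuming a catalog named my_catalog:
# Non-destructive trial: the legacy table keeps working while you validate the Iceberg copy
spark.sql("""
    CALL my_catalog.system.snapshot(
        source_table => 'legacy_db.customer_features_parquet',
        table => 'ml.customer_features_iceberg_trial'
    )
""")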
Query migration means updating SQL to leverage Iceberg features while maintaining backward compatibility. Feature flags or configuration-driven table selection enable a gradual rollout.
Pipeline migration should be phased. Start with batch processing before moving to streaming workflows. Iceberg’s compatibility with existing Spark APIs minimizes code changes during migration.
Wrapping Up
Apache Iceberg and SparkSQL provide a solid foundation for building ML systems that actually work reliably in production. The combination of time travel, schema evolution, and ACID transactions addresses fundamental data management challenges that have plagued ML infrastructure for years.
The investment pays off through improved development velocity, reduced debugging time, and increased confidence in system reliability. Teams consistently report better experiment reproducibility and faster time-to-production for new models.
But success requires thoughtful design decisions around partitioning, schema design, and operational procedures. The technology provides powerful capabilities, but you need to understand both the underlying architecture and the specific requirements of ML workloads to realize the benefits.
As ML systems grow in complexity and business criticality, reliable data foundations become increasingly important. Iceberg represents a mature, production-ready solution that helps organizations to build ML systems with the same reliability expectations as traditional enterprise applications.
Honestly? It’s about time we had tools that actually work the way we need them to work.