I Built the Same Data Pipeline 4 Ways. Here’s What I’d Never Do Again. | HackerNoon

News Room
Published 25 February 2026 · Last updated 1:58 PM

dbt, Airflow, Spark, and plain SQL walked into a bar. Only one of them didn’t cause an incident at 2 am.

It started with a straightforward ask.

Our analytics team needed a daily pipeline: pull raw event data from S3, clean it, join it against a customer dimension table, aggregate it into a revenue summary, and land it in the warehouse by 7 am so the business had numbers for their morning standup.

Simple enough. Except I had four strong opinions in my head and a rare window of time where I could actually test them properly.

So instead of picking one approach and moving on, I built the same pipeline four times. Same source data. Same business logic. Same destination. Different tools, different architectures, different everything else.

Three months later, I have opinions I didn’t have before. Some of them surprised me.

Why This Even Matters

Every data team I’ve worked with has the same unspoken war going on: the engineers want Spark, the analytics engineers want dbt, the data scientists want Airflow DAGs they can control themselves, and somewhere a senior engineer has a 3,000-line SQL file they wrote in 2019 that nobody is allowed to touch because it somehow still works.

These debates are usually settled by whoever speaks loudest in the architecture meeting, not by evidence. I wanted evidence.

The Pipeline

Here’s what the logic actually needed to do:

Read raw clickstream events from S3 (JSON, roughly 40 million rows per day, schema drifts occasionally). Deduplicate events, since the same event_id can appear multiple times due to at-least-once delivery. Filter to only revenue-generating event types. Join against a slowly-changing customer dimension table (Type 2 SCD). Aggregate to daily revenue per customer segment. Write the result to a warehouse table with incremental logic so we’re not reprocessing everything every day.

Nothing exotic. Exactly the kind of pipeline that exists in thousands of companies right now.
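Before looking at the four implementations, the business logic itself can be sketched as a plain Python skeleton. This is illustrative only; the field names (`event_id`, `event_type`, `customer_id`, `amount`) and the revenue event types are assumptions, not the real schema:

```python
from collections import defaultdict

# Assumed, illustrative set of revenue-generating event types.
REVENUE_EVENTS = {"purchase", "subscription_renewal"}

def run_pipeline(raw_events, customer_dim):
    """Sketch of the core steps: dedup, filter, join, aggregate."""
    # Deduplicate on event_id (at-least-once delivery means repeats).
    seen, events = set(), []
    for e in raw_events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            events.append(e)
    # Keep only revenue-generating event types.
    events = [e for e in events if e["event_type"] in REVENUE_EVENTS]
    # Join against the customer dimension (simplified: current segment
    # only, ignoring the Type 2 SCD date-bounding) and aggregate
    # revenue per segment.
    revenue = defaultdict(float)
    for e in events:
        segment = customer_dim.get(e["customer_id"], "unknown")
        revenue[segment] += e["amount"]
    return dict(revenue)
```

Each of the four attempts below is, in essence, this function expressed in a different dialect with different guarantees around it.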

Attempt 1: Plain SQL

I’ll be honest. I almost didn’t include this one. It felt too old-fashioned to take seriously.

I was wrong to feel that way.

The SQL version took the least time to write. The business logic was completely transparent. Anyone who could read SQL could follow it. The deduplication was a ROW_NUMBER() window function. The SCD join was a date-bound BETWEEN. The incremental logic was a WHERE event_date = CURRENT_DATE - 1.

It ran in 4 minutes in our warehouse. Costs almost nothing. When it breaks, the error message tells you exactly which line failed.
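To make the dedup concrete, here is the ROW_NUMBER() pattern run against SQLite for illustration; table and column names are assumptions, not the production warehouse's:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; requires SQLite >= 3.25
# for window function support.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (event_id INT, event_date TEXT, amount REAL);
INSERT INTO raw_events VALUES
  (1, '2026-02-24', 10.0),
  (1, '2026-02-24', 10.0),  -- duplicate from at-least-once delivery
  (2, '2026-02-24', 5.0);
""")

# Keep exactly one row per event_id: number the copies, take the first.
rows = conn.execute("""
SELECT event_id, amount FROM (
  SELECT event_id, amount,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_date) AS rn
  FROM raw_events
) WHERE rn = 1
""").fetchall()
```

The same few lines of SQL carry the whole dedup step, which is exactly why this version was the fastest to write and the easiest to read.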

What went wrong: schema drift broke it on day 11. A new event type arrived with a field that had a different data type than expected, and the pipeline failed silently. It completed successfully. It just dropped those rows. I didn’t know for two days. There was no alerting, no data quality check, no contract enforcement. Just missing revenue numbers that everyone assumed were a business dip.

That was the lesson. SQL is not the problem. The lack of structure around SQL is the problem. When something goes wrong, you’re debugging alone with SELECT * and a prayer.

Would I use it again? For stable, well-understood pipelines with strong testing upstream, yes. For anything touching raw external data, absolutely not without wrapping it in something with validation and observability.

Attempt 2: Apache Airflow

Airflow felt like the responsible adult choice. Scheduling, retries, dependencies, a UI you can actually look at. What’s not to like?

Building the DAG took longer than I expected. Not because Airflow is hard, but because I kept running into the gap between the Python that orchestrates the work and the Python that does the work. Airflow is an orchestrator, not an execution engine. That’s the right design, but it meant my actual transformation logic lived in a jumble of SQL strings embedded in Python operators, and that combination is harder to test than either pure SQL or pure Python would be.

The SCD in particular got ugly. Expressing slowly-changing dimension logic in a way that’s both correct and readable inside a PythonOperator is an exercise in patience.

What actually worked: the observability was genuinely excellent. When the S3 read failed on day 6 because of a permissions issue, I had an email in my inbox before I’d finished my coffee. The retry logic handled two transient failures automatically. The audit trail told me exactly when each task ran, how long it took, and what the inputs were.

What went wrong: dependency management became a burden fast. The Airflow environment needed specific package versions, and two of those packages conflicted with each other in ways that took an entire afternoon to diagnose. And the DAG code became hard to read quickly. By the end, it was 380 lines of Python to express logic that took 90 lines of SQL.

Also, testing Airflow DAGs locally is a genuinely miserable experience. I spent more time fighting the local environment than writing business logic.
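One mitigation I'd now insist on: keep the transformation in a plain, importable function and let the operator do nothing but call it. A minimal sketch, with illustrative names that are not the real pipeline's:

```python
# transformations.py -- pure logic, testable without any Airflow environment.

def filter_revenue_events(events, revenue_types=frozenset({"purchase"})):
    """Keep only revenue-generating events; trivially unit-testable."""
    return [e for e in events if e["event_type"] in revenue_types]

# In the DAG file, the operator would only wire this into the schedule,
# e.g. (hypothetical task and loader names):
#
#   PythonOperator(
#       task_id="filter_revenue",
#       python_callable=lambda: filter_revenue_events(load_events()),
#   )
#
# Local tests then import transformations.py and never touch Airflow.
```

The design point is that the function above has no Airflow imports at all, so the miserable local-DAG-testing experience applies only to the thin wiring layer.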

Would I use it again? Yes, but as pure orchestration, calling out to dbt or a separate transformation layer. Never again as the place where transformation logic lives.

Attempt 3: Apache Spark

Spark is the one who impresses people in interviews. It’s also the one that humbled me the most.

The PySpark code was actually elegant. Dataframe operations for deduplication, a broadcast join for the customer dimension since that table fits in memory, and windowing functions for the SCD logic. When it worked, it was fast and expressive.

For 40 million rows a day, Spark is overkill. I knew this going in. But I wanted to understand the real cost of that overkill, not just assume it.

The real cost: cluster startup time added 4 to 6 minutes to every run before a single row was processed. My SQL version ran in 4 minutes total. The Spark version ran in 3 minutes of actual processing plus 5 minutes of cluster initialization. So it was “faster” in the wrong way.

The operational overhead was the bigger problem. Spark has opinions about memory management that it enforces aggressively and explains poorly. I hit an out-of-memory error on day 8 during a backfill run, and diagnosing it required reading Spark executor logs in a UI that feels like it was designed to discourage you.

What actually worked: the schema enforcement was excellent. I defined a schema explicitly, and when the drift event hit on day 11, the same one that silently broke the SQL version, Spark threw an error immediately and loudly. Nothing got silently dropped.
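The behavior that saved the Spark version, failing loudly on drift instead of dropping rows, can be approximated in plain Python. A sketch under an assumed schema (field names and types are hypothetical):

```python
# Hypothetical schema contract for the event payload.
SCHEMA = {"event_id": int, "event_type": str, "amount": float}

def enforce_schema(rows):
    """Raise immediately on type drift rather than silently dropping rows."""
    for i, row in enumerate(rows):
        for field, expected in SCHEMA.items():
            if not isinstance(row.get(field), expected):
                raise TypeError(
                    f"row {i}: field {field!r} is "
                    f"{type(row.get(field)).__name__}, expected {expected.__name__}"
                )
    return rows
```

A check like this in front of the plain SQL version would have turned the day-11 silent revenue gap into an immediate, visible failure.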

Would I use it again? For this use case, no. For data at 10x or 100x this volume, or for complex transformations that a SQL engine can’t express cleanly, yes. Spark earns its complexity at scale. At 40 million rows, it’s a sports car you’re using to drive to the grocery store.

Attempt 4: dbt

I’ll admit my bias upfront. I came into this expecting dbt to win. I was not entirely right, and the ways I was wrong were instructive.

The dbt version felt the most like software engineering. Models are files. Files are version-controlled. Tests are declarative. The documentation is generated automatically. When I wrote the revenue aggregation model, I could see exactly which upstream models it depended on, and dbt would refuse to run it if those models hadn’t succeeded first.

The SCD join was clean. dbt’s snapshot feature handles Type 2 SCDs with a configuration block, no manual date-bounding required. The incremental model logic was readable and explicit. The deduplication was a simple macro.
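For flavor, here is roughly what the incremental model looked like; this is a sketch with illustrative model and column names, not the real project:

```sql
-- models/daily_revenue.sql (illustrative names)
{{ config(materialized='incremental') }}

SELECT
    event_date AS revenue_date,
    segment,
    SUM(amount) AS revenue
FROM {{ ref('stg_revenue_events') }}
{% if is_incremental() %}
  -- On incremental runs, only process days newer than what's landed.
  WHERE event_date > (SELECT MAX(revenue_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
```

Note that the strict `>` filter above is exactly what bit me later: a common mitigation for late-arriving events is a lookback window (reprocessing the last few days rather than only new ones), which trades some recomputation for correctness.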

What actually worked: the governance story was the best of any approach. Column-level descriptions, source freshness alerts, test coverage on every model, and a lineage graph I could show to a non-technical stakeholder. When the schema drift hit, a dbt test caught it before the model ran.

What went wrong: dbt is opinionated and sometimes its opinions fought mine. Incremental logic that seems simple in concept has edge cases that the is_incremental() macro handles in ways that aren’t always obvious. I hit a subtle issue where late-arriving events from the previous day weren’t being reprocessed correctly and debugging it required understanding dbt internals I hadn’t needed to know before.

The other thing: dbt runs SQL. It does not run Python. For anything that genuinely needs Python (complex ML feature engineering, calling external APIs mid-pipeline), you're reaching for dbt Python models or pushing that logic elsewhere. That boundary trips people up.

Would I use it again? Yes, and it’s now my default for analytical pipelines in a warehouse environment. But I’d pair it with an orchestrator for scheduling and alerting, and I’d be honest with my team that dbt’s learning curve is real and the documentation, while good, assumes you already think a certain way about data modeling.

The Scorecard

| | Build Time | Debuggability | Schema Safety | Operational Overhead | Governance |
|---|---|---|---|---|---|
| Plain SQL | Fast | Low | None | Very Low | None |
| Airflow | Slow | High | Low | High | Low |
| Spark | Medium | Low | High | Very High | Low |
| dbt | Medium | Medium | High | Medium | High |

What I’d Never Do Again

Never embed transformation logic inside an Airflow DAG. Airflow is an orchestrator. Treat it like one. The moment your SQL lives inside a Python string inside an operator, you’ve created something harder to test than SQL and harder to debug than Python. Keep orchestration and transformation separate.

Never use Spark below a certain data volume threshold without justifying it in writing. Not because Spark is bad, but because the operational tax is real and it compounds. Every new team member needs to learn it. Every environment needs to support it. Every incident takes longer to diagnose. That’s fine if you’re processing petabytes. It’s a waste if you’re not.

Never deploy plain SQL against raw external data without schema contracts. SQL is fast to write and easy to read. It is not self-defending. One upstream schema change and you’re flying blind. If you’re going to use SQL, pair it with something that enforces contracts and makes failures loud.

Never pick a tool based on what’s impressive. Spark impresses people. It impressed me in the demo phase. But the right tool is the one your team can operate, debug, and extend at 2 am when something breaks, and the business is waiting on numbers. Boring infrastructure is underrated.

The Real Lesson

The pipeline wasn’t the hard part. It never is.

The hard part is what happens after the pipeline runs. Is the data trustworthy? Would you know if it wasn’t? Can someone else on your team understand what it’s doing? Can you audit it when the business asks why a number changed?

None of the four approaches nailed all of that on their own. The one that came closest was dbt, not because it’s magic, but because it’s designed around the assumption that data pipelines exist in organizations and organizations need documentation, testing, and lineage more than they need raw speed.

Three months of building the same thing four ways taught me less about tools than I expected and more about what questions to ask before picking one.

Start with: what does “this broke” look like at 2 am, and who’s going to fix it?

The answer tells you everything about which tool you should choose.
