How I’d Design a Data Platform From Scratch in 2026 | HackerNoon

News Room · 19 February 2026

Last quarter, I got the rare opportunity every data engineer secretly dreams about: a blank slate. A growing Series B startup hired me to build their data platform from scratch. No legacy Airflow DAGs to untangle. No mystery S3 buckets full of abandoned Parquet files. No “we’ve always done it this way.”

Just a question: What do we need to build so this company can make decisions with data?

This is the story of what I chose, what I skipped, and—most importantly—why. If you’re starting fresh or considering a major re-architecture, I hope my reasoning saves you some of the dead ends I’ve hit in previous roles.

The Starting Context

Before I made any technology decisions, I spent two weeks just listening. I talked to the CEO, the head of product, the sales lead, the marketing team, and the three backend engineers. I wanted to understand:

  • What decisions are people trying to make today without data?
  • Where does data currently live?
  • What’s the expected data volume in 12 months? In 3 years?
  • What’s the team budget and headcount reality?

Here’s what I learned. The company had a PostgreSQL production database, a handful of third-party SaaS tools (Stripe, HubSpot, and Segment), about 50 million events per day flowing through Segment, and exactly one person who would be managing this platform day-to-day: me, with a part-time analytics hire planned for Q3.

That last point shaped every decision. This wasn’t a platform for a 15-person data team. It had to be operable by one engineer without turning into a second full-time job just to keep the lights on.

Layer 1: Ingestion—Keep It Boring

The temptation when you’re building from scratch is to reach for the most powerful tools. I’ve made that mistake before—setting up Kafka for a workload that could’ve been handled by a cron job and a Python script.

This time I went boring on purpose.

For SaaS sources (Stripe, HubSpot): I chose Fivetran. Yes, it’s a managed service, and it costs money. But writing and maintaining custom API connectors for a dozen SaaS tools is a full-time job I didn’t have headcount for. Fivetran syncs reliably, handles API pagination and rate limiting, manages schema changes, and pages me only when something genuinely breaks.

For event data: Segment was already in place, so I configured it to dump raw events directly into the warehouse. No custom event pipeline. No Kafka. Not yet.

For the production database: I set up a simple Change Data Capture (CDC) pipeline using Airbyte, replicating key PostgreSQL tables into the warehouse on a 15-minute schedule. Airbyte’s open-source version ran on a small EC2 instance and handled our volume without breaking a sweat.
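For a sense of how lightweight that setup is, here's roughly what triggering one of those Airbyte syncs looks like against the OSS config API. The host, connection ID, and credentials below are placeholders, not our real deployment, and in practice the 15-minute cadence lives in the orchestrator rather than a one-off script like this.

```python
# Hypothetical sketch: trigger an Airbyte (OSS) connection sync via its config API.
# The host and connection ID are placeholders.
import requests

AIRBYTE_HOST = "http://airbyte.internal:8000"  # assumption: the small EC2 instance
POSTGRES_CDC_CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

def trigger_postgres_cdc_sync() -> dict:
    """Ask Airbyte to run the PostgreSQL -> warehouse CDC connection once."""
    resp = requests.post(
        f"{AIRBYTE_HOST}/api/v1/connections/sync",
        json={"connectionId": POSTGRES_CDC_CONNECTION_ID},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # metadata for the sync job Airbyte just created
```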

What I deliberately skipped: Kafka, Flink, Spark Streaming, and anything with the word “real-time” in the sales pitch. Our business didn’t need sub-second data freshness. Fifteen-minute latency was more than sufficient for every use case anyone could articulate. I’ve seen too many teams build a real-time streaming infrastructure and then use it to power a dashboard that someone checks once a day.

The rule I followed: don’t build for the workload you imagine. Build for the workload you have, with a clear upgrade path to the workload you expect.

Layer 2: Storage and Compute—The Warehouse Wins (For Now)

After my experience with lakehouse hype (I’ve written about this before), I made a pragmatic call: BigQuery as the central warehouse.

Why BigQuery over a lakehouse setup?

Operational simplicity. BigQuery is serverless. No clusters to size, no Spark jobs to tune, and no infrastructure to manage. For a one-person data team, this matters enormously. Every hour I spend managing infrastructure is an hour I’m not spending on modeling data or building dashboards.

Cost predictability. With on-demand pricing and a modest reservation for our known workloads, our monthly bill was predictable and reasonable for our data volume.

The ecosystem. BigQuery integrates natively with Fivetran, dbt, Looker, and basically every BI tool. No glue code needed.

What I’d reconsider: If our event volume grows past 500 million events per day, or if the ML team (currently nonexistent) needs to run training jobs on raw event data, I’d revisit this. The upgrade path would be landing raw data in GCS as Parquet (via Segment’s GCS destination) and layering Iceberg on top, while keeping BigQuery as the serving layer. But that’s a problem for future me, and I’m not going to build for it today.

Layer 3: Transformation—dbt, No Contest

This was the easiest decision. dbt Core, running on a schedule, transforms raw data into analytics-ready models inside BigQuery.

I’ve used Spark for transformations. I’ve used custom Python scripts. I’ve used stored procedures (dark times). For structured analytical transformations at our scale, nothing comes close to dbt in terms of productivity, testability, and maintainability.

Here’s how I structured the project:

Staging models: one-to-one with source tables. Light cleaning only: renaming columns, casting types, and filtering out test data. Every staging model has schema tests for primary-key uniqueness and not-null on critical fields.

Intermediate models: where the business logic lives. Joining events to users, calculating session durations, and mapping Stripe charges to product plans. This is where the complexity hides, so I document every model with descriptions and column-level docs.

Mart models: final, business-friendly tables organized by domain, such as mart_finance.monthly_revenue, mart_product.daily_active_users, and mart_sales.pipeline_summary. These are what analysts and dashboards query directly.

What made this work: I wrote dbt tests from day one. Not retroactively, not “when I have time.” From the first model. unique, not_null, accepted_values, and relationships tests caught three upstream data issues in the first month alone — before any analyst ever saw the bad data.
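To make concrete what those tests actually do, here is roughly the kind of query a dbt unique plus not_null test boils down to, sketched as a standalone Python check against BigQuery. The table and column names are hypothetical, and this isn't how we run the tests (dbt generates and executes the SQL itself); it's just the shape of the check.

```python
# Illustrative only: the SQL a dbt unique/not_null test roughly compiles to,
# run directly against BigQuery. In the real platform, dbt does this for us.
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

TABLE = "analytics.stg_stripe__charges"  # hypothetical staging table
PK = "charge_id"                         # hypothetical primary key column

failing_rows_sql = f"""
    SELECT {PK}, COUNT(*) AS n
    FROM `{TABLE}`
    GROUP BY {PK}
    HAVING COUNT(*) > 1 OR {PK} IS NULL
"""

failures = list(client.query(failing_rows_sql).result())
if failures:
    raise ValueError(f"{len(failures)} duplicate or null {PK} values in {TABLE}")
```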

Layer 4: Orchestration — Dagster Over Airflow

This is where I broke from my own habits. I’ve used Airflow for years. I know it deeply. But for a greenfield project in 2026, I chose Dagster.

Why the switch:

Dagster’s asset-based model maps to how I actually think about data. Instead of defining “tasks that run in order,” I define “data assets that depend on each other.” It sounds like a subtle difference, but it changes how you debug problems. When something breaks, I look at the asset graph and immediately see what’s affected. In Airflow, I’d be tracing task dependencies across DAG files.
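A minimal sketch of what that asset-based style looks like, with hypothetical asset names standing in for our real ones (this is not our production code):

```python
# Minimal Dagster sketch: assets that depend on each other, refreshed on a schedule.
# Asset names are placeholders, not our actual definitions.
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def raw_stripe_charges() -> None:
    """Upstream asset: in our setup this table is landed by an ingestion sync."""
    ...

@asset(deps=[raw_stripe_charges])
def monthly_revenue() -> None:
    """Downstream asset: because the dependency is declared, the asset graph
    shows exactly what is affected when the upstream asset fails."""
    ...

refresh_job = define_asset_job("refresh_all_assets", selection="*")

defs = Definitions(
    assets=[raw_stripe_charges, monthly_revenue],
    schedules=[ScheduleDefinition(job=refresh_job, cron_schedule="*/15 * * * *")],
)
```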

Dagster’s built-in observability is excellent. Asset materialization history, freshness policies, and data quality checks are first-class features, not bolted-on extras.

The development experience is better. Local testing, type checking, and the dev UI make iteration fast. In Airflow, my dev loop was: edit DAG file, wait for scheduler to pick it up, check the UI for errors, repeat. In Dagster, I run assets locally and see results immediately.

What I miss about Airflow: The community is massive. Every problem has a Stack Overflow answer. Dagster’s community is growing fast, but it’s not at that scale yet. There have been a few times I’ve had to dig through source code instead of finding a quick answer online.

My orchestration setup: Dagster Cloud (the managed offering) runs my dbt models, Airbyte syncs, and a handful of custom Python assets. Total orchestration cost is less than what I’d spend on the EC2 instances to self-host Airflow, and I don’t have to manage a metadata database or worry about scheduler reliability.

Layer 5: Serving and BI — Invest in the Semantic Layer

For BI, I chose Looker (now part of Google Cloud). Controversial in some circles because of the LookML learning curve, but here’s why it was the right call for us.

The semantic layer is the product. LookML lets me define metrics, dimensions, and relationships in version-controlled code. When the CEO asks “what’s our MRR?” and the sales lead asks the same question, they get the same number. Not because they’re looking at the same dashboard, but because the metric is defined once, in one place.

I’ve lived through the alternative: a BI tool where anyone can write their own SQL, and three people calculate revenue three different ways. That’s not a tooling problem. It’s a semantic layer problem. And Looker solves it better than any tool I’ve used.

What I’d also consider: If I were optimizing for cost or for a less technical analyst team, I’d look at Metabase (open source, SQL-native, dead simple) or Evidence (code-based BI, great for a data-engineer-heavy team). Looker’s pricing isn’t cheap, and the LookML abstraction is overkill if your team just wants to write SQL and make charts.

Layer 6: The Stuff Nobody Talks About

Here’s the part that doesn’t show up in architecture diagrams but made the biggest difference.

A Data Catalog From Week One

I set up a lightweight data catalog using dbt’s built-in docs site, deployed as a static page on our internal wiki. Every mart model has a description. Every column has a definition. It took about 30 minutes per model to document, and it’s already saved hours of “hey, what does this column mean?” Slack messages.

An Incident Response Process

When data breaks — and it does — there’s a clear process. An alert fires in Slack. I acknowledge it within 30 minutes during business hours. I post a status update in the #data-incidents channel. When it’s resolved, I write a brief post-mortem: what broke, why, and what I’m doing to prevent it.
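Mechanically, "an alert fires in Slack" is nothing fancier than a failed check posting to an incoming webhook, along the lines of the sketch below. The webhook URL and message wording are placeholders, not our actual configuration.

```python
# Sketch: post a data-incident alert to Slack via an incoming webhook.
# The webhook URL is a placeholder; the real one lives in a secrets store.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_incident(check_name: str, details: str) -> None:
    """Send a short, structured message to the #data-incidents channel."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":rotating_light: {check_name} failed\n{details}"},
        timeout=10,
    ).raise_for_status()
```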

This sounds like overkill for a one-person team. It’s not. It builds trust. When stakeholders see that data issues are handled transparently and quickly, they trust the platform. Trust is the most important metric a data platform has.

An Architecture Decision Record (ADR) for Every Major Choice

Every technology decision I made — BigQuery over Snowflake, Dagster over Airflow, Fivetran over custom connectors — has a one-page ADR in our repo. Each one explains the context, the options I considered, the decision, and the trade-offs.

When the next data engineer joins (hopefully soon), they won’t have to reverse-engineer why things are the way they are. They can read the ADRs, understand the reasoning, and make informed decisions about what to change.

What I’d Do Differently Next Time

No build is perfect. Three months in, here’s what I’d adjust.

I’d set up data contracts earlier. I wrote about schema validation in a previous article, and I should’ve practiced what I preached. We had two incidents where backend engineers changed column types in PostgreSQL without telling me. A formal data contract between the backend team and the data platform would’ve caught this at deploy time instead of at sync time.
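If I were adding that contract today, it could be as simple as a CI check in the backend repo that compares the columns the platform depends on against what PostgreSQL actually exposes. Everything below (table, columns, env var) is a hypothetical sketch, not an agreed contract format.

```python
# Hypothetical data-contract check, run in the backend repo's CI:
# fail the build if a column the data platform relies on changes type or disappears.
import os
import psycopg2

# The "contract": columns the warehouse models rely on, and their expected types.
USERS_CONTRACT = {
    "id": "bigint",
    "email": "text",
    "created_at": "timestamp with time zone",
    "plan": "text",
}

def check_users_contract() -> None:
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # placeholder env var
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = 'users'"
        )
        actual = dict(cur.fetchall())
    problems = [
        f"{col}: expected {expected}, got {actual.get(col, 'MISSING')}"
        for col, expected in USERS_CONTRACT.items()
        if actual.get(col) != expected
    ]
    if problems:
        raise SystemExit("users table breaks the data contract:\n" + "\n".join(problems))

if __name__ == "__main__":
    check_users_contract()
```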

I’d invest in a reverse ETL tool sooner. The sales team wanted enriched data pushed back into HubSpot within the first month. I hacked together a Python script, but a tool like Census or Hightouch would’ve been cleaner and more maintainable.
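For a sense of what that kind of script involves (this is a rough illustration, not the one I actually wrote): pull an enriched mart table out of BigQuery and patch the matching HubSpot records over the CRM API. The table, property names, and token handling are placeholders, and a proper reverse ETL tool handles batching, retries, and deletes far better.

```python
# Rough reverse-ETL sketch: push enriched account data from a BigQuery mart
# table into HubSpot company records. Names and env vars are placeholders.
import os
import requests
from google.cloud import bigquery

HUBSPOT_TOKEN = os.environ["HUBSPOT_PRIVATE_APP_TOKEN"]  # placeholder

def sync_enriched_accounts() -> None:
    client = bigquery.Client()
    rows = client.query(
        "SELECT hubspot_company_id, lifetime_value, active_seats "
        "FROM mart_sales.enriched_accounts"  # hypothetical mart table
    ).result()
    for row in rows:
        requests.patch(
            f"https://api.hubapi.com/crm/v3/objects/companies/{row.hubspot_company_id}",
            headers={"Authorization": f"Bearer {HUBSPOT_TOKEN}"},
            json={"properties": {
                "lifetime_value": str(row.lifetime_value),
                "active_seats": str(row.active_seats),
            }},
            timeout=30,
        ).raise_for_status()
```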

I’d timebox the BI tool decision. I spent three weeks evaluating BI tools. In hindsight, two of those weeks were diminishing returns. The key requirements were clear after week one. I should’ve committed sooner and iterated.

The Full Stack, Summarized

| Layer | Tool | Why |
|---|---|---|
| SaaS Ingestion | Fivetran | Reliable, zero-maintenance |
| Database Replication | Airbyte (OSS) | Flexible CDC, cost-effective |
| Event Collection | Segment → BigQuery | Already in place, direct sync |
| Storage & Compute | BigQuery | Serverless, simple, sufficient |
| Transformation | dbt Core | Best-in-class for SQL transforms |
| Orchestration | Dagster Cloud | Asset-based, great DX, managed |
| BI & Semantic Layer | Looker | Metric definitions in code |
| Data Quality | dbt tests + Elementary | Automated checks at every layer |
| Documentation | dbt docs | Version-controlled, always current |

Total monthly cost: Under $3,000 for the entire platform, supporting ~50M events/day and a growing team of data consumers.

The Takeaway

The best data platform isn’t the one with the most sophisticated architecture. It’s the one that actually gets used, actually gets trusted, and can actually be maintained by the team you have — not the team you wish you had.

If I could distill everything I learned from this build into one principle, it’s this: pick boring tools, invest in trust, and leave yourself an upgrade path. The fancy stuff can come later. The fundamentals can’t wait.
