How I Built a SOC 2-Compliant Cloud-Native Data Lake for Retirement Accounts | HackerNoon

News Room · Published 9 April 2026, last updated 9:50 PM

Let me describe the situation I walked into: a retirement plan provider managing hundreds of thousands of 401(k) participant accounts, with data spread across record-keeping engines, CRM platforms, and partner APIs. Product teams were running analytics from spreadsheet exports. Compliance reports took three days of manual work. And every SOC 2 audit was managed through a combination of compensating controls and retrospective documentation that made the engineering team nervous for good reason.

The assignment was to build a unified cloud-native data platform that could satisfy SOC 2 Type II requirements without sacrificing engineering velocity. Here is what I built, why I made the choices I did, and what I would do differently.

The Design Constraint That Changed Everything

Before selecting a single AWS service, I reframed the problem. Most teams approach compliance architecture by mapping their planned components to SOC 2’s Trust Services Criteria and checking boxes. I treated the Trust Services Criteria as a threat model instead.

Four threat scenarios shaped every architectural decision: (1) unauthorized lateral movement across data zones; (2) PII exposure through analytics tooling; (3) silent schema drift from upstream source systems; (4) tampering with historical audit records. If I could design a system that made each of those scenarios either impossible or immediately detectable, SOC 2 compliance would be a natural consequence—not a bolt-on.

That reframing matters more than it sounds. It produces different architecture than compliance-checklist design does. Specifically, it drives you toward systems that generate audit evidence automatically, rather than systems that require evidence to be assembled afterward.

Layer 1: Ingestion—Chain of Custody Starts Here

I used AWS Glue for batch extraction from structured sources and AWS Database Migration Service (DMS) for change data capture from transactional systems. Every Glue job is parameterized to produce a structured audit record at completion: source system identity, job run ID, extraction timestamp, and row counts. These records land in a separate audit log bucket before the raw data is written anywhere else.
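The audit record described above can be sketched as a small helper. This is a minimal illustration, not the production schema: the field names, the key layout, and the source-system identifier are assumptions.

```python
import json
from datetime import datetime, timezone

def build_audit_record(source_system: str, job_run_id: str, row_count: int) -> dict:
    """Structured audit record emitted when an extraction job completes.
    Field names here are illustrative, not the exact production schema."""
    return {
        "source_system": source_system,
        "job_run_id": job_run_id,
        "extraction_timestamp": datetime.now(timezone.utc).isoformat(),
        "row_count": row_count,
    }

def audit_record_key(record: dict) -> str:
    """S3 key under which the record lands in the audit-log bucket,
    partitioned by source system and job run (layout is hypothetical)."""
    return f"audit/{record['source_system']}/{record['job_run_id']}.json"

# Serialize the record so it can land in the audit bucket before any raw data is written.
record = build_audit_record("recordkeeper-core", "jr_0001", 125_000)
payload = json.dumps(record)
```

The point is the ordering: the audit record is written first, so even a job that dies mid-write leaves evidence that it ran.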

Raw data lands in Amazon S3 with Object Lock enabled in Compliance mode. This is not optional: Compliance mode prevents modification or deletion even by the bucket owner or AWS support. For forensic needs—and for auditors who want to verify that historical data has not been altered—this is the foundation everything else rests on.
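A write to the Raw zone under Compliance-mode Object Lock looks roughly like the sketch below: the kwargs are for boto3's `s3.put_object`, which accepts `ObjectLockMode` and `ObjectLockRetainUntilDate`. The bucket name, key layout, and retention period are placeholders, and the bucket itself must have been created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone

def raw_zone_put_kwargs(bucket: str, key: str, retention_days: int) -> dict:
    """Arguments for s3.put_object that place an object under Object Lock
    in Compliance mode for the given retention period. Names and the
    retention window are illustrative."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=retention_days)
    return {
        "Bucket": bucket,
        "Key": key,
        "ObjectLockMode": "COMPLIANCE",  # immutable even to the bucket owner
        "ObjectLockRetainUntilDate": retain_until,
    }

# Hypothetical seven-year retention for a raw extraction batch:
kwargs = raw_zone_put_kwargs("raw-zone-bucket", "dms/plans/2026/04/09/batch.parquet", 2555)
# s3.put_object(Body=data, **kwargs)  # the actual AWS call is omitted in this sketch
```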

Layer 2: Orchestration—State Machines as Audit Trails

I chose AWS Step Functions over a traditional workflow orchestrator for one reason: execution history. Step Functions retains the full input/output state at every step of every execution. That means I can show an auditor exactly what data entered any stage of any pipeline, on any date, without reconstructing it from logs. CloudTrail provides the API-level audit trail—every AWS API call across the platform is logged with caller identity, timestamp, and parameters. Together, Step Functions and CloudTrail give you end-to-end traceability from a scheduled trigger to a written S3 object.
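As a rough sketch of what such a pipeline looks like in Amazon States Language, here is a minimal extract-validate-publish state machine built as a Python dict. The state names, the quality flag in `$.quality.passed`, and the Glue resource integrations are placeholders, not the production definition.

```python
import json

# Minimal ASL sketch of one pipeline: extract, validate, then either publish
# or fail into a dead-letter path. Names and resources are illustrative.
pipeline_definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Next": "Validate",
        },
        "Validate": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Next": "QualityGate",
        },
        "QualityGate": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.quality.passed", "BooleanEquals": True, "Next": "Publish"}
            ],
            "Default": "DeadLetter",
        },
        "Publish": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "End": True,
        },
        "DeadLetter": {"Type": "Fail", "Error": "QualityCheckFailed"},
    },
}

asl_json = json.dumps(pipeline_definition)
```

Because every execution of this machine retains its full input/output history, the state machine definition doubles as a map of exactly where audit evidence accumulates.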

Layer 3: Storage and Governance—Lake Formation as the Authorization Plane

The storage architecture uses three S3 bucket zones: Raw (immutable source data), Curated (validated, schema-enforced), and Refined (business-ready, PII-scrubbed). I made an early decision that has paid off significantly: all access control lives in AWS Lake Formation, not in Glue jobs or Redshift views.

Lake Formation enforces access at the database, table, and column level using tag-based policies. Tags are applied to columns at classification time: PII, Sensitive, Internal, Public. When an analyst queries the Refined zone through Redshift Spectrum or QuickSight, Lake Formation intercepts the query and filters out any columns the analyst is not authorized to see. No cleverly crafted SQL can surface raw SSNs, because the authorization decision happens before the storage layer responds. This satisfies SOC 2 CC6.1 without relying on developer discipline.
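A tag-based grant of this kind is expressed to the Lake Formation `grant_permissions` API roughly as below. This builds only the request body; the role ARN, tag key, and tag values are stand-ins for the real classification scheme.

```python
def analyst_grant_request(role_arn: str) -> dict:
    """Request body for lakeformation.grant_permissions: the analyst role may
    SELECT only from columns tagged Public or Internal. PII and Sensitive are
    simply absent from the expression, so those columns are never visible.
    Tag names and the role ARN are illustrative."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": "Classification", "TagValues": ["Public", "Internal"]}
                ],
            }
        },
        "Permissions": ["SELECT"],
    }

request = analyst_grant_request("arn:aws:iam::123456789012:role/analyst")
# lakeformation.grant_permissions(**request)  # actual call omitted in this sketch
```

The design choice worth noting: access is granted to a tag expression, not to tables, so newly classified columns inherit the right policy with no per-table grants.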

Layer 4: Transformation—Quality as a Compliance Control

Here is a framing that has changed how I think about data quality: quality check results are evidence of compliance, not just operational metrics. Under SOC 2 Processing Integrity, auditors want to see not only that your data is correct, but that your system would have detected and isolated incorrect data if it had appeared. That means quality check results must be stored as queryable records—not just pipeline logs.

I implemented this using AWS Glue Data Quality for infrastructure-level checks (row counts, null rates, referential integrity) and dbt tests for model-level semantic validation. Every job that fails a quality check writes its failure record to a dedicated results table and routes to a dead-letter queue. The job stops; it does not write bad data to the Curated or Refined zones. That fail-visible design is what makes quality a compliance control rather than an engineering nicety.
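The fail-visible pattern can be reduced to a few lines of logic. This sketch is in the spirit of a Glue Data Quality ruleset rather than the real service API: the checked column, the null-rate threshold, and the in-memory "table" and "queue" are all assumptions standing in for the production results table and dead-letter queue.

```python
def run_quality_checks(batch: list[dict], max_null_rate: float = 0.01) -> dict:
    """Infrastructure-level checks: non-empty batch and a null-rate bound on a
    required column. Column name and threshold are illustrative."""
    total = len(batch)
    nulls = sum(1 for row in batch if row.get("participant_id") is None)
    null_rate = nulls / total if total else 1.0
    passed = total > 0 and null_rate <= max_null_rate
    return {"row_count": total, "null_rate": null_rate, "passed": passed}

def route_batch(batch: list[dict], results_table: list, dead_letter: list) -> bool:
    """Fail-visible routing: the check result is recorded as a queryable row
    whether it passes or fails, and a failing batch is isolated in the
    dead-letter queue instead of being written downstream."""
    result = run_quality_checks(batch)
    results_table.append(result)   # evidence for Processing Integrity, pass or fail
    if not result["passed"]:
        dead_letter.append(batch)  # isolate; never reaches Curated or Refined
        return False
    return True
```

The asymmetry is deliberate: the evidence write is unconditional, while the data write is conditional.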

Layer 5: Consumption—No Shared Accounts, Full Isolation

Amazon QuickSight serves business users from the Refined zone, with row-level security enforced via dataset rules: a user on the retirement services team sees only the plan data their role permits. Redshift Spectrum handles the more complex analytical queries inside a VPC, with every user mapped to an IAM role. There are no shared service accounts in this architecture. Every human and every application authenticates under a role scoped to the minimum necessary permissions. This is a specific SOC 2 CC6.1 requirement and also just good security hygiene.
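QuickSight's row-level security is driven by a rules dataset: one row per user or group, plus the column values that principal is allowed to see, with an empty cell meaning no restriction on that field. The group names and the `plan_id` column below are hypothetical.

```python
import csv
import io

# Illustrative RLS rules dataset: the retirement-services group is restricted
# to two plans, while compliance analysts see all rows (empty = unrestricted).
rules = [
    {"GroupName": "retirement-services", "plan_id": "PLAN-1001,PLAN-1002"},
    {"GroupName": "compliance-analysts", "plan_id": ""},
]

def rules_csv(rows: list[dict]) -> str:
    """Serialize the rules as CSV, the shape QuickSight ingests for an RLS dataset."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["GroupName", "plan_id"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```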

Five Engineering Lessons from Production

After running this platform through a full SOC 2 Type II audit cycle, here are the five things I would tell any engineer building a regulated data platform:

1. Build for auditability from day zero. Retrofitting CloudTrail or column-level security onto an existing Glue/Redshift architecture is significantly more disruptive than building it in from the start. The cost of retroactive auditability is measured in weeks of engineering time and months of audit anxiety.

2. Treat IAM as a first-class schema. Every Glue job, every Redshift user, every Lambda function should operate under a role scoped to exactly what it needs. Design your IAM policy structure with the same rigor you apply to your data schema. Tightly scoped roles reduce the blast radius of incidents and dramatically simplify audit scoping.
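What "scoped to exactly what it needs" means in practice is a per-job policy like the sketch below: one bucket, one prefix, read-only. The bucket and prefix names are placeholders; in practice a policy like this would be generated per job from the same metadata that defines the data schema.

```python
def glue_job_policy(bucket: str, prefix: str) -> dict:
    """IAM policy document for a single Glue job's read path: one bucket,
    one prefix, one action. Names are illustrative."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],  # read-only; no s3:* wildcards
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }

policy = glue_job_policy("curated-zone", "plans")
```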

3. Separate authorization from transformation. Put access control in Lake Formation. Do not put it in Glue scripts or dbt models. When access control is in ETL code, it is invisible to the authorization layer, inconsistently applied, and nearly impossible to audit. When it is in Lake Formation, it is enforced uniformly, logged automatically, and auditable without asking the engineering team to reconstruct what any given user could see.

4. Store quality check results as queryable data. Glue job logs are not sufficient for SOC 2 Processing Integrity. Auditors want to query your quality check history the same way they would query your transaction history. Write those results to a table, not just to CloudWatch.

5. Build governance concurrently. The data ownership assignments, the classification scheme, the GitOps workflow for IAM changes: none of these can be introduced six months after the platform goes live and still be adopted. Build them into the operational rhythm from the first sprint.


What the Numbers Looked Like After Go-Live

Quarterly compliance report: from three days to under two hours. New 401(k) plan data feed onboarding: from six weeks to three days. SOC 2 Type II audit result: no material control deficiencies in any area governed by the platform. The architecture worked the way a good architecture should—it made the hard thing easy, and the easy thing automatic.
