Most teams treat compliance like a speed bump you thud over on the way to “done.” I treat it as a design input—same tier as throughput, cost, and fixity. If you bake it in, you stay fast and auditable. If you bolt it on, you get amber: slow, sticky, and permanently behind.
Snarky aside: if your compliance story is a binder, not a button, you don’t have compliance—you have décor.
Verification windows and offsite requirements are getting tighter while estates get messier (hybrid everything, multiple vendors, historical cruft). If your 3-2-1 policy isn’t operational, it’s a slogan. At PB scale, that gap burns years and budgets: you carry dual stacks, retrace old ingest, and rehearse restores you can’t prove.
My real-world constraint (and why it’s messy)
I run with an “artificial SLO” on 3-2-1 today—best effort—because of three factors:
- Lack of automation — There’s no unified toolchain that gracefully drives multiple products across heterogeneous environments. Glue code fills the gaps; glue cracks.
- Staffing reality — You need more than a few admins. You need operators and developers to keep three locations honest and the pipeline healthy.
- Management expectations — Leadership initially underestimated the effort to meet verification/offsite SLOs at scale. The result: optimism debt.
Yes, you could run an entire company just mapping this problem out, building the automation, and keeping it honest. (Oh right—that’s the job.)
What the pipeline looks like today
- Copy-1 and Copy-2 ==(minutes → hours)==: Two locations land asynchronously, driven by an HSM (hierarchical storage manager).
- Copy-3 ==(geo/off-account, daily)==: A script pushes the previous day’s archive to a geo-dispersed location. We send raw files, not containers, to keep failure modes independent (3-2-1 means independence, not just count).
- Sanity check ==(monthly)==: We compare the local tree to the geo tree, find deltas (there are always deltas), and reconcile.
- Verification grind ==(continuous)==: transfer → verify (hash vs. recorded fixity) → tag (provenance + verification time) → ILM release or route to one of three failure states: bad hash, bad transfer, not in manifest.
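As a minimal sketch of that grind, here is the verify-and-tag step written as an explicit state machine in Python. The three failure states come straight from the list above; the SHA-256 hash family, the manifest shape, and the function names are illustrative assumptions, not the production pipeline.

```python
import enum
import hashlib
from datetime import datetime, timezone
from pathlib import Path

class VerifyState(enum.Enum):
    VERIFIED = "verified"
    BAD_HASH = "bad hash"
    BAD_TRANSFER = "bad transfer"
    NOT_IN_MANIFEST = "not in manifest"

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-GB assets don't blow up memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_asset(path: Path, manifest: dict[str, dict]) -> VerifyState:
    """transfer -> verify -> tag -> release, or route to one of three failure states."""
    entry = manifest.get(path.name)
    if entry is None:
        return VerifyState.NOT_IN_MANIFEST          # never promised, never trusted
    if not path.exists() or path.stat().st_size != entry["size"]:
        return VerifyState.BAD_TRANSFER             # copy landed short, or not at all
    if sha256_of(path) != entry["sha256"]:
        return VerifyState.BAD_HASH                 # bits arrived, but not the right ones
    entry["verified_at"] = datetime.now(timezone.utc).isoformat()  # provenance + verification time
    return VerifyState.VERIFIED                     # ILM is now allowed to release the source
```

The point is that every outcome is a named state you can count and report on, not a log line you grep for later.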
Now imagine doing that for 1.2 billion files (~32 PB). Management once thought it could be done manually in 2 years. Seven years later, we’re still validating and correcting edge cases, some reaching back a decade. That’s not incompetence; that’s the cost of retrofitting proof onto history while keeping the lights on.
How to balance agility with regulatory demands or expectations
- Turn policy into SLOs you can measure
  - Set explicit verification windows: e.g., Copy-2 verified ≤24 h; Copy-3 verified ≤7 d; a 1% rolling monthly re-hash across age/size prefixes.
  - Publish verification debt: written minus independently verified. If it grows for >7 days, you’re borrowing against luck.
  - Trip incident mode on mismatch rate (say, >0.01% of assets checked in a 24 h window); a rough sketch of both tripwires follows this list.
- Make independence real
  - Offsite = different account/tenant/control plane. If the same IAM blast radius kills it, it’s not offsite.
  - Push raw files (not tarballs) so corruption is granular and restore is surgical.
  - Practice restore drills with capped egress (e.g., 10 TB/day × 3 days) so “DR” isn’t just PR.
- Fixity is first-class metadata
  - Compute/store checksums at first touch and propagate forward.
  - If object storage is involved, persist multipart part size/count so you can recompute synthetic ETags without re-downloading (see the sketch after this list).
  - Treat verification as a state machine, not a script: idempotent retries, poison-pill isolation, and clear escalation.
- Automate the boring, narrate the risky
  - Automate tree-diffs, manifests, and retries (a tree-diff sketch also follows this list). Humans should adjudicate edge cases, not babysit queues.
  - Keep the control plane separate from the data plane: scheduler hiccups shouldn’t corrupt writes; data-path stumbles shouldn’t nuke the queue.
- Replace binders with buttons
  - One-click report: “Show last verified time + method for Asset X on Copy-3”. If you can’t answer that in one screen, you owe yourself a table.
  - Tag everything with provenance: source, hash family, ingest era/policy version. Future you needs to know which rules applied.
- Staff like you mean it
  - Budget for operators + developers. The delta between “best effort” and “SLO met” is automation—and automation has owners.
  - Clarify on-call scope and escalation: who wakes up when the debt curve bends the wrong way?
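To make “publish verification debt” and the incident trip concrete, here is a rough sketch assuming a daily stats record per copy; the field names and thresholds are placeholders for whatever your catalog actually exposes.

```python
from dataclasses import dataclass

@dataclass
class DayStats:
    written: int      # assets landed on this copy today
    verified: int     # assets independently verified today
    mismatches: int   # hash/size failures found today
    checked: int      # verification attempts today

def verification_debt(days: list[DayStats]) -> int:
    """Written minus independently verified, accumulated over the window."""
    return sum(d.written - d.verified for d in days)

def tripwires(days: list[DayStats], debt_days: int = 7,
              mismatch_rate: float = 0.0001) -> list[str]:
    """Return the alerts the dashboard (and the pager) should raise."""
    alerts = []
    # Debt that grows every day for a week means you're borrowing against luck.
    if len(days) >= debt_days and all(d.written > d.verified for d in days[-debt_days:]):
        alerts.append(f"verification debt growing for >{debt_days} days")
    today = days[-1]
    if today.checked and today.mismatches / today.checked > mismatch_rate:
        alerts.append("mismatch rate above threshold: enter incident mode")
    return alerts
```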
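The synthetic-ETag point deserves code, because it is the difference between re-verifying in place and re-downloading petabytes. This sketch assumes the standard S3 multipart ETag scheme (MD5 of the concatenated part MD5s, suffixed with the part count) and that you persisted part size and count at ingest; it is not a drop-in for every object, since SSE-KMS objects and plain single-PUT uploads behave differently.

```python
import hashlib
from pathlib import Path

def synthetic_etag(path: Path, part_size: int, part_count: int) -> str:
    """Recompute an S3-style multipart ETag from a local copy of the file.

    Requires the part size/count recorded at upload time; if the hashed part
    count doesn't match, the local copy isn't the object you think it is.
    """
    part_digests = []
    with path.open("rb") as f:
        while chunk := f.read(part_size):
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) != part_count:
        raise ValueError(f"expected {part_count} parts, hashed {len(part_digests)}")
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{part_count}"
```

Compare the result against the ETag recorded at ingest; a mismatch routes the asset to the bad-hash state instead of triggering an egress bill.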
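And the tree-diff itself is the easy part to automate; the value is in bucketing deltas so humans only see the judgment calls. A sketch, assuming both sides can produce a {relative_path: checksum} manifest:

```python
def tree_diff(local: dict[str, str], geo: dict[str, str]):
    """Compare two {relative_path: checksum} manifests and bucket the deltas.

    The machine does the walking and hashing; humans adjudicate the output.
    """
    missing_on_geo = sorted(local.keys() - geo.keys())       # candidates for re-push
    unexpected_on_geo = sorted(geo.keys() - local.keys())    # candidates for investigation
    checksum_mismatch = sorted(
        path for path in local.keys() & geo.keys() if local[path] != geo[path]
    )
    return missing_on_geo, unexpected_on_geo, checksum_mismatch
```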
What good looks like (directionally)
- Copy-2 validated within 24 h; Copy-3 within 7 d.
- Rolling monthly 1% re-hash across diverse prefixes (a deterministic sampling sketch follows this list).
- Verification debt zeroed at least weekly; anything older becomes a ticket with an owner.
- Restore drills passing within budget and time caps.
- A dashboard that makes auditors bored. (This is the goal.)
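For the rolling 1% re-hash, a deterministic sample beats a random one: you can reproduce it, audit it, and rotate it monthly without keeping state. A sketch, with the salt format, the example key, and the percentage as assumptions:

```python
import hashlib

def in_this_months_sample(asset_key: str, year_month: str, percent: float = 1.0) -> bool:
    """Deterministically pick ~percent% of assets for this month's re-hash.

    Salting with the month rotates the sample, so over time every prefix and
    vintage gets revisited without a tracking table.
    """
    digest = hashlib.sha256(f"{year_month}:{asset_key}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100   # 1.0% -> buckets 0..99 out of 10,000

# Illustrative only: a hypothetical Copy-3 key checked against October's sample.
print(in_this_months_sample("copy3/2016/batch042/file0001.dat", "2025-10"))
```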
Common traps
- “We’ll hash later.” You won’t. Hash now; propagate forever.
- “Two buckets = offsite.” Same account/keys/control plane means correlated failure.
- “We containerize the geo copy.” Great for throughput, terrible for independence and surgical restores.
- “Ops will catch it.” Give ops state machines and runbooks, not vibes and hope.
Compliance that’s designed in keeps you fast, honest, and fundable. Where does your balance break first—verification window, offsite independence, or staffing/automation? If you had to prove Copy-3 existed and was intact for 50 TB by Friday, could you push the button—or would you grab the binder?
:::tip
Article originally posted on LinkedIn on September 25, 2025.
:::
