After years of searching, there is still no cure for Digital Disposophobia
This is the first in a series that tries to dig deeper into the Data Migration Tax.
Everyone “has” a 3-2-1 strategy—until they try to restore something ugly at 2 a.m. and discover they actually had 1-0-0 with vibes. The sticker on the whiteboard says three copies, two media, one offsite. The real world says: data lives in three simultaneous states (ingest, in-flight, verify), storage systems are not saints, and “offsite” that can’t be validated on demand is just an expensive daydream.
Why it matters
Migrations and long-term preservation don’t fail because we forgot a slogan. They fail because the pipeline and the policy are out of sync. If your second and third copies don’t move at the speed of verification, you’re left with Schrödinger’s archive—probably fine, unprovably so. At multi-petabyte scale, that gap becomes budget-eating: you pay for hardware, people, and power twice while your “redundant” copies slowly converge (or don’t). The fix is a 3-2-1 that is operational, not ornamental.
What people miss
- Copies aren’t independent if the process isn’t. If Copy-2 and Copy-3 both derive from the same unverified staging area or rely on the same fragile job runner, that’s correlated risk masquerading as redundancy.
- “Cloud offsite” isn’t a backup if you treat it like a black hole. If you don’t routinely re-hydrate, list, and verify sample sets (or manifests) from cloud tiers, you’ve outsourced your peace of mind to marketing.
- Fixity has to be first-class metadata. Checksums (and, for objects, multipart ETag awareness) belong with the file from the first touch—not bolted on later with a heroic summer intern (see the sketch after this list).
- People are the third medium nobody budgets. Dual-stack years, on-call rotas, and exception triage are part of 3-2-1. If it takes a wizard to restore, you didn’t buy redundancy—you bought hero worship.
- Lifecycle ≠ longevity. Auto-tier policies are not preservation policies. Your hot/cold rules should follow validation windows and risk, not just last-access dates.
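Here is a minimal sketch of fixity-first ingest, assuming an S3-style object store; the function names are illustrative, and the multipart-ETag prediction only holds if the uploader uses the same part size.

```python
import hashlib
from pathlib import Path

PART_SIZE = 8 * 1024 * 1024  # assumed multipart chunk size; must match the uploader's part size

def sha256_at_first_touch(path: Path) -> str:
    """Hash the file once, at ingest; the value travels with the asset from then on."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def expected_multipart_etag(path: Path, part_size: int = PART_SIZE) -> str:
    """Predict an S3-style multipart ETag (MD5 of the concatenated part MD5s, plus a part count)
    so a later HEAD on the object copy can be compared without re-downloading."""
    part_md5s = []
    with path.open("rb") as f:
        while True:
            part = f.read(part_size)
            if not part:
                break
            part_md5s.append(hashlib.md5(part).digest())
    if not part_md5s:
        return hashlib.md5(b"").hexdigest()
    if len(part_md5s) == 1:
        # assume files that fit in one part go up as a single PUT, whose ETag is a plain content MD5
        return part_md5s[0].hex()
    combined = hashlib.md5(b"".join(part_md5s)).hexdigest()
    return f"{combined}-{len(part_md5s)}"

def ingest_record(path: Path) -> dict:
    """The fixity record created at first contact, before any copy is made."""
    return {
        "asset": str(path),
        "size_bytes": path.stat().st_size,
        "sha256": sha256_at_first_touch(path),
        "expected_etag": expected_multipart_etag(path),
    }
```

The point is that the record exists before Copy-2 or Copy-3 does, so later verification has something independent to compare against.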
Make 3-2-1 operational: the playbook
- Define independence properly.
- Fixity-first ingestion.
- Verification windows as SLOs (a sketch follows this list).
- Staging that doesn’t lie.
- Cloud offsite with teeth.
- Tape as a stabilizer, not a scapegoat.
- Runbook the hard parts.
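As one way to make "verification windows as SLOs" concrete, here is a sketch that reuses the windows quoted later in the Q&A (Copy-2 verified within 24 hours, Copy-3 within 7 days); the copy names and the dataclass are placeholders, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Illustrative SLO windows: how stale a copy's last verification may be before it counts as overdue.
VERIFICATION_SLO = {
    "copy2_tape": timedelta(hours=24),
    "copy3_offsite": timedelta(days=7),
}

@dataclass
class VerificationEvent:
    asset_id: str
    copy: str              # e.g. "copy2_tape" or "copy3_offsite"
    verified_at: datetime  # when the last independent re-hash succeeded

def overdue(events: list[VerificationEvent], now: Optional[datetime] = None) -> list[VerificationEvent]:
    """Return events whose last successful verification is older than the SLO window."""
    now = now or datetime.now(timezone.utc)
    return [e for e in events
            if e.copy in VERIFICATION_SLO and now - e.verified_at > VERIFICATION_SLO[e.copy]]
```

Anything this function returns is a page-worthy exception, not a backlog item; that is what turns the window into an SLO rather than a wish.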
A simple architecture that works
- Copy-1 (Primary Service): fast object or POSIX store with fixity-on-write, immediately hash-validated.
- Copy-2 (Stabilizer Medium): tape (or other long-lived medium) written from validated data with a manifest; periodic scrub reads.
- Copy-3 (Offsite): separate provider/region/account with routine restore-and-verify drills and cost-capped egress rehearsals.
- Glue: a small database (or log-structured store) that tracks asset → checksums → copy locations → verification events. If you can’t answer “when was the last time we proved Copy-3 existed for this asset?” in one query, build that table today.
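A sketch of that glue layer in SQLite, with an illustrative schema (table and column names are assumptions, not a standard); the payoff is that the Copy-3 question is literally one query.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS assets (
    asset_id   TEXT PRIMARY KEY,
    sha256     TEXT NOT NULL,
    size_bytes INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS copy_locations (
    asset_id TEXT NOT NULL REFERENCES assets(asset_id),
    copy     TEXT NOT NULL,          -- 'copy1_primary', 'copy2_tape', 'copy3_offsite'
    locator  TEXT NOT NULL,          -- path, tape barcode, or object URL
    PRIMARY KEY (asset_id, copy)
);
CREATE TABLE IF NOT EXISTS verification_events (
    asset_id    TEXT NOT NULL REFERENCES assets(asset_id),
    copy        TEXT NOT NULL,
    verified_at TEXT NOT NULL,       -- ISO 8601 timestamp of a successful independent re-hash
    method      TEXT NOT NULL        -- 'full-rehash', 'sample-restore', 'scrub-read'
);
"""

db = sqlite3.connect("fixity_catalog.db")
db.executescript(SCHEMA)

def last_copy3_proof(db: sqlite3.Connection, asset_id: str):
    """Answer 'when did we last prove Copy-3 existed (and matched) for this asset?' in one query."""
    row = db.execute(
        "SELECT MAX(verified_at) FROM verification_events "
        "WHERE asset_id = ? AND copy = 'copy3_offsite'",
        (asset_id,),
    ).fetchone()
    return row[0]  # None means it has never been proven
```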
Numbers that keep you honest (order-of-magnitude is fine)
- Verification debt: assets written minus assets independently verified. If it grows for more than a week, you’re borrowing against luck.
- Mismatch threshold: agreed-upon rate that trips incident mode (e.g., >0.01% of assets checked, i.e., more than 10 mismatches per 100k, in a 24-hour window).
- Restore SLOs: class-based targets (e.g., “tier-A: 4-hour partial, 24-hour full,” “tier-B: 24-hour partial, 5-day full”).
- People math: shifts × pipelines × interventions/hour. If the model assumes zero human touches, it’s fiction.
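A back-of-the-envelope sketch of the first two numbers, with the thresholds quoted above baked in as defaults; the function names are made up.

```python
def verification_debt(assets_written: int, assets_independently_verified: int) -> int:
    """Verification debt: assets that exist but that no independent copy check has ever confirmed."""
    return assets_written - assets_independently_verified

def trips_incident(mismatches_24h: int, assets_checked_24h: int, threshold: float = 0.0001) -> bool:
    """True if the 24-hour mismatch rate exceeds the agreed threshold (0.01% = 1e-4, >10 per 100k)."""
    if assets_checked_24h == 0:
        return False
    return mismatches_24h / assets_checked_24h > threshold

# Example: 12 mismatches out of 100,000 assets checked today trips incident mode.
assert trips_incident(12, 100_000) is True
```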
Common anti-patterns (and what to do instead)
- “We’ll hash later.” → You won’t. Hash now, once, at first contact; propagate the value.
- “The cloud copy is our DR.” → Prove it with restores. DR you never rehearse is PR.
- “Two buckets, one account.” → That’s not offsite; that’s folders with delusions of grandeur.
- “The wizard knows the steps.” → Put the wizard’s brain in the runbook, then take a vacation. If anxiety spikes, the runbook is lying.
If your 3-2-1 fits in a one-slide diagram and not in a runbook, you don’t have a strategy—you have clip art.
So what / CTA
Where does your 3-2-1 actually break—in fixity, independence, or rehearse-to-recover? If you had to restore 50 TB by Friday with auditors watching, what fails first: people, pipeline, or provider? Drop your failure story (names optional); I’ll trade you a checklist.
Question: ==How do you balance agility with regulatory demands effectively?==
Short Answer:
Treat compliance as a design input, not an after-the-fact speed bump. The balance is: policy → pipeline → proof.
- Policy (what): Define verification windows and offsite rules as SLOs that map to the regs you care about (immutability/retention, independence, and evidence). Example: “Copy-2 verified ≤24h; Copy-3 (geo/off-account) verified ≤7d; 1% rolling sample/month with full re-hash.”
- Pipeline (how): Make independence real. Our first two copies land minutes to hours apart via HSM (hierarchical storage management), but the geo copy is pushed as raw files (not a container) to keep failure modes independent. We follow with a structure diff to catch the inevitable deltas.
- Proof (show me): Every asset carries fixity at first touch; verification is a state machine, not a hope:
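Here is a minimal sketch of what that state machine might look like; the state names and transition rules are illustrative assumptions. The invariant is that nothing reaches VERIFIED without an independent re-hash, and MISMATCH only exits via the runbook.

```python
from enum import Enum, auto

class FixityState(Enum):
    INGESTED = auto()   # checksum captured at first touch
    COPIED = auto()     # landed on Copy-2 / Copy-3, not yet re-checked
    VERIFIED = auto()   # independent re-hash matched the ingest checksum
    STALE = auto()      # last verification older than the SLO window
    MISMATCH = auto()   # re-hash failed: incident, not a retry loop

# Allowed transitions; anything else is a bug, not an exception to wave through.
TRANSITIONS = {
    FixityState.INGESTED: {FixityState.COPIED},
    FixityState.COPIED:   {FixityState.VERIFIED, FixityState.MISMATCH},
    FixityState.VERIFIED: {FixityState.STALE, FixityState.MISMATCH},
    FixityState.STALE:    {FixityState.VERIFIED, FixityState.MISMATCH},
    FixityState.MISMATCH: set(),  # only an operator, via the runbook, moves an asset out of MISMATCH
}

def advance(current: FixityState, target: FixityState) -> FixityState:
    """Refuse silent shortcuts: an asset cannot jump to VERIFIED without passing a re-hash."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```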
Controls that keep you agile
- Button, not binder: one-click report that answers, “When was Copy-3 last proven for Asset X?”
- SLOs ≠ best-effort: publish debt (assets written – assets independently verified). If it keeps growing for more than 7 days, you’re borrowing against luck.
- Offsite independence: separate account/tenant + restore drills (cost-capped) so “offsite” isn’t just a billing line.
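To make the cost-capped restore drill above concrete, here is a sketch; fetch_offsite is a placeholder for whatever client your offsite provider requires, and the byte budget and sample size are arbitrary defaults.

```python
import hashlib
import random

def restore_drill(catalog: dict[str, str], fetch_offsite, byte_budget: int = 500 * 2**30,
                  sample_size: int = 100) -> dict:
    """catalog maps asset_id -> expected sha256; fetch_offsite(asset_id) returns bytes from Copy-3.
    Stops once the egress byte budget is spent, so the drill has a known worst-case cost."""
    spent, checked, mismatches = 0, 0, []
    for asset_id in random.sample(sorted(catalog), min(sample_size, len(catalog))):
        data = fetch_offsite(asset_id)          # placeholder: provider-specific client call
        spent += len(data)
        checked += 1
        if hashlib.sha256(data).hexdigest() != catalog[asset_id]:
            mismatches.append(asset_id)
        if spent >= byte_budget:
            break
    return {"checked": checked, "bytes_restored": spent, "mismatches": mismatches}
```

Run it on a schedule, file the output with the verification events, and "offsite" stops being a billing line and starts being evidence.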
Snarky aside: if your compliance story is a binder, not a button, you don’t have compliance—you have décor.
