The Disaster That Changed Everything
In October 2019, we made a critical mistake that led to 10,000 important files disappearing overnight. It was a disaster—one that could have ruined our business. But five years later, that same experience saved our new company from an even bigger crisis.
This is a story about data loss, misconfigurations, and the hard lessons that led us to build a bulletproof backup system. If you’re running a system that stores critical data, this could help you avoid making the same mistakes.
Background: How Gama Stored Data
Gama (gama.ir) is a K-12 educational content-sharing platform launched in 2014 in Iran, with over 10 million users worldwide. It provides services such as:
- Past Papers
- Tutorials & Learning Resources
- Online Exams & School Hub
- Live Streaming & Q&A Community
- Tutoring Services
Since our content was user-generated, maintaining secure file storage was a top priority. We used MooseFS, a distributed file system with five nodes and a triple-replication model, ensuring redundancy.
Our Backup Strategy
Our backup strategy was a simple external HDD that held a copy of every file. It worked fine, and we rarely needed it. But then we made a dangerous assumption.
The Migration That Led to Disaster
One of our engineers suggested migrating to GlusterFS, a better-known distributed file system. It sounded great: more scalability, higher adoption, and seemingly better performance. After weighing the costs and benefits, we decided to switch.
Two months later, the migration was complete. Our team was thrilled with the new system. Everything seemed stable… until it wasn’t.
There was just one small problem:
Our backup HDD was 90% full, and we needed to make a decision.
The Mistake
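Because we had rarely needed our backups before, we assumed GlusterFS replication was reliable enough on its own.
We dropped our old backup routine and trusted GlusterFS replication instead.
That was a bad decision. Replication protects against a failed disk or node, but it copies corruption and deletions to every replica just as faithfully as it copies good data.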
The Day Everything Went Wrong
One morning two months later, we started receiving reports that some files were missing.
At first, we thought it was a network glitch, something minor. But as we dug deeper, we found that Gluster was showing missing chunks and sync errors.
- Files were disappearing.
- More and more pages were throwing errors.
- It was spreading fast.
The Immediate Response
3:30 AM: We decided to restart the Gluster network, believing a fresh bootstrap would fix the problem. At first, it seemed to work!
We thought we had solved it.
Then, a WhatsApp message from the content team came in:
“The files are empty.”
Wait, what? The files existed, but they contained nothing.
We checked manually. The files still had size and metadata, but when we opened them, they were completely blank.
10,000 files were gone.
The Backup That Was Useless
We had a backup HDD. That should have saved us, right?
Wrong. After migrating to GlusterFS, we had restructured our directory system, and every file now had a new hashed path in the database.
Our old backups still used the original paths and filenames, so we could not match a backed-up file to its current database record. That made them useless.
We tried multiple recovery methods. Nothing worked.
In the end, we had to email thousands of users, asking them to re-upload their lost files.
It was a nightmare. But it forced us to rethink everything.
How We Fixed It: Introducing Gama File Keeper (GFK)
After this disaster, we completely redesigned our storage and backup strategy. Our solution had two parts:
1. Gama File Keeper (GFK): A Smarter Storage System
- Every uploaded file is mapped to a checksum, making it trackable even if it is renamed.
- Instead of hard deletions, files now go through a 3-month soft-delete period before they are actually removed.
- Recovery is now instant using checksum-based matching.
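To make the three points above concrete, here is a minimal Python sketch of the idea, not GFK's actual code. The names `FileIndex`, `FileRecord`, `sha256_of`, and `find_in_backup` are hypothetical, the choice of SHA-256 is an assumption (the article does not say which checksum GFK uses), and "3 months" is approximated as 90 days.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path
from typing import Dict, List, Optional

# Assumption: "3 months" is approximated as 90 days.
SOFT_DELETE_RETENTION = timedelta(days=90)


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large uploads do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class FileRecord:
    checksum: str
    path: Path
    deleted_at: Optional[datetime] = None


class FileIndex:
    """Hypothetical in-memory stand-in for a checksum column in the database."""

    def __init__(self) -> None:
        self.records: Dict[str, FileRecord] = {}

    def register(self, path: Path) -> FileRecord:
        """Record a file under the checksum of its content."""
        record = FileRecord(checksum=sha256_of(path), path=path)
        self.records[record.checksum] = record
        return record

    def soft_delete(self, checksum: str, now: datetime) -> None:
        """Mark a file as deleted; the bytes stay on disk for now."""
        self.records[checksum].deleted_at = now

    def purgeable(self, now: datetime) -> List[str]:
        """Checksums whose soft-delete window has expired."""
        return [
            checksum
            for checksum, record in self.records.items()
            if record.deleted_at is not None
            and now - record.deleted_at >= SOFT_DELETE_RETENTION
        ]

    def find_in_backup(self, checksum: str, backup_root: Path) -> Optional[Path]:
        """Locate a file in a backup by content, ignoring name and directory."""
        for candidate in sorted(backup_root.rglob("*")):
            if candidate.is_file() and sha256_of(candidate) == checksum:
                return candidate
        return None
```

Because the lookup compares content rather than names, a backup taken before a directory restructure can still be matched to its record. This sketch rehashes the backup on every lookup, which is slow; a real system would presumably store the checksum of each backed-up file once and look it up directly.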
2. Backapp: A Multi-Layered Backup Strategy
We no longer rely on a single storage system. Instead, we implemented a three-layered backup strategy:
- Warm Backup (Every 2 Hours): Synced within the same data center.
- Cold Backup (Every 6 Hours): Replicated to a separate data center.
- Offline Backup (Weekly): Stored on physical HDDs in a separate location.
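The three layers above boil down to a small schedule table. The following Python sketch shows one way to decide which layers are due at a given moment. It is an illustration, not how Backapp is implemented: the intervals come from the list above, while `BackupLayer`, `layers_due`, and the destination labels are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class BackupLayer:
    name: str
    interval: timedelta
    destination: str  # hypothetical label, not a real host or path


# Intervals are taken from the article; destinations are placeholders.
LAYERS = [
    BackupLayer("warm", timedelta(hours=2), "same-datacenter"),
    BackupLayer("cold", timedelta(hours=6), "second-datacenter"),
    BackupLayer("offline", timedelta(weeks=1), "offsite-hdd"),
]


def layers_due(last_run: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of layers whose interval has elapsed since their last run.

    A layer that has never run (no entry in last_run) is always due.
    """
    due = []
    for layer in LAYERS:
        previous = last_run.get(layer.name)
        if previous is None or now - previous >= layer.interval:
            due.append(layer.name)
    return due
```

Keeping each layer's last run time separate means a failed cold backup does not hide behind a successful warm one: each layer stays due until its own run succeeds.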
Database Backups
- Full backups every 24 hours, stored for 12 months.
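A retention rule like this one is easy to get wrong at the boundary, so here is a minimal Python sketch of the pruning step. It assumes "12 months" means 365 days and that each daily backup is identified by its date; `expired_backups` is a hypothetical helper, not part of our tooling.

```python
from datetime import date, timedelta
from typing import Iterable, List

# Assumption: "12 months" is approximated as 365 days.
RETENTION = timedelta(days=365)


def expired_backups(backup_dates: Iterable[date], today: date) -> List[date]:
    """Return backup dates older than the retention window, oldest first.

    A backup taken exactly 365 days ago is still kept.
    """
    cutoff = today - RETENTION
    return sorted(d for d in backup_dates if d < cutoff)
```

The function only reports what is safe to delete. Keeping the actual deletion in a separate step makes it easier to review the list before anything is removed.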
The Real Test: How This System Saved Us in 2025
Fast forward five years. Gamatrain.com, our new business in the UK, faced another rare incident.
But this time, we didn’t lose a single file.
Why? Because of the lessons we learned in 2019 and the system we built to prevent it.
Lessons for Every Engineer
- Never trust a single storage system—even if it seems rock solid.
- Backups should be independent, multi-layered, and stored in different locations.
- Disasters will happen. Your resilience depends on how well you prepare for them.