How we lost 10,000 files overnight
The catastrophe that changed everything
In October 2019, we made a fateful mistake that caused 10,000 important files to disappear overnight. It was a disaster – the kind that can destroy a business. But five years later, that same experience saved our new company from an even greater crisis.
This is a story about data loss, bad assumptions, and the hard lessons that led us to build a disaster-resistant backup system. If you run a system that stores important data, it may help you avoid the same mistakes.
Background: how GAMA stored its data
GAMA (GAMA.IR) is a K-12 educational content-sharing platform launched in Iran in 2014, with more than 10 million users worldwide. It provides services such as:
- Past exam papers
- Tutorials and learning resources
- Online exams and a school hub
- Live streaming and a community Q&A
- Tutoring services
Since the content is user-generated, keeping file storage safe was a top priority. We used MooseFS, a distributed file system, with five nodes and triple replication to ensure redundancy.
Our backup strategy
We also stored a copy of every file on simple external HDDs. It worked well, and we rarely needed it. But then we made a serious assumption.
The migration that led to disaster
One of our engineers suggested migrating to GlusterFS, a better-known distributed file system. It looked great – more scalable, more highly available, and apparently better performing. After weighing the costs and benefits, we decided to switch.
Two months later, the migration was complete. Our team was happy with the new system. Everything seemed stable… until it wasn't.
There was just one small problem:
The backup HDDs were 90% full, and we needed to make a decision.
The mistake
Because we had never needed our full backups before, we assumed GlusterFS was reliable enough.
We dropped the old backup strategy and trusted GlusterFS.
This was a bad decision.
The day everything went wrong
Two months later, one morning, we started receiving reports that some files were missing.
At first, we thought it was a network glitch – something simple. But as we dug deeper, we found Gluster reporting missing data and synchronization errors.
- Files were disappearing.
- More and more pages were throwing errors.
- It was spreading fast.
Immediate response
3:30 AM: We decided to restart the Gluster cluster, believing a fresh bootstrap would fix the problem. At first, it seemed to work!
We thought we had resolved it.
Then a WhatsApp message came in from the content team:
“The files are empty.”
Wait, what? The files existed, but they contained nothing.
We panicked. The files still had their sizes and metadata, but when we opened them, they were completely empty.
10,000 files were gone.
The backup that was useless
We had the HDD backups. They should have saved us, right?
Wrong. After the GlusterFS migration, we had restructured our directory layout, and every file had been assigned a new path in the database.
The old backups were useless because the file names no longer matched.
We tried several recovery methods. Nothing worked.
In the end, we had to email thousands of users and ask them to re-upload their lost files.
It was a nightmare. But it forced us to rethink everything.
How we fixed it: GAMA File Guard (GFG)
After this disaster, we completely redesigned our storage and backup strategy. The new design had two parts:
1. GAMA File Guard (GFG): a smarter storage system
- Every uploaded file is identified by a checksum, so it can be tracked even if it is renamed.
- Instead of hard deletion, files now go through a soft-deletion window of 3 months before removal begins.
- Recovery is now immediate, using checksum-based matching.
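The idea behind the points above can be sketched in a few lines. This is a minimal toy, not GFG's actual implementation: the class and method names are hypothetical, and the index is an in-memory dict standing in for the real database.

```python
import hashlib
from datetime import datetime, timedelta
from pathlib import Path

SOFT_DELETE_WINDOW = timedelta(days=90)  # ~3 months before hard removal

def file_checksum(path: Path) -> str:
    """SHA-256 of the file contents; identifies a file even after renames."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

class FileGuard:
    """Toy index: files are keyed by content hash, not by path."""

    def __init__(self):
        self.index = {}  # checksum -> current path
        self.trash = {}  # checksum -> soft-deletion timestamp

    def register(self, path: Path) -> str:
        """Record (or re-record) a file under its content hash."""
        digest = file_checksum(path)
        self.index[digest] = path
        return digest

    def soft_delete(self, digest: str, now: datetime) -> None:
        """Mark for deletion; the file survives the grace window."""
        if digest in self.index:
            self.trash[digest] = now

    def restore(self, digest: str):
        """Recover by checksum, regardless of the file's current name."""
        self.trash.pop(digest, None)
        return self.index.get(digest)

    def purge(self, now: datetime) -> None:
        """Hard-delete only entries past the soft-delete window."""
        expired = [d for d, t in self.trash.items()
                   if now - t >= SOFT_DELETE_WINDOW]
        for d in expired:
            self.index.pop(d, None)
            del self.trash[d]
```

Because the key is the content hash rather than the path, a renamed or moved file still maps to the same entry – exactly the property our 2019 backups lacked.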
2. A multi-layer backup strategy
We no longer rely on a single storage system. Instead, we implemented a three-layer backup strategy:
- Warm backup (every 2 hours): real-time synchronization within the data center itself.
- Cold backup (every 6 hours): copies to a separate data center.
- Offline backup (weekly): stored on physical hard drives at a separate location.
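Something as simple as cron can drive the three tiers above. A hypothetical crontab sketch – the hosts and paths are placeholders, not our actual configuration:

```shell
# Warm: every 2 hours, sync within the same data center.
0 */2 * * *  rsync -a --delete /srv/files/ warm-node:/srv/files-mirror/
# Cold: every 6 hours, copy to a separate data center.
0 */6 * * *  rsync -a /srv/files/ cold-dc:/backups/files/
# Offline: weekly archive, staged for copying to external drives.
0 3 * * 0    tar -czf /mnt/offline-stage/files-$(date +\%F).tar.gz /srv/files
```

Note that the offline tier still ends on physical media that no network failure can reach – the one layer that would have saved us in 2019.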
Database backups
- A full backup every 24 hours, retained for 12 months.
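As a sketch, a daily full dump with 12-month retention could look like this. The original post does not name the database, so PostgreSQL, the database name, and the paths are all assumptions:

```shell
# Hypothetical crontab entries; database name and paths are placeholders.
0 2 * * *   pg_dump -Fc gama > /backups/db/gama-$(date +\%F).dump
30 2 * * *  find /backups/db -name 'gama-*.dump' -mtime +365 -delete
```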
The real test: how this system saved us in 2025
Fast-forward five years. Gamatrain.com, our new venture in the United Kingdom, faced another rare incident.
But this time, we did not lose a single file.
Why? Because of the lessons we learned in 2019 and the system we built to prevent a repeat.
Lessons for each engineer
- Never trust a single storage system – even if it looks rock-solid.
- Backups must be independent, multi-layered, and stored in different locations.
- Disasters will happen. Your resilience depends on how well prepared you are for them.
#DevOps #BackupStrategy #DataRecovery #EngineeringFails #DisasterRecovery