The world is a much more connected place than it was 20 years ago, and with that change came new expectations. Downtime can cripple businesses. There is an expectation of privacy and security. If a post on Facebook is lost to some technical issue, people will not tolerate it. Data integrity and durability are more critical than ever.
As an organization, we must consider multiple technological tenets—security, availability, latency, durability, integrity, reliability, and privacy. We don’t get to choose from this list; we just sort it by importance based on business needs.
Business Continuity
Disasters come in many flavors, ranging from natural disasters to hardware failures, misconfiguration, and cyberattacks. The problem with cyberattacks is that beyond the risk of data theft, we are left wondering whether the rest of the data is trustworthy. Maybe the attack was a sleight of hand, and the primary objective was to introduce malicious data that would haunt the organization forever.
After a security incident, we need to recover the data and restore the service to the level our customers expect. One of the issues we face is reconstructing the original picture and configuration: we need to understand how all the pieces work together and their interdependencies. Data restoration is more than just restoring a snapshot of a highly scalable web service.
Objective
At this point, our primary objective should be to ensure that our services are available and reliable by whatever means necessary. In the context of dependency management, we need to have a comprehensive data and dependency catalog, including the relationships between the entities. This provides us with the necessary insight into what exists. We also need regular operations testing and robust disaster recovery through automation and manual runbooks. We need to cultivate a culture of disaster recovery fire drills and identify fallbacks.
Maturity Model
Just as a college student progresses through freshman, sophomore, junior, and senior years before graduating, businesses need to follow a slow, progressive approach. It is impractical for a company or organization to jump from no disaster recovery to a fully fledged, perfectly fleshed-out plan. The road to success is incremental, and each step builds on the previous one.
Maturity models examine an organization’s current state from a specific perspective, such as disaster recovery, and help the organization reach the desired state or capability. As the organization progresses toward the end goal, it passes through various checkpoints, usually called levels. Most maturity models have four levels, starting from Level 1 up to Level 4.
Why are we talking about maturity models? One step in such a disaster recovery maturity model requires that the organization understand the dependencies among its various entities and data objects. Recovery usually requires restoring the data and bringing critical systems back in a specific incremental order, which is only possible when those dependencies are known.
Implementation
Depending on business needs, understanding dependencies and their relationships sits somewhere in the middle levels of the maturity model. In a 4-level system, it is most likely in Level 2 or Level 3.
The other approach is to split dependency management into the same number of milestones as there are levels. If we are using a 4-level maturity model, we can spread dependency management across all 4 levels, starting with the most straightforward steps and ending with the most complicated, expensive, and involved ones.
Going forward, we will only discuss the dependency management aspect of this maturity model.
Level 1 – Discovering and Understanding Dependencies
For many organizations, this can be the most daunting task. If dependency management was never considered necessary, there may be no institutional knowledge to draw on. We will start with technical dependencies and then cover process dependencies.
Regarding technical dependencies, we need to identify the direct dependencies first. If the code is written in JavaScript and runs on a cloud service, then the direct dependencies might be JavaScript, Node.js, AWS Lambda, etc. If we use npm for package management, the packages listed in package.json are direct dependencies. Next, we have transitive dependencies, which are pulled in automatically by the direct dependencies. In practice, this forms a directed acyclic graph.
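As a minimal sketch of the first step, assuming a Node.js project with a package.json in the working directory, the direct dependencies can be enumerated like this (the transitive graph can then be expanded from the lockfile or npm ls --all):

// Sketch: enumerate the direct dependencies declared in package.json.
// Transitive dependencies would come from package-lock.json or
// `npm ls --all --json`; together they form the dependency DAG.
import { readFileSync } from "fs";

type PackageJson = {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
};

const pkg: PackageJson = JSON.parse(readFileSync("package.json", "utf8"));
const direct = { ...(pkg.dependencies ?? {}), ...(pkg.devDependencies ?? {}) };

for (const [name, range] of Object.entries(direct)) {
  console.log(`direct dependency: ${name} ${range}`);
}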
Another aspect of technical dependencies is the cloud architecture. Take AWS as an example: we might use EC2, S3, API Gateway, CloudFront, ELB, etc. If the cloud resources were deployed using tools like CloudFormation or CDK, determining the dependencies is trivial. If the resources were created manually, the dependency diagram must be drawn by hand, by inspecting the resources and the source code that uses them.
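For illustration, here is a minimal CDK sketch in TypeScript with hypothetical resource names; because the Lambda function references the bucket, CDK records the dependency edge for us:

import { Stack, StackProps } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as lambda from "aws-cdk-lib/aws-lambda";

// Hypothetical stack: the function's environment references the bucket,
// so CDK records a dependency edge between the two resources.
export class OrdersStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const bucket = new s3.Bucket(this, "OrdersBucket");

    new lambda.Function(this, "OrdersHandler", {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: "index.handler",
      code: lambda.Code.fromAsset("lambda"),
      environment: { BUCKET_NAME: bucket.bucketName },
    });
  }
}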
Process dependency falls under the non-technical category and defines how different stakeholders work together to move forward. It can also outline the release process or runbooks to invoke during an incident.
Level 2 – Set Up Continuous Dependency Validation
Dependencies can change because there are too many cooks in the kitchen. The organization’s continuous integration tool should enumerate and track the direct code and library dependencies. Every time code is pushed, a snapshot of the currently pinned dependencies should be saved so that specific circumstances can be recreated or a known good state restored.
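A minimal sketch of such a snapshot step for a Node.js project; the GITHUB_SHA environment variable and the snapshots/ directory are assumptions standing in for whatever the CI system provides:

// Sketch: archive the pinned lockfile on every push so any past
// dependency state can be recreated. Paths and env vars are illustrative.
import { copyFileSync, mkdirSync } from "fs";
import { join } from "path";

const sha = process.env.GITHUB_SHA ?? "local";
const outDir = join("snapshots", sha);

mkdirSync(outDir, { recursive: true });
copyFileSync("package-lock.json", join(outDir, "package-lock.json"));
console.log(`Saved dependency snapshot for commit ${sha}`);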
Like continuous integration, the non-technical aspects should have some automation. Alerts should be created on metrics, and using tools like PagerDuty, the appropriate people should be notified of issues through proper escalation channels. The alerting should be tested occasionally to avoid the known problem called Dogs Not Barking: a situation where you cannot tell whether you are getting no alerts because nothing is wrong or because the alerting logic is broken.
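One way to catch Dogs Not Barking is a scheduled heartbeat that deliberately exercises the alerting path; if the synthetic alert never arrives, the pipeline itself is broken. A sketch, assuming Node.js 18+ (global fetch) and a hypothetical webhook URL:

// Sketch: guard against Dogs Not Barking by firing a synthetic alert
// on a schedule. If the on-call tooling never receives it, the alerting
// pipeline itself is broken. The webhook URL is hypothetical.
const HEARTBEAT_URL = "https://alerts.example.com/heartbeat";
const INTERVAL_MS = 60 * 60 * 1000; // hourly

async function sendHeartbeat(): Promise<void> {
  const res = await fetch(HEARTBEAT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ source: "alerting-canary", sentAt: Date.now() }),
  });
  if (!res.ok) {
    // The heartbeat endpoint itself failed: escalate out of band.
    console.error(`Heartbeat failed with status ${res.status}`);
  }
}

setInterval(() => void sendHeartbeat(), INTERVAL_MS);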
Some dependencies can be validated using the scream test: with proper notification and preparation, turn off specific dependencies and see what fails. You may find a load-bearing Mac Mini. Collect all of this configuration and manage it through a provisioning tool like Chef or Terraform.
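A scream test can be as simple as a kill switch in front of a suspected-unused dependency. A sketch with hypothetical names (the environment flag and legacyApiCall are illustrative):

// Sketch of a scream test: when the flag is set, calls to a
// suspected-unused dependency fail loudly so real consumers surface.
// The flag name and legacyApiCall are hypothetical.
async function legacyApiCall(payload: unknown): Promise<unknown> {
  return payload; // stand-in for the real dependency call
}

export async function callLegacyApi(payload: unknown): Promise<unknown> {
  if (process.env.SCREAM_TEST_LEGACY_API === "1") {
    throw new Error(
      "Scream test: legacy API disabled on purpose. " +
        "If you depend on it, contact the platform team."
    );
  }
  return legacyApiCall(payload);
}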
Level 3 – Fixing Dependencies by Adding, Updating, or Removing
Once we have all the dependencies figured out, take a moment to pat yourself on the back. Now, back to the issue: the work isn’t done. During the quest, you will realize that some dependencies cause more problems than they solve.
One approach is to update the dependencies. For example, if we are using old versions of OpenSSL, updating to newer versions can resolve many of the warnings raised by the build system and even help us pass security compliance checks. Sometimes, the specific versions of two separate packages are incompatible, and the solution is to find versions that can work together. This involves changing versions and might mean downgrading one or more packages.
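Finding versions that can coexist is mostly range arithmetic. A sketch using the semver npm package; the package names and ranges are illustrative:

// Sketch: check whether one candidate version satisfies the ranges
// two different consumers demand. Names and ranges are illustrative.
import * as semver from "semver";

const demandedByA = "^1.1.0"; // what package A requires of libfoo
const demandedByB = ">=1.0.0 <1.3.0"; // what package B requires of libfoo

console.log(semver.intersects(demandedByA, demandedByB)); // true: ranges overlap

const candidate = "1.2.5";
console.log(
  semver.satisfies(candidate, demandedByA) &&
    semver.satisfies(candidate, demandedByB)
); // true: pinning libfoo@1.2.5 satisfies both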
Sometimes, simply removing a package or dependency solves the problem. It could be a dead dependency that is no longer needed but still causes warnings and spams the application logs.
Under some circumstances, we might have to add a dependency to bring harmony. Suppose a transitive dependency is causing the problem, but we cannot update the direct dependency. In that case, explicitly adding the fixed version of the transitive dependency as a direct dependency can be an acceptable solution.
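With npm 8.3 or later, the overrides field in package.json can pin a transitive dependency to a fixed version. A sketch that applies such a pin programmatically; the package name and version are illustrative:

// Sketch: pin a problematic transitive dependency via npm's "overrides"
// field (npm 8.3+). The package name and version are illustrative.
import { readFileSync, writeFileSync } from "fs";

const pkg = JSON.parse(readFileSync("package.json", "utf8"));
pkg.overrides = {
  ...(pkg.overrides ?? {}),
  libfoo: "1.2.5", // force every copy of libfoo in the tree to this version
};

writeFileSync("package.json", JSON.stringify(pkg, null, 2) + "\n");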
Level 4 – Manual and Automated Disaster Recovery Drills
This is the highest form of maturity in disaster recovery. We need runbooks that outline the roles and responsibilities of individuals involved. The runbook should also define the interdependencies and criticality of each step. The disaster recovery process should be practiced regularly by simulating a disaster and identifying possible sources of friction that get introduced when the plan is set in motion.
The gold standard is automated disaster recovery under human supervision. Automating the disaster recovery workflow reduces the chance of human mistakes going unnoticed. Automation has the additional advantage that we can run stress tests and simulated disasters, turning off certain parts of the network in a controlled manner to validate whether our automated disaster recovery performs as expected.
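As a sketch of what supervised, automated recovery could look like, the runbook below declares steps with dependencies and criticality, and a simple runner executes them in dependency order; the step names are illustrative, and the approval gate is a log line standing in for a real human sign-off mechanism:

// Sketch: run recovery runbook steps in dependency order. Step names
// are illustrative; a real system would block on human approval where
// this sketch only logs.
type Step = {
  name: string;
  dependsOn: string[];
  critical: boolean;
  run: () => Promise<void>;
};

const runbook: Step[] = [
  { name: "restore-database", dependsOn: [], critical: true, run: async () => {} },
  { name: "start-api", dependsOn: ["restore-database"], critical: true, run: async () => {} },
  { name: "warm-caches", dependsOn: ["start-api"], critical: false, run: async () => {} },
];

async function runDrill(steps: Step[]): Promise<void> {
  const done = new Set<string>();
  const pending = [...steps];
  while (pending.length > 0) {
    // Run every step whose dependencies have all completed.
    const ready = pending.filter((s) => s.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Dependency cycle in runbook");
    for (const step of ready) {
      if (step.critical) console.log(`Awaiting human approval: ${step.name}`);
      await step.run();
      done.add(step.name);
      pending.splice(pending.indexOf(step), 1);
    }
  }
}

runDrill(runbook).catch((err) => console.error(err));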
Conclusion
Dependency management plays a critical role in disaster recovery, which is essential after a security incident. By following a structured and methodical approach, organizations can progressively build their understanding and control of their dependencies. Each maturity model level builds on the previous one and strengthens the organization’s security posture.
While it can be easy to write off dependency management as a superfluous technical exercise, it is primarily about creating a comprehensive framework that brings computer systems, people, and processes together to work harmoniously. This systematic approach ensures that when disaster strikes, even in the form of a security breach, we can safely recover the data and be confident in its integrity.