The NIST ‘Recover’ Domain – The importance of a good Disaster Recovery Plan

Last month we had another one of those days: a software bug caused a global disruption. A faulty update to CrowdStrike's security software turned out to be so severe that Windows machines crashed with a Blue Screen of Death (BSOD). Even though CrowdStrike fixed the issue within 90 minutes and stopped pushing the faulty update, the damage had been done. I sympathise with the IT departments that had to deal with this, as it must have caused massive chaos. The incident, in which computer systems failed worldwide, highlights the need for a robust Disaster Recovery (DR) plan. This article discusses the importance of a good DR plan and walks through the essential steps: inventory, plan, test, learn and repeat.

Inventory: understand what you need to protect

The first step in creating an effective DR plan is taking an inventory. This involves making a complete and detailed list of all critical IT assets within your organization.

This includes servers, network equipment, software applications, data storage and even physical locations. Understanding which systems and data are critical to your core processes helps you prioritize protection measures and gives you the foundation for the plan itself.

When taking inventory, it is also important to identify dependencies between systems. This means understanding how the different components of your IT infrastructure are connected and how a failure in one system can impact others. It is advisable to start from the organization's core processes and, from that perspective, determine how to get those processes back up and running when things go wrong.
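To make the idea of dependencies concrete, here is a minimal Python sketch. It assumes the inventory can be expressed as a simple mapping from each asset to the assets it depends on. The asset names and the helper functions `recovery_order` and `impacted_by` are hypothetical and only serve as illustration; a real inventory would typically come from a CMDB or asset register.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each asset maps to the assets it depends on.
dependencies = {
    "order-portal": {"app-server", "database"},
    "app-server": {"network-core", "identity-service"},
    "database": {"storage", "network-core"},
    "identity-service": {"network-core"},
    "storage": set(),
    "network-core": set(),
}


def recovery_order(deps):
    """Return assets in an order where each asset comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def impacted_by(failed, deps):
    """Return every asset that directly or indirectly depends on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for asset, needs in deps.items():
            if current in needs and asset not in impacted:
                impacted.add(asset)
                frontier.append(asset)
    return impacted


if __name__ == "__main__":
    print("Restore in this order:", recovery_order(dependencies))
    print("Affected if storage fails:", sorted(impacted_by("storage", dependencies)))
```

Even a small model like this answers two questions that matter during an incident: what else breaks when a given system fails, and in which order systems have to be restored before a core process can run again.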

Plan: develop a strategic DR plan

With a thorough inventory, you can move on to the planning phase. A strategic DR plan should include clear procedures for different disaster scenarios, such as natural disasters, cyber attacks, hardware failures and human error. It is essential to assign specific responsibilities to team members and ensure that everyone knows what is expected of them in case of an emergency.
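As an illustration, the requirement that every scenario has a procedure and a responsible person can be checked automatically. The sketch below assumes the plan is kept as structured data. The `Procedure` class, the `plan_gaps` function and the example entries are hypothetical, and the scenario list simply mirrors the examples mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Procedure:
    scenario: str                                   # e.g. "cyber attack"
    owner: str = ""                                 # person or role responsible
    steps: list[str] = field(default_factory=list)  # documented recovery steps


# Scenarios the plan is expected to cover (taken from the examples in the text).
REQUIRED_SCENARIOS = {"natural disaster", "cyber attack", "hardware failure", "human error"}


def plan_gaps(procedures):
    """Return readable gaps: uncovered scenarios, missing owners, empty procedures."""
    gaps = []
    covered = {p.scenario for p in procedures}
    for scenario in sorted(REQUIRED_SCENARIOS - covered):
        gaps.append(f"no procedure for scenario: {scenario}")
    for p in procedures:
        if not p.owner.strip():
            gaps.append(f"no owner assigned for: {p.scenario}")
        if not p.steps:
            gaps.append(f"no steps documented for: {p.scenario}")
    return gaps


if __name__ == "__main__":
    plan = [
        Procedure("cyber attack", "security lead", ["isolate affected hosts"]),
        Procedure("hardware failure"),  # deliberately incomplete
    ]
    for gap in plan_gaps(plan):
        print(gap)
```

A check like this does not replace a review by people, but it makes it harder for a scenario without an owner to go unnoticed.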

A good DR plan also includes a communication plan. This plan should describe how to communicate internally and externally during and after a disaster. The CrowdStrike incident highlights the importance of transparent communication to prevent panic and keep customers and partners informed of the recovery measures taken.

Test: ensure regular exercises

A DR plan is only as effective as the testing you do. Regular tests are crucial to verify that your plan works in practice. This can range from tabletop exercises, where you theoretically walk through disaster scenarios, to full-scale tests where you assess the operation of your DR plan in a realistic situation.

Testing your DR plan helps identify weaknesses and potential bottlenecks. By uncovering these problems before a real disaster strikes, you can ensure that your plan remains up-to-date and effective.

Learn: draw lessons from every incident

After every test or actual disaster, it’s important to carry out an evaluation and learn from the experience. This process includes analyzing what went well, what did not go well and what improvements can be made. Learning from incidents and tests helps to continuously improve and adapt your DR plan to new threats and technologies.

Repeat: continuous improvement and updating

Developing a DR plan is not a one-off task. It is an ongoing process that needs to be repeated and updated regularly. Technologies evolve, new threats emerge and business needs change. By regularly reviewing and updating your DR plan, you can ensure that you are always prepared for the latest challenges.

The CrowdStrike incident highlights how vulnerable even the most sophisticated IT systems can be and how important it is to have a robust and up-to-date DR plan. By taking inventory, planning, testing, learning and repeating, you can minimize the impact of disasters and ensure the continuity of your business processes. The IT chain is only as strong as its weakest link!

Of course, it is good to keep in mind that although CrowdStrike caused this catastrophic incident, it has still prevented more downtime for its customers than it caused.