
The Importance of Having, Testing, Updating, Automating and Verifying Your Patching, DR and Backup Plans

  • Writer: PLL Jesse
  • Jul 26, 2024
  • 5 min read

The events of last week with CrowdStrike demonstrated the importance of having a disaster recovery plan. Disasters are Murphy’s Law in action – whatever can go wrong, will go wrong. This is not the first time an errant update has broken numerous business systems, and it likely will not be the last. No doubt this outage was difficult to plan for, and its effects will be felt in the weeks and months to come. It is, in fact, likely that very few DR plans accounted for the possibility of an outage such as this. Security agents are, after all, typically homogenous within an organization and therefore represent a potential single point of failure. However, proper planning for patching and disaster recovery can mitigate or eliminate these issues.

When planning a patching strategy in today’s environment, three major rules exist:

1.)    Patch your systems as soon as it is safe to do so. Not patching leaves you vulnerable to compromise, and the last thing you want is to deal with a recovery scenario where your data has been compromised.

2.)    The second rule seems like the opposite of the first: don’t patch immediately – at least not on production systems. Patching one thing may break another. Give it a day or more to see whether issues are reported. This doesn’t guarantee there will be no problems, but it reduces the chance of being stuck with a problem nobody yet has a solution for.

3.)    Where possible, roll out first to a test group to verify that nothing is impacted. Ideally these are not end-user systems, and they should run a suite of tests to verify proper function. Then roll out to a pilot group of real users, and finally to the remaining systems (a minimal sketch of such a ring structure follows this list).
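
As a concrete illustration of rule three, the sketch below defines hypothetical deployment rings – an automated test group, a pilot of real users, then everything else. The ring names, soak times and flags are assumptions to adapt to your own tooling, not a prescription.

```python
from dataclasses import dataclass

# Hypothetical deployment rings for a phased patch rollout.
@dataclass
class Ring:
    name: str
    soak_days: int        # how long to observe before promoting to the next ring
    requires_tests: bool  # run the automated validation suite on this ring

RINGS = [
    Ring("test", soak_days=1, requires_tests=True),    # non-user test systems
    Ring("pilot", soak_days=2, requires_tests=False),  # small group of real users
    Ring("production", soak_days=0, requires_tests=False),
]
```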

Patching is a balancing game – you don’t want to patch too soon, lest an undocumented issue leave you unable to work, but you also don’t want to lag so far behind that your systems are vulnerable. Using a phased patching schedule and test systems, along with automated validation tests, can make this process faster and reduce the probability of issues. Using modern IT tools, test systems can be spun up on demand, patched and validated, and the results can then be sent to the appropriate approvers to automate the next round of deployment to pilot users. Similar to the DevOps approach to development, this can speed up your patching process while reducing risk. If issues are found during testing, deployment to production systems is halted. Implementing this type of testing for all patch types (including definition updates for security software) should be part of an effective patching strategy.
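
Below is a minimal sketch of that automated flow. The callables are stand-ins for whatever your patching or RMM tooling actually provides; the names and signatures here are assumptions, not any specific product’s API.

```python
from typing import Callable

def phased_patch_rollout(
    provision_test_vm: Callable[[], str],        # spin up a disposable test system
    apply_patches: Callable[[str], None],        # apply the pending patch batch
    run_validation_suite: Callable[[str], bool], # automated functional validation
    notify_approvers: Callable[[str, bool], bool],  # send results; returns approval
    deploy_to: Callable[[str], None],            # push the batch to a ring
) -> bool:
    vm = provision_test_vm()
    apply_patches(vm)
    passed = run_validation_suite(vm)
    if not passed:
        notify_approvers("test", False)  # halt: issues found during testing
        return False
    if notify_approvers("test", True):   # approvers gate the next ring
        deploy_to("pilot")
        deploy_to("production")          # after the pilot soak period
    return True
```

The key design point is the gate: a validation failure stops the rollout before the pilot ring ever sees the patch.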

Disaster recovery requires even more balance – you need to determine how current the data needs to be. For workloads that are transaction sensitive, it may be better to keep the data separate from the standby systems – that way the time-critical data is replicated, but a change to the production systems that crashes the main servers does not immediately get replicated to the DR systems. In this instance, your DR strategy might differ slightly from the traditional strategy of replicating the live server in its entirety. Or perhaps you adopt a hybrid approach where the main database is served by high-performance servers and the DR replica sits on a PaaS instance that is not subject to the same updates, maintaining functionality while eliminating the single point of failure.
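
One way to get that separation is to replicate continuously but apply changes to the standby only after a delay, so a destructive change can be caught before it reaches the DR copy (several databases offer this natively; PostgreSQL’s recovery_min_apply_delay is one example). The toy sketch below illustrates the idea only – it is not any product’s replication API.

```python
import time
from collections import deque

APPLY_DELAY = 15 * 60  # assumption: apply changes to the standby 15 minutes late

pending = deque()  # (received_at, change) pairs waiting to be applied

def receive_change(change):
    """Called as soon as a change is streamed from production."""
    pending.append((time.time(), change))

def apply_due_changes(apply_fn):
    """Apply only changes that are older than the configured delay."""
    while pending and time.time() - pending[0][0] >= APPLY_DELAY:
        _, change = pending.popleft()
        apply_fn(change)
```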

Regardless of the disaster recovery strategy, DR only works if you have some form of it in place. So, step one: develop a DR plan. Figure out how long your various IT systems can reasonably be down without severely impacting business operations. Some things may tolerate downtime better than others. Plan for the vital things first. Once the basics are in place, build out from there, adding to your DR build-out as you are able. Consider whether certain services can be moved off servers and into more highly available environments – consider Azure File Shares, AWS FSx or Google Filestore instead of a file server. Consider OneDrive or something similar for user home directories. Analyze where your single points of failure are – if the office burns down, do you still have access to your data? Can your workers work remotely? Planning for an eventual outage can mean the difference between temporary business impairment and going out of business.
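
A simple way to start is an inventory that ranks systems by how much downtime and data loss the business can tolerate, so the vital things get planned first. The sketch below uses hypothetical systems and RTO/RPO values purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class SystemTier:
    system: str
    rto_hours: float  # maximum tolerable downtime
    rpo_hours: float  # maximum tolerable data loss

# Illustrative inventory; replace with your own systems and objectives.
INVENTORY = [
    SystemTier("ERP database", rto_hours=4, rpo_hours=0.25),
    SystemTier("File shares", rto_hours=24, rpo_hours=4),
    SystemTier("Intranet wiki", rto_hours=72, rpo_hours=24),
]

# Plan (and fund) the tightest objectives first.
for tier in sorted(INVENTORY, key=lambda t: (t.rto_hours, t.rpo_hours)):
    print(f"{tier.system}: RTO {tier.rto_hours}h, RPO {tier.rpo_hours}h")
```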

Next, once you have a DR plan in place and have begun to implement it, test it. Test it from the ground up. Usually, your DR environment can be isolated from your production environment. Test regular business operations with a few key stakeholders. Your tests will often reveal gaps or needed changes to the order of operations in your recovery process. Too many organizations neglect this step: they replicate data, and if replication health appears green, they call it sufficient. Then, when disaster recovery is actually needed, the procedure either takes much longer than estimated (significantly impacting the RTO), or it turns out that the recovery model was invalid from the outset and you don’t have a disaster recovery solution at all. Waiting until there is a disaster to test your plan is never a good idea. After implementing, test and validate. After making changes, test and validate.

Another common mistake is failing to update the DR plan or the DR components. Both need to be kept up to date. If the DR agent(s) on your system(s) fall behind, you may miss your RPO in replication, or replication may break entirely. Even worse is a DR incident where you discover you have an up-to-date instance of your old ERP system but nothing for the new one you cut over to last quarter. Add systems to or remove them from your plan as needed. Make it part of the change checklist. Add the changes to the documentation.
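
A small recurring check can catch a lagging replica before it becomes a surprise during an incident. The sketch below compares the last successful replication timestamp (however your DR tooling exposes it) against an assumed one-hour RPO and flags a breach.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

RPO = timedelta(hours=1)  # assumption: one-hour recovery point objective

def replication_within_rpo(last_replicated_at: datetime,
                           now: Optional[datetime] = None) -> bool:
    """Return True if the replica is within the RPO window."""
    now = now or datetime.now(timezone.utc)
    return now - last_replicated_at <= RPO

# Example: a replica that last synced three hours ago misses a one-hour RPO.
stale = datetime.now(timezone.utc) - timedelta(hours=3)
if not replication_within_rpo(stale):
    print("ALERT: replication lag exceeds the 1-hour RPO")
```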

Next – automate where possible. People assume that DR testing has to take a large amount of time. That may be true of the first-time setup, documentation and validation, but once you have the pieces in place and in the right order, testing tools can automate much of the testing and validation process for you. As with patch testing, automated testing can notify you of issues with your DR setup and, configured properly, can perform regular end-to-end testing of your DR process. Occasional manual verification is still highly recommended.
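
For example, a scripted smoke test run against the isolated DR environment after an automated failover can confirm that the key endpoints actually answer. The hosts and ports below are hypothetical, and a real test would also exercise logins and application transactions.

```python
import socket

# Hypothetical DR endpoints to probe after a test failover.
DR_ENDPOINTS = {
    "dr-sql.example.local": 1433,
    "dr-app.example.local": 443,
    "dr-files.example.local": 445,
}

def endpoint_up(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

results = {host: endpoint_up(host, port) for host, port in DR_ENDPOINTS.items()}
failed = [host for host, ok in results.items() if not ok]
print("DR smoke test:", "PASS" if not failed else f"FAIL ({', '.join(failed)})")
```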

Finally, backups. Backups are not DR by themselves, but they form a vital part of any DR strategy. In the case of ransomware, it may be necessary to go to backup to restore systems and data to a known-good state. In the case of an update that causes issues with major systems but managed to sneak past your testing and approval process, restoring an image or system snapshot may be the fastest route to recovering certain systems. In the case of data corruption or administrative error in a cloud productivity suite, the quickest (and sometimes only) way to recover deleted emails or other items is a valid SaaS backup strategy. That’s right: if you delete a vital email from the command line or compliance console, Microsoft usually cannot help you recover it in a timely manner. Back up all your important systems. Test the backups from time to time and make sure they work.
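
A periodic restore test can be as simple as restoring a known file to a scratch location and confirming it matches the source of truth. In the sketch below, the restore step itself is assumed to have been performed by your backup tool; the paths and expected hash are placeholders.

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute the SHA-256 checksum of a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_verified(restored_file: Path, expected_sha256: str) -> bool:
    """Confirm the restored file exists and matches the expected checksum."""
    return restored_file.exists() and sha256(restored_file) == expected_sha256

# Example usage with hypothetical values:
# ok = restore_verified(Path("/restore-test/contract.pdf"), "e3b0c442...")
```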

For guidance on patching, disaster recovery or backup strategies, feel free to contact a solutions architect.
