Infrastructure Insight: Resilience is Uptime’s Secret Sauce
This session reveals today’s tools, technologies, and methods to keep operations online and prevent the next crisis at your enterprise.
November 18, 2024
Building a resilient organization can be the difference between life and death when it comes to business continuity and uptime. When starting your journey toward resilience, you'll want to take a multi-pronged approach that combines policies, processes, people, and technology to achieve your goals.
In this archived keynote session, Alapan Arnab, vCISO and cybersecurity and resilience consultant at Apedemak Consulting, explores methods to keep operations online in the face of any challenge.
This segment was part of our live virtual event titled "A Handbook for Infrastructure Security & Resiliency." The event was presented by Network Computing and Data Center Knowledge on November 7, 2024.
A transcript of the video follows below. Minor edits have been made for clarity.
Alapan Arnab: Moving on to the other side, what happens when you have an incident? Incident response is really a collection of discrete events that come together in how you do the overall recovery. To systematically improve your time to recovery, you need to have all these elements in place and fine-tune each of them to your organization's requirements.
Starting at the left-hand side with the incident itself, which is detection, you could look at things like observability tooling. You could also look at logs and event correlation, because you may end up with multiple types of observability tools that give you different levels of information.
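As a rough illustration of what event correlation does, the minimal Python sketch below (class and field names are hypothetical, not tied to any particular tool) groups events from different observability sources into one candidate incident when they share a service and fall within a short time window:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Event:
    source: str        # e.g., "metrics", "logs", or "synthetic-check" (hypothetical labels)
    service: str       # the service the event relates to
    timestamp: datetime
    message: str

def correlate(events: list[Event], window: timedelta = timedelta(minutes=5)) -> list[list[Event]]:
    """Group events for the same service that occur within `window` of each other.

    A real correlation engine would use richer keys (hosts, trace IDs, topology),
    but the principle is the same: many raw signals, one candidate incident.
    """
    groups: list[list[Event]] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        for group in groups:
            if group[-1].service == event.service and event.timestamp - group[-1].timestamp <= window:
                group.append(event)
                break
        else:
            groups.append([event])
    return groups
```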
Linked to the tooling around detection is alerting. It's one thing to know something has gone wrong through an observability tool, but it's another thing for the teams that need to react to be aware of it. Alerts come in through your teams' messaging and email, but also through phone calls and text messages. There are tools out there that will do automatic page-outs.
There are also tools out there to manage the teams around your recovery, including people who are off on holiday or otherwise away, and people who work shifts. How do you manage all these pieces in the broader organization?
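A minimal sketch of the kind of routing such tools apply, assuming a hypothetical rotation where availability already reflects holidays and shifts, is to page the first available responder and escalate when nobody is free:

```python
from dataclasses import dataclass

@dataclass
class Responder:
    name: str
    contact: str       # phone number or pager address
    available: bool    # False when on holiday, off shift, or otherwise away

def choose_pagee(rotation: list[Responder]) -> Responder | None:
    """Return the first available responder in the on-call rotation, or None to escalate."""
    for responder in rotation:
        if responder.available:
            return responder
    return None  # nobody available: escalate to a manager or a secondary rotation

# Hypothetical rotation for illustration only.
rotation = [
    Responder("on-call primary", "+10000000001", available=False),   # on holiday
    Responder("on-call secondary", "+10000000002", available=True),
]
pagee = choose_pagee(rotation)
print(f"Page {pagee.name} at {pagee.contact}" if pagee else "Escalate: no responder available")
```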
Once you have been alerted, the next step is to assemble your recovery team. This is where your incident processes and recovery playbooks become the focus, to ensure the team being assembled knows their roles and responsibilities. They should know how to begin investigating the cause of the disruption, which requires training and skills in the recovery team.
Part of that is knowing the environment, and documentation obviously helps. So does being able to read the logs that come from your log management and knowing what common issues have plagued the technical environments. Change records matter too, because in many cases incidents arise from a change.
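As a small illustrative sketch (the change-record fields below are hypothetical), flagging changes deployed shortly before the incident started gives the recovery team its first suspects:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeRecord:
    change_id: str
    service: str
    deployed_at: datetime
    summary: str

def suspect_changes(changes: list[ChangeRecord], incident_start: datetime,
                    lookback: timedelta = timedelta(hours=24)) -> list[ChangeRecord]:
    """Return changes deployed within `lookback` of the incident start, most recent first."""
    recent = [c for c in changes if incident_start - lookback <= c.deployed_at <= incident_start]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)
```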
After investigation comes the fix. One part of the fix could be isolation. You could follow the recovery instructions from your recovery playbook and look at automating parts of the recovery. This part of the recovery could also leverage environments such as your disaster recovery environments.
You can potentially isolate the problem, recover to your disaster recovery environment, and then continue the fix. The service is back up, and what remains is a lower-priority incident.
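A loose sketch of how such an automated recovery step might be strung together follows; every function here is a stand-in for whatever isolation, failover, and health-check tooling an organization actually uses, and the service name is made up:

```python
def isolate(service: str) -> None:
    print(f"Isolating {service} in the primary environment")                  # placeholder action

def failover_to_dr(service: str) -> None:
    print(f"Failing {service} over to the disaster recovery environment")     # placeholder action

def verify_health(service: str) -> bool:
    print(f"Checking health of {service} after failover")                     # placeholder check
    return True

def recover_via_dr(service: str) -> None:
    """Isolate the problem, fail over to DR, then continue the fix as a lower-priority incident."""
    isolate(service)
    failover_to_dr(service)
    if verify_health(service):
        print(f"{service} is back up; continue the underlying fix at lower priority")
    else:
        raise RuntimeError(f"{service} is still unhealthy after failover; keep the incident open")

recover_via_dr("payments-api")  # hypothetical service name
```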
Next is validation. I'll give you a good example of validation that I've had a lot of experience with. Let's say you bring the service back, but the service itself has other elements that have not been recovered. Having automated testing helps you validate that the full chain of services is running.
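A minimal sketch of that kind of automated validation, assuming hypothetical health-check URLs for each element in the chain, could look like this:

```python
import urllib.request

# Hypothetical health endpoints for every element the recovered service depends on.
SERVICE_CHAIN = {
    "frontend": "https://frontend.example.internal/health",
    "api": "https://api.example.internal/health",
    "database-proxy": "https://db-proxy.example.internal/health",
}

def validate_chain(chain: dict[str, str], timeout: float = 5.0) -> list[str]:
    """Return the names of components that fail their health check."""
    failed = []
    for name, url in chain.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status != 200:
                    failed.append(name)
        except OSError:
            failed.append(name)
    return failed

unrecovered = validate_chain(SERVICE_CHAIN)
if unrecovered:
    print(f"Recovery incomplete; these elements still need attention: {unrecovered}")
else:
    print("Full service chain validated")
```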
The last piece of the recovery is to adapt and learn from the disruption through post-mortems. This allows you to really understand the root cause of failure, which is a key element. One of the key things to highlight is that there can be more than one root cause; it is not necessarily a single item, because there can be multiple contributing issues.
You should keep asking why this happened, multiple times over, which can really help you get to the root cause. The reasons can be intentional, such as cyber attacks, or they can be control failures, such as errors, design issues, process failures, and even accidents. Trying to understand why gives you a much clearer answer on all the contributing factors.
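As a purely illustrative, invented example, a post-mortem record might capture the chain of answers to "why did this happen?" along with every contributing cause, rather than forcing everything into a single root cause:

```python
from dataclasses import dataclass, field

@dataclass
class ContributingCause:
    description: str
    category: str   # e.g., "intent" (cyber attack) or "control failure" (error, design, process, accident)

@dataclass
class PostMortem:
    incident_id: str
    why_chain: list[str] = field(default_factory=list)        # each answer to "why did this happen?"
    causes: list[ContributingCause] = field(default_factory=list)

# Invented example data for illustration only.
pm = PostMortem("INC-0001")
pm.why_chain = [
    "The service ran out of database connections",
    "A recent change removed the connection pool limit",
    "The change review did not include a load test",
]
pm.causes = [
    ContributingCause("Misconfigured connection pool", "control failure"),
    ContributingCause("Review process skipped load testing", "control failure"),
]
```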
Remediation is something to implement once you have recovered, so that you have a longer-term fix. It's also important to note that remediation could be required for many other systems in the organization.
So, you may have had a failure in one environment, but that same remediation could be required in multiple places.