In the last article, we talked about checklists: how they can be used objectively and subjectively in risk management, and how they cover the critical items that, if skipped or forgotten, could spell disaster. We illustrated that checklists are not about ticking boxes; they can be used to build security culture, teamwork, and discipline. We also started to discuss how some checklists need to be practiced.
In small multi-engine aircraft, the engines are typically mounted on the wings. This creates a unique characteristic: if an engine fails, the aircraft yaws, or turns, toward the failed engine, because the working engine on the other wing is still pulling forward while the failed engine only adds drag. The pilot counters that yaw with the rudder, but if you exceed the “critical angle of attack” on a control surface, such as the vertical stabilizer and rudder, that control surface will “stall” and no longer provide control. The speed below which directional control can no longer be maintained with one engine inoperative is called the minimum control speed (Vmc). Below it, the aircraft can lose directional control and enter a spin, which is almost never recoverable. The most critical time for this to happen is in a high-power, low-speed scenario, such as the initial climb-out from an airport. In training, there is a procedure and checklist that you follow, but because of the critical nature and timing required, you drill this procedure until it becomes automatic. This is called “the drill”.
There are other procedures and checklists that do not require that level of muscle-memory drilling, but are still important. The regulations require that a pilot have performed three take-offs and landings within the preceding 90 days before carrying passengers. If you fly a tailwheel aircraft, as opposed to one with tricycle gear, those landings must be to a full stop. There is a similar requirement for night flights: before you can legally carry passengers at night, you must have performed three take-offs and landings to a full stop, during the period from one hour after sunset to one hour before sunrise, within the preceding 90 days.
In aviation, these are “currency requirements”, the minimum necessary to be legal for that particular flying activity. But in aviation training, there is a difference between currency and proficiency. If you only fly once every three months, and it's been three months since you flew, would you feel comfortable taking passengers? If you were a passenger, would you feel comfortable riding in an airplane knowing that the pilot is only performing the minimum currency requirements?
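The currency rule is mechanical enough to check in a few lines, which is part of the point: it measures recency, not skill. The following is a minimal sketch, assuming a hypothetical logbook represented as a plain list of dates, one entry per logged take-off and landing. The function name and parameters are illustrative, and the sketch ignores the other conditions the actual regulations attach (aircraft category and class, full-stop and night requirements).

```python
from datetime import date, timedelta


def is_passenger_current(landing_dates, today, required=3, window_days=90):
    """Return True if at least `required` logged take-off/landing entries
    fall within the `window_days` days preceding `today` (inclusive).

    `landing_dates` is a hypothetical logbook: one date per logged
    take-off and landing. This is a simplification for illustration only.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [d for d in landing_dates if cutoff <= d <= today]
    return len(recent) >= required


if __name__ == "__main__":
    today = date(2024, 6, 30)
    logbook = [date(2024, 1, 5), date(2024, 5, 1), date(2024, 6, 1)]
    # Only two entries fall inside the 90-day window, so this pilot is not current.
    print(is_passenger_current(logbook, today))
    # One more circuit makes the pilot legal again; it says nothing about proficiency.
    print(is_passenger_current(logbook + [date(2024, 6, 15)], today))
```

A check like this can only ever answer "am I legal?". Whether you are ready for an engine failure on climb-out is a question no date arithmetic can settle.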
Cybersecurity professionals have long stressed the importance of testing procedures. For example, IT teams have been performing data backups for a long time. But do they test whether they can restore a backup when it is needed? Do they test often enough? IT or Operations teams are at times reluctant to run drills such as restoring the backups they created, or working through a disaster recovery or business continuity exercise. Don’t get me wrong: there are organizations at the forefront of ensuring the reliability and integrity of their operations. For instance, Netflix released the code behind “Chaos Monkey” and the “Simian Army”, a suite of tools designed to test the reliability, security, and resilience of their infrastructure.
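A restore drill does not have to be elaborate to be useful. The sketch below is one way to do it, assuming a simple directory backed up to a gzip-compressed tar archive: record a checksum manifest at backup time, then restore the archive into a scratch directory and compare what comes back against the manifest. The function names (`backup`, `restore_drill`) and the manifest format are hypothetical, chosen for illustration, and not any particular backup product's interface.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source_dir, archive_path):
    """Archive source_dir and return the manifest recorded at backup time."""
    manifest = checksums(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    return manifest


def restore_drill(archive_path, manifest):
    """Restore into a scratch directory and return the sorted relative
    paths that are missing, unexpected, or differ from the manifest.
    An empty list means the restore matched."""
    with tempfile.TemporaryDirectory() as scratch:
        # Only extract archives you created yourself; untrusted tar files
        # can contain unsafe member paths.
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(scratch)
        restored = checksums(scratch)
    all_paths = manifest.keys() | restored.keys()
    return sorted(p for p in all_paths if manifest.get(p) != restored.get(p))
```

Running `restore_drill` on a schedule, and treating a non-empty result as an incident, turns "we have backups" into "we have restored from our backups recently".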
While working for a company in the mid-2000s, my team and I created a disaster recovery (DR) plan to fail over from our facility in California to a facility in Georgia. To test the plan, we did not run scripts. Instead, during a planned maintenance window, we walked into the server room and pulled a live blade server out of its chassis while the service was still running, to see whether failover worked. After we restored the server to a running and stable condition, we performed a post-mortem in which we documented what went well, what needed improvement, and other lessons learned. Any time a major component of the infrastructure changed, we would run these drills again to ensure that the assumptions behind our plan were still valid and that our business could continue.
An uninterruptible power supply (UPS), or battery backup, is designed to keep the power on in the event of a power disruption. I have personally seen IT teams expect no interruption, only to find during an actual power outage that the UPS was not up to the task. I’m reminded of one incident where the power dipped for a fraction of a second, and the UPS had never been tested. The computer room floor went completely offline for a couple of minutes, and some of the servers required manual intervention to come back online.
There are times when you test a process or procedure and everything goes wrong. In another UPS example, a colleague of mine tells a story about a company that was performing a test of its UPS system. When they cut the incoming power, the system failed and services went offline anyway. That part of the story can be chalked up to “bad luck”. But the company was not ready for the failure, and it took them several days to bring everything back online.
And that brings me to drilling your incident response process and procedures. Do you know what to do when things go wrong? Do you know who to call, and how to get hold of them? Have you tried it? Are you running the drills until you are comfortable that the processes work? When they don’t work (or even when they do), do you reflect on what happened? What went right, and why? What went wrong, and why? Can the process be improved? These retrospective, or post-mortem, techniques are how the process gets better.
We don't rise to the level of our expectations, we fall to the level of our training.
― Archilochus
My point here is that being current or compliant does not mean you are proficient. Don’t do only the minimum needed to meet a compliance requirement, standard, or regulation. Drill the procedures until you’re comfortable and confident that, in an emergency, you will be able to use those skills. If you listen to air traffic control recordings from emergencies such as Southwest Flight 1380 or US Airways Flight 1549, the pilots have a calm, cool demeanor even though they are in the middle of an emergency. That is because they have practiced emergencies so many times, and have their checklists to fall back on, that it sounds like just another day at the office.
In my next article, we will discuss the differences between aviation and cybersecurity in training and requirements after an incident, such as a crash, a system failure, or a pattern of issues, e.g. “Controlled Flight Into Terrain”.