Business is booming for air safety investigators around the world. Following a period of relative calm in 2017, the safest year on record for passenger airlines, there has been a steady uptick in both accidents and fatalities. In fact, according to statistics collected by the Aviation Safety Network, the number of accidents since the end of 2017 is now above the five-year average. Two Boeing 737-8 Max accidents since October 2018 have not helped; collectively, these events have accounted for the loss of 346 lives.
So, why do airplanes crash? There are some usual suspects such as “gravity beats lift” or “drag defeats thrust,” but to really determine cause, investigators subscribe to an accident-causation model. Personally, I like James Reason’s Swiss cheese model of accident causation since it is a useful tool to explain very complex events.
However, the first step is to view each event with a wide lens and understand, as aviation safety researcher Sidney Dekker suggests, that “accidents are not accidents at all, but a failure in risk management.” To become even more open-minded, think of them as a “failure in imagination,” which is how the 9/11 Commission report described the deep institutional failures associated with the 2001 terrorist attacks.
Reason’s Swiss cheese model gained popularity because “it illustrates that although many layers of defense lie between hazards and accidents, there are flaws in each layer that, if aligned, can allow the accident to occur.” By taking this approach, Reason’s model explores both active and latent failures and the four failure domains: organizational influences, supervision, preconditions, and specific acts. This model is a good way to look deeper into the human, technological, or organizational aspects of an accident.
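To put rough numbers on the idea of holes lining up, here is a minimal sketch in Python. It assumes the four layers fail independently of one another, and the per-layer probabilities are hypothetical values chosen only for illustration; neither the independence assumption nor the numbers come from Reason's work or from any accident data.

```python
from math import prod


def accident_probability(hole_probabilities):
    """Chance that every defensive layer fails at once.

    Assumes the layers fail independently, which is a simplification.
    """
    return prod(hole_probabilities)


# Hypothetical chance of a "hole" in each of the four failure domains.
layers = {
    "organizational influences": 0.10,
    "supervision": 0.05,
    "preconditions": 0.02,
    "specific acts": 0.01,
}

# A latent failure acts like a hole that is already open in one layer.
weakened = dict(layers, supervision=1.0)

print(f"all layers intact:       {accident_probability(layers.values()):.1e}")
print(f"supervision layer open:  {accident_probability(weakened.values()):.1e}")
```

With these made-up numbers, the chance of all four holes aligning is about one in a million. Leave a single layer permanently open, as a latent failure does, and the same arithmetic gives a figure 20 times higher, without anyone on the flight deck having done anything differently.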
Focusing on the information released to date from the two 737-8 Max accidents, let’s explore some of the latent and active failures. Latent failures are those that lie dormant for weeks, months, or even years. These failures are waiting for an opportunity. Active failures involve unsafe acts that can be directly linked to an accident.
Understand that this exercise is meant to demonstrate the complexities of determining cause. It is not intended to speculate on the actual cause of either accident, which will come out in the final reports from the respective investigative bodies.
The 737 Max accidents are fraught with latent failures. Organizational influences involve the manufacturer, the regulator, the operator, or in some cases a combination of two or more. As an example, the requirement to achieve a common type-rating is driven by an airline’s desire to cut training costs. Aircraft manufacturers (all of them) want to sell airplanes, so to satisfy the needs of the customer the 737 Max has a common type-rating and requires minimal differences training, delivered by video or bulletin, not in the simulator.
And as aircraft become more complex and automated, the philosophy from OEMs has shifted over the years to provide less-detailed information in training materials. As an example, during my Boeing 727 training, the systems course would “build” each system. In contrast, during my 747-400 training, the systems portion was more related to “operating” each system.
Other organizational influences identified focus on the regulator. In this case, the FAA’s Organization Designation Authorization (ODA) program has been harshly criticized by lawmakers. During a Senate aviation subcommittee hearing in March, Senator Richard Blumenthal questioned “the system that led to outsourcing safety” to the manufacturers and added, “The fact is that the FAA decided to do safety on the cheap and put the fox in charge of the hen house.” Not exactly, and there is a bit of irony in the criticism.
The origins of this “outsourcing” lie in past FAA reauthorizations passed by Congress, which required an expansion of the ODA program due to a lack of appropriated funding. Acting FAA Administrator Dan Elwell, in defense of the program, stated that to cover all the functions of ODA, the agency would have to add 10,000 employees at a cost of $1.8 billion.
Another latent failure identified in the Ethiopian Airlines accident was a low-time first officer flying a complex aircraft; this would be classified as unsafe supervision. Even though the first officer was current and qualified to fly in the Ethiopian “system,” 200 hours of total flight time is not enough, especially when things go wrong.
The cognitive skills, crew interactions, and situational awareness required to handle complex emergencies are developed over time. In the U.S., the danger of a crew lacking them was highlighted by the Colgan Air Flight 3407 accident, after which the ATP/1,500-hour rule was enacted to protect the traveling public.
In the case of the 737 Max accidents, much has been written about the maneuvering characteristics augmentation system (MCAS)—the system that “misfired” during each event. It’s intentional that MCAS has not been mentioned until now.
MCAS version 1.0 with its single point of failure (one bad AOA sensor input) is considered another latent failure—all it would take to become active is a failed or bad sensor to make the system go haywire. In retrospect, it doesn’t take a lot of imagination to see how the design of this system could go bad.
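The single-point-of-failure problem can be shown with a short Python sketch. This is not Boeing's logic: the function names, the activation threshold, and the disagreement limit are all hypothetical, chosen only to contrast a design that trusts one sensor with one that cross-checks two.

```python
def single_sensor_trigger(aoa_deg, threshold_deg=15.0):
    """Activate on one angle-of-attack reading alone (hypothetical threshold)."""
    return aoa_deg > threshold_deg


def cross_checked_trigger(aoa_left_deg, aoa_right_deg,
                          threshold_deg=15.0, max_disagreement_deg=5.0):
    """Activate only when two sensors agree and both show a high angle of attack.

    Both limits are illustrative values, not figures from any real system.
    """
    if abs(aoa_left_deg - aoa_right_deg) > max_disagreement_deg:
        # The sensors disagree, so neither reading is trusted.
        return False
    return min(aoa_left_deg, aoa_right_deg) > threshold_deg


# Hypothetical scenario: the left sensor has failed and reads far too high,
# while the right sensor shows a normal angle of attack.
failed_left, healthy_right = 40.0, 5.0

print(single_sensor_trigger(failed_left))                 # acts on the bad reading
print(cross_checked_trigger(failed_left, healthy_right))  # inhibited by disagreement
```

In the first design, one bad reading is enough to turn a dormant flaw into an active failure. In the second, the disagreement between the sensors is itself treated as the warning sign, and the system stays out of the way.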
As described, these latent failures—most with strong organizational influences and many with economic ties—were brewing in the background for decades. All it took was an active failure of a poorly designed system to start a chain of events that would find each hole in the Swiss cheese model. MCAS and the 737 Max simply exposed several latent failures that were—and still are—present in the system.