
Learning how to predict rare kinds of failures
Researchers at MIT have developed a computational system designed to predict and diagnose rare but catastrophic system failures. The work draws on the highly publicized Southwest Airlines crisis of December 2022, which began with severe winter weather in Denver, left more than 2 million passengers stranded, and cost the airline $750 million. The MIT team used this event as a case study to pinpoint the underlying causes of such widespread, domino-effect breakdowns, with the aim of preventing future occurrences.
The research, led by MIT doctoral student Charles Dawson and Professor of Aeronautics and Astronautics Chuchu Fan, with colleagues from Harvard University and the University of Michigan, was presented at the International Conference on Learning Representations (ICLR) in Singapore. It addresses a familiar frustration: a complex system that normally runs smoothly suddenly breaks down for reasons hidden in its internal workings.
Fan’s lab had previously built models for predicting hypothetical failures, and this project set out to turn that theoretical work into a practical diagnostic tool. “The goal of this project was really to turn that into a diagnostic tool that we could use on real-world systems,” Fan explains. Users supply data from a past problem or failure, and the model works backward to uncover the root causes.
This methodology is designed for a broad class of “cyber-physical problems,” situations where automated decision-making interacts with the unpredictable nature of the physical world. These include not just airline scheduling but also autonomous vehicles, robotic teams, and electric grids. Dawson notes that software decisions that seem benign at first can trigger a cascade of unforeseen consequences once they meet physical reality.
A significant challenge in the Southwest analysis was the proprietary nature of airline scheduling systems. Unlike robotics, where physical models are often accessible, the researchers had to infer the logic behind decisions using only sparse public data, such as actual arrival and departure times. They meticulously analyzed how a localized weather event in Denver escalated into a nationwide disruption, uncovering a critical link: the deployment and availability of reserve aircraft.
Southwest’s unique operational model, which relies on a single type of aircraft and a scattered reserve system rather than a traditional hub-and-spoke model, played a pivotal role. While Denver’s reserves rapidly depleted, the MIT model revealed a less obvious but more impactful finding: the failure cascaded to seemingly unaffected areas like Las Vegas. The data showed a steady decline in available aircraft in Las Vegas, traced back to interrupted circulation patterns where planes from affected areas typically ended their cycles there. This domino effect ultimately forced Southwest into a drastic “hard reset,” canceling all flights to rebalance their network.
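The circulation effect described above can be illustrated with a toy sketch. This is not the researchers' model; the function name, the fleet sizes, and the inflow numbers are all hypothetical, chosen only to show how an airport with clear weather can run out of aircraft when arrivals from a storm-hit region dry up.

```python
def simulate_aircraft_on_hand(initial, inflow_per_step, scheduled_departures):
    """Track aircraft on hand at one airport over successive time steps.

    Hypothetical toy model: at each step, arriving aircraft are added,
    then as many scheduled departures as possible are flown.
    """
    levels = []
    on_hand = initial
    for inflow, scheduled in zip(inflow_per_step, scheduled_departures):
        on_hand += inflow
        departures = min(scheduled, on_hand)  # cannot fly planes that are not there
        on_hand -= departures
        levels.append(on_hand)
    return levels


# Illustrative numbers only, not taken from the study.
steps = 6
scheduled = [4] * steps
normal_inflow = [4] * steps             # arrivals match departures
disrupted_inflow = [4, 3, 2, 1, 1, 0]   # upstream cancellations cut arrivals

normal_levels = simulate_aircraft_on_hand(5, normal_inflow, scheduled)
disrupted_levels = simulate_aircraft_on_hand(5, disrupted_inflow, scheduled)

print("normal:   ", normal_levels)
print("disrupted:", disrupted_levels)
```

In the disrupted run the airport's own schedule never changes, yet its aircraft count falls steadily to zero, which is the pattern the article reports for Las Vegas.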
By training on extensive data from normal flight operations, the computational model learned the “realm of physical possibility,” which lets it infer the most probable explanation for a failure even when data from the failure itself is sparse. The same approach could underpin real-time monitoring systems that continuously compare current data with normal operations, flag trends toward extreme events, and enable preemptive measures such as strategic redeployment of reserve aircraft.
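The idea of comparing current data against normal operations can be sketched with a much simpler stand-in than the researchers' learned model. The example below assumes a plain mean-and-standard-deviation baseline and a rolling-window z-score. The function names, window size, threshold, and aircraft counts are all hypothetical.

```python
from statistics import mean, stdev


def fit_baseline(normal_counts):
    """Summarize normal operations by their mean and standard deviation."""
    return mean(normal_counts), stdev(normal_counts)


def first_alert(stream, baseline, window=3, z_threshold=-2.0):
    """Return the index of the first reading whose rolling mean falls
    below the baseline by more than |z_threshold| standard deviations,
    or None if that never happens.
    """
    mu, sigma = baseline
    for i in range(window - 1, len(stream)):
        recent = stream[i - window + 1 : i + 1]
        z = (mean(recent) - mu) / sigma
        if z < z_threshold:
            return i
    return None


# Illustrative counts of available reserve aircraft, not real data.
normal = [20, 22, 19, 21, 20, 23, 21, 20, 22, 21]
declining = [21, 20, 19, 17, 15, 13, 11, 9]

baseline = fit_baseline(normal)
alert_index = first_alert(declining, baseline)
print("first alert at reading", alert_index)
```

Averaging over a short window keeps a single noisy reading from raising an alert, at the cost of reacting a step or two later. A real monitor would need a far richer model of normal behavior than one mean and one variance.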
The team has released its open-source tool, CalNF, to the public. Dawson is now extending the research to failures in power networks. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.



