MIT Researchers Develop AI to Predict Rare System Failures Like the Southwest Airlines Meltdown

In December 2022, Southwest Airlines suffered a system-wide breakdown triggered by severe winter weather in Denver, stranding more than 2 million passengers and causing $750 million in losses. The incident prompted researchers at MIT to investigate how seemingly localized events can cause widespread failures in complex systems.

Now, a team from MIT, in collaboration with Harvard University and the University of Michigan, has developed a computational system that combines sparse data from rare failure events with extensive data from normal operations. This system aims to pinpoint the root causes of failures and identify ways to prevent similar incidents in the future.

MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and their colleagues presented the findings at the International Conference on Learning Representations (ICLR), held in Singapore from April 24 to 28. Their paper details the approach to diagnosing complex system failures.

“The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating these issues or failures that we’re observing,” says Dawson.

Fan explains that this new work builds upon previous research from her lab, which focused on predicting hypothetical failure scenarios in systems like robot teams or the power grid. The goal of this project was to create a diagnostic tool applicable to real-world systems.

Dawson adds, “The idea was to provide a way that someone could give us data from a time when this real-world system had an issue or a failure, and we can try to diagnose the root causes, and provide a little bit of a look behind the curtain at this complexity.”

The developed methods are intended for a broad class of cyber-physical problems, where automated decision-making interacts with the real world. While tools exist for testing standalone software systems, challenges arise when software interacts with physical entities, such as scheduling aircraft, managing autonomous vehicles, or controlling electric grids. In these systems, a seemingly sound software decision can trigger a cascade of negative consequences.

One significant difference between systems like robot teams and airline scheduling is the availability of models. Fan notes that robotics benefits from a good understanding of the physics involved, allowing for the creation of accurate models. However, airline scheduling involves proprietary business information, requiring researchers to infer underlying decisions from sparse, publicly available data, such as flight arrival and departure times.

Fan highlights that the amount of data related to actual failures is limited compared to the extensive data on normal flight operations. The weather events in Denver during Southwest’s crisis were evident in the flight data, particularly in the longer turnaround times at the Denver airport. However, the cascading effects throughout the system required deeper analysis.

A key factor in the crisis was the deployment of reserve aircraft. Airlines typically keep reserve planes at various airports so they can substitute for aircraft that develop problems. Southwest uses only one type of plane and, unlike most airlines, which operate a hub-and-spoke system, distributes its reserve planes throughout its network. The researchers found that the way these planes were deployed played a crucial role in the crisis.

Dawson explains that the lack of public data on the location of aircraft throughout the Southwest network posed a challenge. “What we’re able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can use our method to back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing.”

The research revealed that the deployment of reserves was a leading indicator of the nationwide crisis. While some areas directly affected by the weather recovered quickly, others, lacking available reserves, continued to deteriorate. For example, Denver’s reserves rapidly dwindled due to weather delays, and the method traced the failure from Denver to Las Vegas, where the number of aircraft serving flights steadily declined despite the absence of severe weather.

Dawson notes that the Southwest network involves circulations of aircraft, where a plane might start in California, fly to Denver, and end in Las Vegas. The Denver storm interrupted this cycle, so reserves ran down even in locations like Las Vegas that the weather never touched.

Ultimately, Southwest had to perform a “hard reset” of its system, canceling all flights and flying empty aircraft to rebalance reserves.

The researchers developed a model of the scheduling system’s intended operation and then used their method to run the model backward, analyzing observed outcomes to determine the initial conditions that could have produced those outcomes.

Dawson emphasizes that the extensive data on typical operations helped teach the computational model “what is feasible, what is possible, what’s the realm of physical possibility here,” enabling it to identify the most likely explanation for the failure in the extreme event.
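The idea of running a model backward can be illustrated with a deliberately tiny sketch. This is not the researchers' method or the CalNF tool. It is a toy written for this article that assumes a single airport, a hypothetical forward model in which reserve planes absorb missing aircraft until they run out, and a known per-period shortfall. The function names `simulate_cancellations` and `infer_initial_reserves` are invented for illustration. The sketch simply tries each candidate value of the hidden parameter (the starting number of reserves) and keeps the one whose simulated cancellations best match the observed ones.

```python
def simulate_cancellations(initial_reserves, shortfalls):
    """Toy forward model (an assumption, not the paper's model).

    Each period some aircraft fail to arrive (the shortfall). Reserve planes
    cover the gap until they run out; any uncovered gap becomes cancellations.
    """
    reserves = initial_reserves
    cancellations = []
    for missing in shortfalls:
        covered = min(reserves, missing)
        reserves -= covered
        cancellations.append(missing - covered)
    return cancellations


def infer_initial_reserves(observed, shortfalls, candidates):
    """Run the model 'backward' by brute force: pick the hidden starting
    reserve count whose simulated cancellations are closest to the observed
    ones (smallest sum of squared differences)."""
    def squared_error(candidate):
        predicted = simulate_cancellations(candidate, shortfalls)
        return sum((p - o) ** 2 for p, o in zip(predicted, observed))

    return min(candidates, key=squared_error)


# Hypothetical numbers: aircraft missing in each period, and the
# cancellations that were publicly observed in those same periods.
shortfalls = [2, 3, 1, 4]
observed_cancellations = [0, 1, 1, 4]

estimate = infer_initial_reserves(observed_cancellations, shortfalls, range(0, 11))
print(estimate)  # 4
```

A real diagnosis has to cope with many airports, noisy data, and far more hidden parameters than an exhaustive search can cover, which is why the researchers rely on a learned model trained on extensive normal-operations data rather than a search like this one.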

This could lead to a real-time monitoring system that constantly compares normal and current data to detect trends and predict extreme events, allowing for preemptive measures like redeploying reserve aircraft to anticipated problem areas.
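As a rough illustration of what comparing normal and current data could look like, the sketch below flags readings that sit far outside a baseline built from normal operations. It is a generic z-score check, not the monitoring system the researchers envision. The turnaround times, the threshold of 3 standard deviations, and the function name `flag_unusual_readings` are all assumptions made for this example.

```python
from statistics import mean, stdev


def flag_unusual_readings(baseline, current, threshold=3.0):
    """Return the indices of current readings that lie more than `threshold`
    sample standard deviations from the mean of the baseline readings."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation, so z-scores are undefined")
    return [i for i, value in enumerate(current)
            if abs(value - mu) / sigma > threshold]


# Hypothetical turnaround times in minutes at one airport.
normal_turnarounds = [40, 42, 38, 41, 39, 40, 43, 37]  # mean 40, std dev 2
todays_turnarounds = [41, 44, 55, 39, 70]

print(flag_unusual_readings(normal_turnarounds, todays_turnarounds))  # [2, 4]
```

A check like this only notices that something is off at one location. Tracing the disruption to hidden causes elsewhere in the network, such as dwindling reserves, is the harder problem the research addresses.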

Fan’s lab is continuing to develop such systems and has released an open-source tool called CalNF for analyzing failure systems. Dawson is now applying these methods to understand failures in power networks.

The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.
