MIT Researchers Develop AI to Predict Rare System Failures

As more of daily life depends on complex systems, predicting and preventing their failures matters more than ever. Researchers at MIT have developed a computational system that combines sparse data from rare failure events with extensive data from normal operations to pinpoint the root causes of system breakdowns. The approach is meant to provide insights that help prevent future failures across a range of cyber-physical systems.

The research, presented at the International Conference on Learning Representations (ICLR) in Singapore, was led by MIT doctoral student Charles Dawson, professor of aeronautics and astronautics Chuchu Fan, and colleagues from Harvard University and the University of Michigan. Their work was motivated by the frustration of interacting with complex systems where the underlying causes of failures are often obscure.

“The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating these issues or failures that we’re observing,” says Dawson.

The new system builds on previous research from Fan’s lab, which focused on predicting hypothetical failures in systems like robot teams or the power grid. This project aimed to turn that predictive capability into a practical diagnostic tool. “The goal of this project,” Fan explains, “was really to turn that into a diagnostic tool that we could use on real-world systems.” Users feed the tool data from a system failure, and the model diagnoses the likely root causes, exposing what was happening behind the scenes.

The researchers intend for their method to be applicable to a broad range of cyber-physical problems, which involve automated decision-making interacting with real-world complexities. These systems, such as aircraft scheduling, autonomous vehicle movements, and electric grid control, often face unexpected domino effects from seemingly minor initial decisions.

One key challenge was addressing the proprietary nature of some system data. Unlike robotics, where accurate models can be created, airline scheduling relies on business information that is not publicly available. The researchers overcame this by using publicly available data, such as arrival and departure times, to infer the hidden parameters influencing the system’s behavior.

To demonstrate the effectiveness of their system, the researchers analyzed the Southwest Airlines scheduling crisis of December 2022, which stranded over 2 million passengers and cost the airline $750 million. By examining flight data, they found that the way reserve aircraft were deployed played a crucial role in the crisis and served as a “leading indicator” of the nationwide breakdown. While some areas directly affected by weather recovered quickly, others suffered from a lack of available reserve aircraft.

“What we’re able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can use our method to back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing,” Dawson says.

Ultimately, Southwest Airlines had to perform a “hard reset” of their system, canceling all flights and repositioning aircraft to rebalance their reserves. The researchers’ model works by running the scheduling system backward, using observed outcomes to determine the initial conditions that could have produced those outcomes.
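The idea of working backward from observed outcomes to hidden starting conditions can be illustrated with a toy example. The sketch below is not the researchers' model: the forward model, the function names `simulate_delays` and `infer_reserves`, and all numbers are hypothetical, and a brute-force search stands in for whatever inference technique the actual system uses. It simply tries each candidate reserve count and keeps the one whose simulated delays best match the observed ones.

```python
# Toy illustration only: a made-up forward model and a brute-force search.
# Nothing here reproduces the researchers' method or uses real airline data.

def simulate_delays(reserves, disruptions, base_delay=30.0):
    """Simulate delay (in minutes) per time step for a given starting reserve count.

    `disruptions` lists how many flights are disrupted at each step. Each
    disrupted flight is covered by a reserve aircraft while any remain;
    uncovered flights add delay, and half of the earlier delay carries over.
    """
    delays = []
    carried = 0.0
    for disrupted in disruptions:
        covered = min(reserves, disrupted)
        reserves -= covered
        uncovered = disrupted - covered
        carried = 0.5 * carried + base_delay * uncovered
        delays.append(carried)
    return delays


def infer_reserves(observed, disruptions, candidates=range(0, 21)):
    """Return the candidate reserve count whose simulated delays best fit `observed`."""
    def squared_error(reserves):
        simulated = simulate_delays(reserves, disruptions)
        return sum((s - o) ** 2 for s, o in zip(simulated, observed))

    return min(candidates, key=squared_error)


# Pretend the true (hidden) reserve count was 6 and only the delays are public.
disruptions = [2, 3, 4, 1, 0]
observed = simulate_delays(6, disruptions)

estimate = infer_reserves(observed, disruptions)
print("Estimated hidden reserve count:", estimate)
```

In this toy setting the search recovers the hidden value because only one reserve count reproduces the observed delays exactly. Real systems have many hidden parameters and noisy data, which is why a more sophisticated inference method is needed in practice.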

The team’s research may pave the way for a real-time monitoring system that compares normal operations data with current data to identify trends and predict potential extreme events. This could enable preemptive measures, such as redeploying reserve aircraft to areas anticipating problems.
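As a rough picture of what comparing current data against normal operations could look like, here is a minimal sketch that flags readings far outside a baseline. It is an assumption-laden simplification: the function name `flag_anomalies`, the z-score rule, the threshold of 3, and the sample numbers are all hypothetical and are not drawn from the research.

```python
from statistics import mean, stdev

# Hypothetical sketch: a simple z-score check, not the researchers' monitoring system.

def flag_anomalies(baseline, current, threshold=3.0):
    """Return indices of `current` readings that deviate strongly from `baseline`.

    A reading is flagged when it lies more than `threshold` sample standard
    deviations away from the baseline mean.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-scores are undefined")
    return [i for i, x in enumerate(current) if abs(x - mu) / sigma > threshold]


# Made-up average delays in minutes: normal days versus the most recent readings.
baseline_delays = [10, 12, 9, 11, 13, 10, 12, 11]
current_delays = [11, 12, 30, 45]

flagged = flag_anomalies(baseline_delays, current_delays)
print("Readings that look abnormal:", flagged)
```

A check like this only reacts to values that are already unusual. The monitoring system the researchers envision would go further by tracking trends that precede an extreme event.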

Fan’s lab is continuing to develop such systems and has released an open-source tool called CalNF for analyzing system failures. Dawson is now applying these methods to understand failures in power networks.

The research team also included Max Li from the University of Michigan and Van Tran from Harvard University. The work was supported by NASA, the Air Force Office of Scientific Research, and the MIT-DSTA program.
