MIT Researchers Develop AI to Predict and Prevent Rare System Failures

Researchers at MIT have developed a computational system that combines sparse data from rare failure events with extensive data from normal operations to pinpoint the root causes of systemic breakdowns. The approach aims to improve the resilience of complex, interconnected systems ranging from airline scheduling to power grids.

The research, led by MIT doctoral student Charles Dawson, Professor Chuchu Fan, and colleagues from Harvard University and the University of Michigan, was presented at the International Conference on Learning Representations (ICLR) in Singapore. The team’s work was inspired by the cascading failures experienced by Southwest Airlines in December 2022, which stranded over 2 million passengers and cost the airline $750 million. This event highlighted how a localized issue could trigger widespread disruptions.

“The motivation behind this work is that it’s really frustrating when we have to interact with these complicated systems, where it’s really hard to understand what’s going on behind the scenes that’s creating these issues or failures that we’re observing,” says Dawson.

The new system builds on previous research from Fan’s lab, which focused on predicting failures in hypothetical scenarios involving robot teams or complex systems like power grids.

Fan explains, “The goal of this project was really to turn that into a diagnostic tool that we could use on real-world systems.”

The approach involves feeding the tool data recorded while a real-world system was failing. The AI then works backward to diagnose the root causes, exposing dynamics that are normally hidden from outside observers. The method is meant to apply to a broad range of cyber-physical problems, in which automated decision-making interacts with the messiness of the real world.

According to Dawson, such systems often suffer from decisions that initially appear sound but trigger a cascade of unforeseen consequences. Existing tools are adequate for testing standalone software, but challenges arise when software interacts with physical entities in dynamic real-world settings.

One key challenge was limited data. For robot teams, researchers have access to detailed models, but the inner workings of systems like airline scheduling are proprietary. The MIT team had to rely on publicly available data, such as arrival and departure times, to infer how the scheduling system behaved. Data on the failure itself covered only a few days, compared with years of data on normal flight operations.

The researchers found that the way reserve aircraft were deployed played a significant role in the Southwest Airlines crisis. By analyzing flight data, they were able to infer the hidden parameters of aircraft reserves and identify a “leading indicator” of the cascading problems.

“What we’re able to find using our method is, by looking at the public data on arrivals, departures, and delays, we can use our method to back out what the hidden parameters of those aircraft reserves could have been, to explain the observations that we were seeing,” Dawson explains.
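The idea of backing out a hidden parameter from public observations can be illustrated with a toy inverse problem. The sketch below is not the researchers' method or the CalNF tool; it is a minimal grid-based Bayesian estimate under invented assumptions: a hypothetical forward model in which any demand for spare aircraft beyond the reserve pool shows up as delayed flights, Gaussian observation noise, and made-up numbers for `demand` and `observed`.

```python
import math

def simulate_delays(demand, reserves):
    """Toy forward model (an assumption): demand beyond the reserve pool becomes delays."""
    return [max(0, d - reserves) for d in demand]

def log_likelihood(observed, predicted, sigma=1.0):
    """Gaussian log-likelihood (up to a constant) of observed delays given predictions."""
    return sum(-0.5 * ((o - p) / sigma) ** 2 for o, p in zip(observed, predicted))

def posterior_over_reserves(demand, observed, candidates, sigma=1.0):
    """Posterior over the hidden reserve count on a grid, using a uniform prior."""
    candidates = list(candidates)
    logs = [
        log_likelihood(observed, simulate_delays(demand, r), sigma)
        for r in candidates
    ]
    top = max(logs)  # subtract the maximum before exponentiating, for numerical stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return {r: w / total for r, w in zip(candidates, weights)}

# Hypothetical data: daily demand for spare aircraft and the delays seen publicly.
demand = [2, 3, 8, 12, 15, 9]
observed = [0, 0, 3, 7, 10, 4]

posterior = posterior_over_reserves(demand, observed, candidates=range(0, 11))
best_estimate = max(posterior, key=posterior.get)
print("Most plausible hidden reserve count:", best_estimate)
```

The reserve count is never observed directly; it is recovered because only one value of it makes the simulated delays line up with the observed ones. A real system would have many interacting hidden parameters and far noisier data, which is where more capable inference tools come in.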

Their analysis revealed that weather-related delays in Denver rapidly depleted the reserve aircraft there, and that the shortage then cascaded to other parts of the network, such as Las Vegas, even though those areas were not directly affected by the weather. This disruption of the aircraft circulation cycle ultimately forced Southwest to implement a “hard reset” of its entire system.
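The cascade described above can be sketched with a deliberately simple simulation. Everything here is a hypothetical illustration rather than a model of Southwest's actual network: the two-airport route, the reserve counts, and the rule that a fixed number of aircraft go missing at the first airport each step are all assumptions.

```python
def simulate_cascade(initial_reserves, route, shortfall_per_step, steps):
    """Toy cascade: a shortfall at the first airport is absorbed by reserves,
    and whatever cannot be absorbed is passed to the next airport on the route."""
    reserves = dict(initial_reserves)
    history = []
    for _ in range(steps):
        missing = shortfall_per_step  # aircraft grounded by weather this step
        uncovered = {}
        for airport in route:
            used = min(reserves[airport], missing)
            reserves[airport] -= used
            missing -= used
            uncovered[airport] = missing  # shortfall still open after this airport
        history.append((dict(reserves), uncovered))
    return history

# Hypothetical numbers: weather hits DEN only, yet LAS ends up short as well.
history = simulate_cascade(
    initial_reserves={"DEN": 4, "LAS": 3},
    route=["DEN", "LAS"],
    shortfall_per_step=2,
    steps=5,
)
for step, (reserves, uncovered) in enumerate(history):
    print(step, reserves, uncovered)
```

In this toy run, Denver's reserves are gone after two steps, Las Vegas starts drawing on its own reserves on the third, and by the fifth step flights at Las Vegas go uncovered even though the weather never touched it.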

The researchers are now working on developing a real-time monitoring system that compares normal operational data with current data to identify trends that could lead to extreme events. This could enable preemptive measures, such as redeploying reserve aircraft to areas of anticipated problems.
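A monitoring system of this kind could, in its simplest form, compare a rolling average of current measurements against a baseline built from normal operations. The sketch below is an assumption-laden illustration, not the system the researchers are building: the function name `leading_indicator_alerts`, the window size, the threshold, and the delay figures are all invented for the example.

```python
from statistics import mean, stdev

def leading_indicator_alerts(baseline, current, window=3, threshold=3.0):
    """Return indices in `current` where the rolling mean sits more than
    `threshold` baseline standard deviations above the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot scale deviations")
    alerts = []
    for end in range(window, len(current) + 1):
        recent = current[end - window:end]
        score = (mean(recent) - mu) / sigma
        if score > threshold:
            alerts.append(end - 1)  # index of the newest sample in the window
    return alerts

# Hypothetical average delay in minutes: normal operations vs. a developing event.
baseline = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]
current = [10, 11, 12, 14, 18, 25, 40]

alerts = leading_indicator_alerts(baseline, current)
print("Alert at sample indices:", alerts)
```

With these made-up numbers the first alert fires at index 4, while delays are still well below their eventual peak, which is the kind of early warning that could give operators time to move reserve aircraft.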

Fan’s lab is continuing to develop these systems. In the meantime, the team has released an open-source tool called CalNF for analyzing system failures. Dawson is now applying these methods to understand failures in power networks.

By combining sparse failure data with extensive data from normal operations, the MIT team’s approach offers a way to diagnose why a rare failure occurred in critical infrastructure and, with the monitoring work now under way, potentially to anticipate the next one.