
MIT Researchers Develop New Method to Improve Reliability of Radiologists’ Diagnostic Reports Using AI
A multidisciplinary team of researchers from MIT, in collaboration with Harvard Medical School-affiliated hospitals, has developed a novel framework to assess and improve the reliability of radiologists’ diagnostic reports. This innovative approach addresses the inherent ambiguity in medical images, such as X-rays, where radiologists often use terms like “may” or “likely” to describe the presence of a pathology.
The study reveals that radiologists tend to be overconfident when using phrases like “very likely” and underconfident with terms like “possibly.” The new framework quantifies the reliability of certainty phrases used by radiologists, providing suggestions to improve the accuracy of clinical reporting.
Peiqi Wang, an MIT graduate student and lead author of the research paper, emphasizes the importance of precise language in radiology reports. “The words radiologists use are important. They affect how doctors intervene, in terms of their decision making for the patient. If these practitioners can be more reliable in their reporting, patients will be the ultimate beneficiaries,” says Wang.
The team, led by senior author Polina Golland, a Sunlin and Priscilla Chou Professor of Electrical Engineering and Computer Science (EECS) at MIT, used clinical data to align the language radiologists use with how often pathologies actually occur. The framework also shows potential for improving the calibration of large language models, so that the confidence these models express matches their prediction accuracy.
The research highlights the challenges in interpreting ambiguous natural language terms like “possibly” and “likely.” Unlike existing calibration methods that rely on AI model confidence scores, this new approach treats certainty phrases as probability distributions, capturing the nuances of each word’s meaning.
Drawing on prior surveys of radiologists, the researchers obtained probability distributions corresponding to various diagnostic certainty phrases. They then formulated an optimization problem that adjusts how often particular phrases are used, so that expressed confidence aligns better with reality. The result is a calibration map that suggests which certainty terms to use to make reports more reliable.
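To make the idea of a calibration map concrete, here is a minimal Python sketch. It is not the researchers' optimization: it collapses each phrase's survey distribution to its mean and maps each phrase to the phrase whose mean lies closest to the observed rate of confirmed findings. The class `PhraseStats`, the function names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PhraseStats:
    """Hypothetical record for one certainty phrase (illustrative only)."""
    phrase: str
    survey_probs: Tuple[float, ...]  # probabilities survey respondents attach to the phrase
    times_used: int                  # reports in which the phrase appeared
    times_confirmed: int             # of those, how often the pathology was present


def mean(values: Tuple[float, ...]) -> float:
    return sum(values) / len(values)


def calibration_gap(stats: PhraseStats) -> float:
    """Intended probability minus observed frequency.

    Positive means overconfident, negative means underconfident.
    """
    observed = stats.times_confirmed / stats.times_used
    return mean(stats.survey_probs) - observed


def calibration_map(phrases: List[PhraseStats]) -> Dict[str, str]:
    """Map each phrase to the phrase whose mean survey probability is
    closest to the frequency actually observed for the original phrase."""
    suggestions = {}
    for stats in phrases:
        observed = stats.times_confirmed / stats.times_used
        best = min(phrases, key=lambda p: abs(mean(p.survey_probs) - observed))
        suggestions[stats.phrase] = best.phrase
    return suggestions


# Made-up numbers, not data from the study.
EXAMPLE = [
    PhraseStats("possibly", (0.2, 0.3, 0.4), times_used=100, times_confirmed=45),
    PhraseStats("likely", (0.6, 0.7, 0.8), times_used=100, times_confirmed=70),
    PhraseStats("very likely", (0.85, 0.9, 0.95), times_used=100, times_confirmed=75),
]

for stats in EXAMPLE:
    print(stats.phrase, round(calibration_gap(stats), 2))
print(calibration_map(EXAMPLE))
```

In this toy data, “very likely” is attached to findings that turn out to be present less often than the phrase implies, so the sketch suggests a weaker term. Working with the full distributions rather than their means, as the article describes, would preserve the spread of meanings each phrase carries.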
The study found that radiologists were generally underconfident when diagnosing common conditions like atelectasis, but overconfident with more ambiguous conditions like infection. This framework offers a pathway for improving both the reliability of reporting and patient care.
Atul B. Shinagare, associate professor of radiology at Harvard Medical School, who was not involved in the study, notes the potential impact of this research: “This study takes a novel approach to analyzing and calibrating how radiologists express diagnostic certainty in chest X-ray reports, offering feedback on term usage and associated outcomes. This approach has the potential to improve radiologists’ accuracy and communication, which will help improve patient care.”
Future research will focus on extending the study to abdominal CT scans and on assessing how receptive radiologists are to calibration-improving suggestions. The team also wants to see whether radiologists can effectively adjust their use of certainty phrases.



