
LLMs factor in unrelated information when recommending medical treatments
A recent groundbreaking study by MIT researchers has uncovered a significant flaw in Large Language Models (LLMs) when deployed for medical treatment recommendations: their susceptibility to non-clinical information. The research highlights that seemingly innocuous elements in patient messages, such as typos, extra white space, missing gender markers, or the use of uncertain and informal language, can dramatically alter an LLM’s clinical judgment, leading to potentially erroneous advice.
The study found that making minor stylistic or grammatical changes to patient communications substantially increases the likelihood that an LLM will recommend self-management for a health condition, even when professional medical care is warranted. This poses a serious risk to patient safety, particularly in high-stakes healthcare environments where LLMs are increasingly being integrated to streamline tasks and assist overburdened clinicians.
One of the most alarming findings was the disproportionate impact on female patients. The analysis revealed that non-clinical variations in text were more likely to lead to incorrect self-management recommendations for women, even when all gender cues were removed from the clinical context. Human doctors, in contrast, consistently advised seeking medical care for these cases, underscoring a critical bias in the LLMs.
Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and senior author of the study, emphasized the urgent need for rigorous auditing of LLMs before their deployment in healthcare. “This work is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” Ghassemi stated.
Abinitha Gourabathina, an EECS graduate student and lead author of the study, noted that LLMs are often trained and tested on structured medical exam questions, which contrasts sharply with the messy reality of real-world patient communication. “There is still so much about LLMs that we don’t know,” Gourabathina added, stressing the need for more in-depth research before high-stakes applications.
The researchers, including graduate student Eileen Pan and postdoc Walter Gerych, designed their study to mimic realistic communication challenges. They perturbed thousands of patient notes with changes such as extra white space and typos, mimicking messages from patients with limited English proficiency or less technological aptitude, and uncertain language, mimicking patients with health anxiety. They then evaluated four LLMs, including the large commercial model GPT-4 and a smaller LLM built specifically for medical settings, as sketched below.
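For illustration, here is a minimal sketch of how such non-clinical perturbations might be applied to a patient message. The function names and perturbation rules are assumptions made for this example, not the researchers' actual code.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent letters to mimic typos (illustrative rule, not the study's)."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def add_extra_whitespace(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Insert occasional extra spaces between words."""
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(w + " " if rng.random() < rate else w for w in words)

def add_uncertain_language(text: str) -> str:
    """Prepend a hedging phrase to mimic uncertain or anxious phrasing."""
    return "I'm not sure, but I think maybe " + text[0].lower() + text[1:]

message = "I have had sharp chest pain and shortness of breath since yesterday."
perturbed = add_uncertain_language(add_extra_whitespace(add_typos(message)))
print(perturbed)
```

A perturbed message like this keeps the clinical content intact, which is why a change in the model's recommendation signals sensitivity to non-clinical cues rather than to the patient's actual condition.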
Across the board, the LLMs showed a concerning 7 to 9 percent increase in self-management suggestions when fed perturbed data. The use of “colorful language” (slang or dramatic expressions) had the most significant impact on their recommendations. Alarmingly, many of these errors, where patients with serious conditions were advised to self-manage, would likely be missed by traditional accuracy tests focused on overall clinical performance.
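As a rough illustration of how such a shift could be quantified, the snippet below compares the rate of self-management recommendations on original versus perturbed messages. The triage function is a stand-in placeholder (a real evaluation would query an LLM), and the labels are assumptions for this example.

```python
from typing import Callable, List

def self_management_rate(messages: List[str],
                         triage_fn: Callable[[str], str]) -> float:
    """Fraction of messages for which the model recommends self-management."""
    labels = [triage_fn(m) for m in messages]
    return sum(label == "self-manage" for label in labels) / len(labels)

def recommendation_shift(original: List[str],
                         perturbed: List[str],
                         triage_fn: Callable[[str], str]) -> float:
    """Percentage-point increase in self-management advice after perturbation."""
    return 100 * (self_management_rate(perturbed, triage_fn)
                  - self_management_rate(original, triage_fn))

# Stand-in triage function; a real study would call an LLM here.
def dummy_triage(message: str) -> str:
    return "self-manage" if "maybe" in message.lower() else "seek-care"

orig = ["I have sharp chest pain.", "My wound looks infected."]
pert = ["I'm not sure, but maybe I have sharp chest pain.",
        "Maybe my wound looks infected, I guess."]
print(f"Shift: {recommendation_shift(orig, pert, dummy_triage):.1f} percentage points")
```

Comparing paired original and perturbed messages in this way isolates the effect of the non-clinical edits, which is exactly the kind of error an aggregate accuracy score on clean exam-style questions would not surface.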
Further follow-up work by the researchers confirmed that human clinicians were not affected by these same changes in patient messages, highlighting a fundamental fragility in LLMs compared to human judgment. “LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don’t want to optimize a health care system that only works well for patients in specific groups,” Ghassemi elaborated.
This research underscores a critical need for more robust studies of LLMs in real-world clinical contexts, with a focus on understanding and mitigating biases introduced by non-clinical input. Future work will explore how LLMs infer gender from text and aim to design natural language perturbations that capture other vulnerable patient populations more accurately.
