
Vision-Language Models Struggle with Negation: MIT Study Reveals Critical Flaw
Vision-language models (VLMs) are increasingly being used to speed up diagnosis in healthcare and identify defects in manufacturing plants. However, a recent study from MIT reveals a critical flaw: these models struggle with negation, misunderstanding words like “no” and “doesn’t”. This limitation can lead to potentially catastrophic consequences in high-stakes real-world situations.
The study highlights a scenario in which a radiologist uses a VLM to search for reports of similar patients after examining a chest X-ray that shows tissue swelling but no enlarged heart. If the model ignores the negation and retrieves reports of patients who have both tissue swelling and an enlarged heart, the most likely diagnosis could look quite different, leading to a significant diagnostic error.
Kumail Alhamoud, the lead author of the study and an MIT graduate student, warns, “Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences.”
The researchers tested VLMs’ ability to identify negation in image captions and found that their performance was often no better than a random guess. To address this, they created a dataset of images paired with captions that use negation words to describe objects missing from those images. Retraining a VLM on this dataset improved its ability to retrieve images that do not contain certain objects and boosted its accuracy on multiple-choice questions with negated captions.
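For intuition, an entry in such a negation-aware caption set might look something like the records below. The schema, file names, and wording are illustrative guesses, not the format of the dataset released with the study.

```python
# Illustrative (made-up) records for a negation-aware caption set: each caption
# describes what is in the image and explicitly negates a plausible-but-absent object.
negation_examples = [
    {
        "image": "beach_sunset.jpg",
        "caption": "a sunset over an empty beach, with no people in the water",
        "absent_object": "people",
    },
    {
        "image": "kitchen_table.jpg",
        "caption": "a wooden kitchen table that does not have any plates on it",
        "absent_object": "plates",
    },
]
```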
Despite these improvements, the researchers emphasize that more work is needed to address the underlying causes of this problem. They hope their findings will alert users to this previously unnoticed shortcoming, which could have serious implications in various fields.
Marzyeh Ghassemi, senior author and associate professor at MIT’s Department of Electrical Engineering and Computer Science (EECS), notes, “This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation.”
The core issue lies in how VLMs are trained. They learn to associate images with positive labels in captions, but rarely encounter examples of negation. As Ghassemi explains, “The captions express what is in the images — they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters.’”
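To make the failure mode concrete, here is a minimal probe of that bias using an off-the-shelf CLIP-style model from the Hugging Face transformers library. This is not the study's benchmark; the checkpoint, image file, and captions are assumptions chosen for illustration.

```python
# Hypothetical probe of "affirmation bias" in a CLIP-style VLM.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Assumed local file: a photo of a dog jumping over a fence, with no helicopter in it.
image = Image.open("dog_fence.jpg")

captions = [
    "a dog jumping over a fence",                       # affirmative, correct
    "a dog jumping over a fence, with no helicopter",   # negated, also correct
    "a dog jumping over a fence, next to a helicopter", # affirmative, incorrect
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher scores mean the model thinks a caption matches the image better.
# If the model largely ignores "no helicopter," the negated caption may score
# no better than the incorrect affirmative one.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze()
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```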
To delve deeper, the researchers designed benchmark tasks to test VLMs’ understanding of negation. They used a large language model (LLM) to re-caption images, asking it to mention related objects that do not appear in each image, and then tested whether the VLMs could retrieve images when the query contained negation. They also created multiple-choice questions in which the candidate captions differed only in whether they asserted or negated the presence of an object in the image.
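As a rough illustration of how such a multiple-choice test can be scored with an embedding-based VLM, the sketch below builds caption pairs that differ only in whether an absent object is negated or affirmed, then counts how often the model prefers the correct option. The example items, file names, and use of a CLIP checkpoint are assumptions for illustration, not the paper's actual pipeline, which relied on an LLM to generate the negated captions.

```python
# Illustrative scoring loop for a negation multiple-choice benchmark.
# Each item: an image plus a base caption and a plausible-but-absent object.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical benchmark items; in the study, an LLM proposed the absent objects.
items = [
    {"image": "dog_fence.jpg",    "base": "a dog jumping over a fence", "absent": "helicopter"},
    {"image": "empty_street.jpg", "base": "an empty street at night",   "absent": "car"},
]

correct = 0
for item in items:
    image = Image.open(item["image"])
    options = [
        f'{item["base"]}, with no {item["absent"]}',  # correct: the object is absent
        f'{item["base"]}, with a {item["absent"]}',   # incorrect distractor
    ]
    inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)
    if scores.argmax().item() == 0:  # did the model pick the negated (correct) caption?
        correct += 1

print(f"accuracy: {correct / len(items):.2f}  (random chance here is 0.50)")
```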
The models struggled, with image retrieval performance dropping by nearly 25 percent with negated captions. In multiple-choice questions, the best models achieved only about 39 percent accuracy, with some performing at or below random chance. Alhamoud points out that VLMs exhibit an “affirmation bias,” ignoring negation words and focusing on objects in the images instead, regardless of how negation is expressed.
By fine-tuning VLMs with their new dataset, the researchers achieved performance gains, improving image retrieval by about 10 percent and multiple-choice question accuracy by about 30 percent. Alhamoud suggests that while their solution is not perfect, it signals that the problem is solvable, encouraging others to build upon their work.
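The sketch below shows the general shape of such fine-tuning, assuming a CLIP-style model trained with its standard contrastive objective on a handful of negation-augmented image-caption pairs. The checkpoint, file names, captions, and hyperparameters are placeholders; the study's actual recipe may differ.

```python
# Sketch: fine-tune a CLIP-style model on negation-augmented caption data.
# The pairs below are placeholders; the study built its captions with an LLM.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Negation-augmented pairs: captions explicitly mention objects that are absent.
train_pairs = [
    ("dog_fence.jpg",    "a dog jumping over a fence, with no helicopter"),
    ("empty_street.jpg", "an empty street at night, with no cars"),
]

images = [Image.open(path) for path, _ in train_pairs]
texts = [caption for _, caption in train_pairs]
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

# return_loss=True makes the model compute its built-in contrastive loss over the batch.
# In practice you would stream larger batches through a DataLoader, since the
# contrastive objective needs many in-batch negatives to be effective.
model.train()
for epoch in range(3):
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {outputs.loss.item():.4f}")
```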
The researchers recommend that users carefully evaluate VLMs before deployment, considering the specific problem they are trying to solve. Future research could focus on teaching VLMs to process text and images separately and developing application-specific datasets, such as those for healthcare.