
Vision-Language Models Struggle with Negation, MIT Study Reveals
In a recent study from MIT, researchers have uncovered a significant limitation in vision-language models (VLMs): their inability to accurately process queries containing negation words. This deficiency can lead to critical errors in real-world applications, particularly those requiring precise understanding of what is *not* present in an image.
The study highlights a scenario in which a radiologist uses a VLM to analyze chest X-rays. If the model cannot distinguish a patient who has both tissue swelling and an enlarged heart from one who has tissue swelling but *no* enlarged heart, the diagnosis could be severely compromised. The presence or absence of a single condition drastically alters the likely underlying causes.
Kumail Alhamoud, an MIT graduate student and lead author of the study, emphasizes the potential for “catastrophic consequences” if these models are used without careful evaluation. The research, detailed in a paper published on arXiv, demonstrates that VLMs often perform no better than random chance when identifying negation in image captions.
To address this issue, the researchers created a dataset of images paired with captions that use negation words to describe objects missing from those images. Retraining VLMs on this dataset improved their performance both at retrieving images that do not contain certain objects and at answering multiple-choice questions with negated captions.
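As a rough illustration of that kind of augmentation (not the authors' actual pipeline), negated captions can be generated by pairing each image's original caption with templates naming objects known to be absent; the templates and object lists below are purely hypothetical:

```python
# Rough illustration of negation-aware caption augmentation: pair an
# original caption with variants that explicitly name absent objects.
# The templates and the choice of absent objects are assumptions here,
# not the procedure used in the MIT paper.
def make_negated_captions(caption: str, absent_objects: list[str]) -> list[str]:
    templates = [
        "{caption}, with no {obj}",
        "{caption}; there is no {obj} in the image",
    ]
    return [
        template.format(caption=caption, obj=obj)
        for obj in absent_objects
        for template in templates
    ]

print(make_negated_captions("a dog on a couch", ["cat", "ball"]))
# ['a dog on a couch, with no cat',
#  'a dog on a couch; there is no cat in the image',
#  'a dog on a couch, with no ball',
#  'a dog on a couch; there is no ball in the image']
```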
Despite these improvements, the researchers caution that this is not a complete solution. Marzyeh Ghassemi, an associate professor at MIT, stresses the need for intensive evaluation before deploying VLMs in high-stakes environments. This includes applications such as determining patient treatments or identifying product defects.
The core of the problem lies in how VLMs are trained. These models learn from vast collections of images and their corresponding captions, encoding each as a numerical vector so that matching image-caption pairs end up close together. Because captions describe what is in an image and rarely mention what is absent, VLMs get almost no exposure to negation and struggle to understand what is *not* present in an image.
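To make the vector-encoding idea concrete, here is a minimal sketch of how a CLIP-style VLM scores candidate captions against an image. It uses the openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers as a stand-in; the specific models and images evaluated in the study may differ.

```python
# Minimal sketch of CLIP-style image-caption scoring (not the paper's code).
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # hypothetical example image
captions = [
    "a chest X-ray showing an enlarged heart",
    "a chest X-ray with no enlarged heart",  # negated caption
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity for each caption;
# softmax turns the scores into a distribution over the two candidates.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```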
The researchers designed benchmark tasks to further investigate this issue. They used a large language model (LLM) to re-caption images, adding descriptions of objects *not* in the image. They then tested the VLMs’ ability to retrieve images based on prompts with negation words. They also created multiple-choice questions where the captions differed only by the presence or absence of an object.
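The multiple-choice task can be approximated with the same scoring idea: the model picks whichever candidate caption has the highest image-text similarity, and accuracy is compared against chance. A hedged sketch follows; the data structure and field names are illustrative, not the paper's format.

```python
# Illustrative multiple-choice evaluation where candidate captions differ
# only by negation (e.g. "with a tube" vs. "with no tube"). The example
# format below is an assumption, not the benchmark's actual schema.
import torch

def evaluate_mcq(model, processor, examples):
    """examples: list of dicts with 'image' (a PIL.Image), 'choices'
    (list of candidate captions), and 'answer' (index of the correct one)."""
    correct = 0
    for ex in examples:
        inputs = processor(
            text=ex["choices"], images=ex["image"],
            return_tensors="pt", padding=True,
        )
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape: (1, n_choices)
        prediction = logits.argmax(dim=-1).item()
        correct += int(prediction == ex["answer"])
    accuracy = correct / len(examples)
    # With two choices per question, accuracy near 0.5 is no better than chance.
    return accuracy
```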
The results were telling: image retrieval performance dropped significantly with negated captions, and accuracy on multiple-choice questions was often at or below random chance. This failure is attributed to what the researchers call “affirmation bias,” where VLMs tend to ignore negation words and focus solely on the objects present in the images.
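One simple way to probe such affirmation bias (an illustration, not the paper's analysis) is to compare the text embedding of a caption with that of its negated counterpart: if the encoder largely ignores negation words, the two embeddings come out nearly identical.

```python
# Probe for affirmation bias: if the text encoder ignores negation words,
# "a photo of a dog" and "a photo with no dog" map to almost the same vector.
# Uses CLIP's text encoder as a stand-in for the models in the study.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

pair = ["a photo of a dog", "a photo with no dog"]
inputs = tokenizer(pair, return_tensors="pt", padding=True)
with torch.no_grad():
    embeddings = model.get_text_features(**inputs)

# A cosine similarity close to 1.0 means the negation barely moved the embedding.
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```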
While acknowledging the limitations of their data augmentation approach, the researchers hope their work shows that the problem is solvable and encourages others to build on their solution. They also advise users to test VLMs carefully before deploying them in real-world applications. Future research directions include teaching VLMs to process text and images separately and developing application-specific datasets for domains such as healthcare.



