Study shows vision-language models can’t handle queries with negation words

In a groundbreaking new study, researchers from MIT have uncovered a critical vulnerability in widely used vision-language models (VLMs): their alarming inability to comprehend negation. This fundamental flaw, highlighted by scenarios like a radiologist misinterpreting a chest X-ray, poses significant risks in high-stakes applications, from medical diagnosis to industrial quality control.

Imagine a radiologist using an AI tool to sift through patient records, seeking those with tissue swelling but “no” enlarged heart. If the VLM fails to process the “no,” it could erroneously flag cases with both conditions, leading to vastly different, potentially incorrect, diagnostic paths. As Kumail Alhamoud, an MIT graduate student and lead author of the study, warns, “Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences.”

The study, which will be presented at the Conference on Computer Vision and Pattern Recognition, reveals that VLMs often perform no better than a random guess when asked to identify negation in image captions. This deficiency stems from their training methodology. VLMs learn by associating images with positive captions – for instance, “a dog jumping over a fence.” Crucially, their training datasets rarely contain examples of what is not in an image (e.g., “a dog jumping over a fence, with no helicopters”). This absence of negative examples prevents the models from learning to process words like “no,” “doesn’t,” or “without.”

This inherent limitation leads to what the researchers term “affirmation bias,” where VLMs tend to ignore negation words and focus solely on the objects present in the images. This bias was consistent across every VLM tested, regardless of how negation was expressed. Marzyeh Ghassemi, a senior author and associate professor at MIT’s Department of Electrical Engineering and Computer Science (EECS), emphasized the gravity of the finding: “If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation.”
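To make the failure mode concrete, here is a minimal sketch (not taken from the study) that scores a single image against an affirmative caption and two negated variants using an off-the-shelf CLIP model from Hugging Face. The image path and captions are illustrative placeholders; an affirmation-biased model tends to key on the mentioned objects and largely ignore the “no.”

```python
# Minimal sketch of "affirmation bias" with an off-the-shelf CLIP model.
# The image file and captions below are illustrative placeholders, not
# examples from the MIT benchmark.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_jumping_over_fence.jpg")  # hypothetical photo of a dog jumping a fence

captions = [
    "a dog jumping over a fence",                       # affirmative caption
    "a dog jumping over a fence, with no helicopters",  # negation of an absent object
    "a photo with no dog",                              # negation that contradicts the image
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores. A model with affirmation
# bias often scores the contradictory negated caption nearly as high as the true
# one, because it focuses on the object words and skips over the "no".
for caption, score in zip(captions, outputs.logits_per_image[0].tolist()):
    print(f"{score:6.2f}  {caption}")
```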

To quantify this issue, the MIT team devised two benchmark tasks. First, they used a large language model (LLM) to generate new captions for existing images, intentionally incorporating negation words describing absent objects. When tested on retrieving images based on these negated captions, VLM performance plummeted by nearly 25 percent. In the second task, multiple-choice questions where captions differed only by the presence or absence of negation, the best models achieved a mere 39 percent accuracy, with some performing at or below random chance.
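A hedged sketch of how such a multiple-choice evaluation can be run in practice is shown below, again using an off-the-shelf CLIP model as a stand-in for the VLMs in the study. The example record, file name, and two-option format are assumptions for illustration, not the paper’s actual benchmark data.

```python
# Sketch of a multiple-choice evaluation: for each image, score candidate
# captions that differ only in negation and count how often the model picks
# the correct one. The dataset format here is an assumed, simplified layout.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

examples = [
    {   # hypothetical example; a real benchmark would contain thousands of these
        "image_path": "street_scene.jpg",
        "candidates": [
            "a street with cars but no pedestrians",
            "a street with cars and pedestrians",
        ],
        "correct": 0,
    },
]

hits = 0
for ex in examples:
    image = Image.open(ex["image_path"])
    inputs = processor(text=ex["candidates"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # similarity to each candidate
    if scores.argmax().item() == ex["correct"]:
        hits += 1

accuracy = hits / len(examples)
print(f"multiple-choice accuracy: {accuracy:.1%} (random chance with 2 options = 50%)")
```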

Recognizing that this is a “solvable problem,” the researchers took a crucial first step toward a remedy. They developed new datasets containing millions of image-text caption pairs, prompting an LLM to propose related captions that explicitly stated what was excluded from the images. By fine-tuning VLMs with this negation-rich dataset, they observed notable performance gains: image retrieval abilities improved by approximately 10 percent, and accuracy in the multiple-choice question answering task jumped by about 30 percent.
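The sketch below illustrates the general data-augmentation idea: prompt an LLM to rewrite each caption so that it explicitly names a plausible object that is absent from the image, yielding negation-rich caption pairs that can be mixed into fine-tuning. The prompt wording, model name, and dataset layout are assumptions for illustration, not the researchers’ actual pipeline.

```python
# Sketch of negation-rich caption augmentation via an LLM. Prompt text, model
# choice, and data layout are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def negate_caption(caption: str) -> str:
    """Ask the LLM for a variant of `caption` that also states what is absent."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this image caption so it also explicitly mentions one "
                "plausible object that is NOT in the image, using a negation "
                f"word such as 'no' or 'without': {caption}"
            ),
        }],
    )
    return response.choices[0].message.content.strip()

# Hypothetical image-caption pairs; a real run would iterate over millions.
pairs = [("beach.jpg", "a child building a sandcastle on a beach")]

augmented = [(img, cap, negate_caption(cap)) for img, cap in pairs]
for img, original, negated in augmented:
    print(img, "|", original, "|", negated)

# The resulting (image, negated caption) pairs can then be added to the
# fine-tuning set so the model actually sees negation words during training.
```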

The researchers acknowledge that their fix is essentially a form of data augmentation rather than a complete solution, but it serves as a strong signal for future research. The team hopes their work encourages fellow researchers to delve deeper into the root causes and prompts potential users to rigorously test VLMs for such shortcomings before deployment. Future avenues include exploring methods to process text and images separately within VLMs and creating application-specific datasets, particularly for critical fields like healthcare.

The comprehensive study was a collaborative effort involving Kumail Alhamoud, Marzyeh Ghassemi, Shaden Alshammari, Yonglong Tian of OpenAI, Guohao Li, Philip H.S. Torr, and Yoon Kim. Their findings underscore the urgent need for robust evaluation and fundamental improvements in AI models before their widespread application in sensitive domains.
