
Study Reveals Vision-Language Models Struggle with Negation in Queries
Vision-language models (VLMs), a type of AI used in applications ranging from medical diagnosis to manufacturing, often fail to understand negation, according to a recent study by MIT researchers. The flaw can lead to serious errors in real-world settings.
The study, led by Kumail Alhamoud, an MIT graduate student, highlights that VLMs struggle with words like “no” and “doesn’t,” which specify what is false or absent. This misunderstanding can drastically alter the interpretation of information, leading to incorrect conclusions. For instance, in a medical context, a VLM might misinterpret a chest X-ray report, potentially affecting a patient’s diagnosis and treatment.
“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Alhamoud. The research paper is available on arXiv.
The researchers tested VLMs’ ability to identify negation in image captions and found that the models performed poorly. To address this, they created a dataset of images paired with captions that use negation words to describe missing objects. Retraining VLMs on this dataset improved their ability to retrieve images that do not contain certain objects and boosted accuracy on multiple-choice question answering with negated captions.
Despite these improvements, the researchers caution that more work is needed to address the fundamental causes of this problem. They emphasize the importance of intensive evaluation before deploying VLMs in high-stakes settings.
Marzyeh Ghassemi, an associate professor at MIT, stresses the broader implications: “If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now — without intensive evaluation.”
The team’s findings reveal that VLMs are primarily trained on image-caption datasets that lack examples of negation, so the models never learn to identify it. A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding caption.
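To make that dual-encoder setup concrete, here is a minimal sketch using the open-source CLIP model from Hugging Face Transformers as a stand-in for the VLMs in the study; the exact models the researchers tested are not named here, and the image file and captions below are hypothetical. The image and each caption are embedded into a shared space and compared by similarity.

```python
# Minimal dual-encoder sketch: CLIP as an illustrative stand-in for the VLMs
# discussed above. The image file name and captions are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical example image
captions = [
    "a street with cars and pedestrians",  # matching caption
    "an empty operating room",             # unrelated caption
]

# The text encoder and image encoder produce vectors in the same space;
# logits_per_image holds the scaled cosine similarity between the image
# and each caption.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))
```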
To probe the problem further, the researchers designed two benchmark tasks. First, they used a large language model (LLM) to re-caption images so that the captions mention related objects not present in the image, then tested whether the models could retrieve images that contain certain objects but not others. Second, they designed multiple-choice questions that require a VLM to select the most appropriate caption from closely related options differing only in whether an object is present or negated.
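A hedged sketch of how such a multiple-choice task might be scored with a CLIP-style model follows; the items, file names, and answer keys are illustrative, not the researchers’ actual benchmark.

```python
# Toy multiple-choice evaluation: for each image, pick the caption with the
# highest image-text similarity among options that differ only in whether an
# object is affirmed or negated. All data here is hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Each item: (image path, candidate captions, index of the correct caption).
items = [
    ("kitchen.jpg", ["a kitchen with a stove but no refrigerator",
                     "a kitchen with a stove and a refrigerator"], 0),
    ("park.jpg",    ["a park with benches and no dogs",
                     "a park with benches and dogs"], 1),
]

correct = 0
for path, options, answer in items:
    inputs = processor(text=options, images=Image.open(path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(0)
    correct += int(scores.argmax().item() == answer)

print(f"multiple-choice accuracy: {correct / len(items):.2f}")
```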
The models failed frequently on both tasks, with image-retrieval performance dropping sharply for negated captions. The researchers attribute this to an “affirmation bias”: the VLMs ignore negation words and focus on the objects in the images. The issue persisted across every VLM they tested.
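The retrieval failure can be probed directly by ranking a small image pool against a negated query, as in the sketch below; the query, image files, and pool are assumptions for illustration. Affirmation bias shows up as high scores for images that contain the very object the query negates.

```python
# Toy retrieval probe for a negated query; file names are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "a living room with a sofa but no television"
image_paths = ["room_with_tv.jpg", "room_without_tv.jpg", "empty_room.jpg"]

text_inputs = processor(text=[query], return_tensors="pt", padding=True)
image_inputs = processor(images=[Image.open(p) for p in image_paths],
                         return_tensors="pt")

with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between the negated query and each image; a model with
# affirmation bias tends to rank images *containing* the television highest,
# because it keys on the mentioned object rather than the negation.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```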
To mitigate this, the researchers built datasets with negation words, using an LLM to generate captions that specify what is excluded from each image. Fine-tuning VLMs on these datasets improved image-retrieval performance and multiple-choice question-answering accuracy.
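As a rough illustration of that fine-tuning recipe, the sketch below continues with CLIP and its standard contrastive loss over a tiny batch of negation-augmented image-caption pairs; the data, batch size, and learning rate are assumptions, not the researchers’ actual training setup.

```python
# Toy fine-tuning pass with negation-augmented captions; CLIP's built-in
# contrastive loss (return_loss=True) aligns each image with its own caption.
# File names, captions, and the learning rate are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Hypothetical negation-augmented pairs: each caption states what is missing.
pairs = [
    ("beach.jpg",  "a beach with umbrellas but no people"),
    ("office.jpg", "an office with desks and no windows"),
]

model.train()
images = [Image.open(p) for p, _ in pairs]
captions = [c for _, c in pairs]
inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)

loss = model(**inputs, return_loss=True).loss  # contrastive loss over the batch
loss.backward()
optimizer.step()
optimizer.zero_grad()
```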
Alhamoud suggests users carefully consider the problems they want to solve with VLMs and test them thoroughly before deployment. Future research could focus on teaching VLMs to process text and images separately and developing application-specific datasets, such as those for health care.