
AI Learns Audio-Visual Connections Without Human Help
In a step toward AI systems that process the world more like humans do, researchers at MIT and other institutions have developed a new model that learns the intricate connections between vision and sound without relying on human-provided labels. The approach mimics how humans naturally learn, such as recognizing that a cellist’s movements create the music we hear.
The model’s enhanced ability to align audio and visual data from video clips could revolutionize fields like journalism and film production. Imagine AI automatically curating multimodal content or efficiently retrieving specific video and audio segments.
Looking ahead, this technology holds immense potential for improving robots’ understanding of real-world environments, where auditory and visual information are closely intertwined. The work builds on the team’s earlier research into machine-learning models that align corresponding audio and visual data from video clips.
The team adjusted their original model’s training process, enabling it to learn a more precise correlation between individual video frames and the accompanying audio. They also implemented architectural adjustments that help the system balance two distinct learning objectives, ultimately boosting performance.
These improvements significantly enhance the accuracy of video retrieval tasks and the classification of actions in audiovisual scenes. For instance, the new method can accurately link the sound of a door slamming with the visual of it closing in a video clip.
According to Andrew Rouditchenko, an MIT graduate student and co-author of the research paper, “We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications.”
The research, which will be presented at the Conference on Computer Vision and Pattern Recognition, introduces CAV-MAE Sync, an improved model building upon the initial CAV-MAE. This model processes unlabeled video clips, encoding visual and audio data separately into tokens. Utilizing the recording’s natural audio, it learns to map corresponding audio and visual tokens close together within its internal representation space.
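Mapping corresponding audio and visual tokens close together in a shared representation space is the core idea behind contrastive alignment. The sketch below illustrates that general technique, not the authors’ actual code: the encoder dimensions, projection layers, and temperature value are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of contrastive audio-visual alignment, assuming simple
# linear projections; the real CAV-MAE Sync architecture differs.
class AudioVisualAligner(nn.Module):
    def __init__(self, audio_dim=128, video_dim=768, embed_dim=256):
        super().__init__()
        # Audio and visual tokens are embedded separately, as described above.
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.temperature = 0.07  # illustrative value

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, audio_dim), video_feats: (batch, video_dim)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        # Similarity between every audio clip and every video clip in the batch.
        logits = a @ v.t() / self.temperature
        targets = torch.arange(a.size(0), device=a.device)
        # Matching pairs (the diagonal) are pulled together in the shared
        # representation space; mismatched pairs are pushed apart.
        loss = (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
        return loss
```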
The researchers found that balancing two learning objectives, a contrastive objective that aligns the two modalities and a reconstruction objective that recovers the input details, allowed the model to learn audio-visual correspondence while improving its ability to retrieve video clips that match user queries.
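In practice, balancing two objectives often comes down to a weighted sum of their losses. The snippet below is a generic sketch of that idea; the weighting scheme and coefficient are assumptions, not values reported by the authors.

```python
def total_loss(contrastive_loss, reconstruction_loss, lam=0.5):
    # lam trades off alignment (contrastive) against detail
    # preservation (reconstruction); 0.5 is an arbitrary example.
    return lam * contrastive_loss + (1.0 - lam) * reconstruction_loss

# Example usage with scalar losses:
print(total_loss(0.8, 1.3))  # 1.05
```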
CAV-MAE Sync splits the audio into smaller windows, generating separate representations corresponding to each window of audio. During training, the model learns to associate one video frame with the audio that occurs during that specific frame.
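The finer-grained pairing boils down to matching each video frame with the audio window that covers its timestamp. The helper below is an illustrative sketch of that bookkeeping; the frame rate and window length are assumptions, not the paper’s settings.

```python
def frame_to_audio_window(frame_index, fps=25.0, window_sec=0.5):
    """Return the index of the audio window co-occurring with a video frame."""
    frame_time = frame_index / fps          # seconds into the clip
    return int(frame_time // window_sec)    # which audio window covers it

# Example: with 25 fps video and 0.5 s windows, frame 30 (t = 1.2 s)
# falls in audio window 2, which covers 1.0-1.5 s.
assert frame_to_audio_window(30) == 2
```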
The model also incorporates dedicated “global tokens” for contrastive learning and “register tokens” to focus on important details for reconstruction, adding “wiggle room” and improving overall performance.
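One common way to realize such dedicated tokens is to prepend learnable vectors to the patch-token sequence before it enters the transformer. The sketch below shows that general pattern; the token counts, dimensions, and class name are hypothetical and not taken from the CAV-MAE Sync implementation.

```python
import torch
import torch.nn as nn

class TokenFrontEnd(nn.Module):
    """Illustrative front end that adds global and register tokens."""
    def __init__(self, embed_dim=256, n_global=1, n_register=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(1, n_global, embed_dim))
        self.register_tokens = nn.Parameter(torch.randn(1, n_register, embed_dim))

    def forward(self, patch_tokens):
        # patch_tokens: (batch, n_patches, embed_dim) from the audio or visual encoder.
        b = patch_tokens.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        r = self.register_tokens.expand(b, -1, -1)
        # The transformer attends over [global | register | patch] tokens.
        # Global tokens would feed the contrastive objective, while register
        # tokens give the reconstruction objective extra "wiggle room".
        return torch.cat([g, r, patch_tokens], dim=1)
```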
Ultimately, these enhancements significantly improved the model’s ability to retrieve videos based on audio queries and to predict the class of an audio-visual scene, yielding more accurate results than both the team’s prior work and more complex methods that require larger amounts of training data.
The team plans to integrate new models that generate better data representations into CAV-MAE Sync and enable the system to handle text data, paving the way for an audiovisual large language model.
