AI Learns Audio-Visual Connection Without Human Help

In a stride toward more human-like AI, researchers from MIT and collaborating institutions have developed a new approach enabling AI models to learn the connections between vision and sound without human intervention. This advancement holds potential for applications in areas like journalism, film production, and robotics.

The team’s work focuses on improving a machine-learning model’s ability to align audio and visual data from video clips. Unlike previous methods, this new technique eliminates the need for manually labeled data, making the learning process more efficient and scalable.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities,” says Andrew Rouditchenko, an MIT graduate student and co-author of the research paper. He envisions integrating this technology into everyday tools like large language models, opening doors to new applications.

The researchers built on their earlier work, CAV-MAE (Contrastive Audio-Visual Masked Autoencoder), adjusting the training process so the model learns a finer-grained correspondence between video frames and the accompanying audio. They also refined the system’s architecture to better balance its two learning objectives, improving overall performance.

These enhancements led to greater accuracy in video retrieval and audiovisual scene classification tasks. The improved model, named CAV-MAE Sync, can precisely link a sound, such as a door slamming, to the corresponding visual event in a video clip.
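
In a cross-modal retrieval task like this, a clip-level audio embedding is typically compared against a bank of video embeddings and the closest matches are returned. The short sketch below illustrates that general idea with cosine-similarity ranking; the function name `retrieve_videos` and the plain nearest-neighbor lookup are illustrative assumptions, not the released CAV-MAE Sync interface.

```python
# Hypothetical illustration of audio-to-video retrieval in a shared embedding space:
# given an audio query, rank candidate video clips by cosine similarity.
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query_emb, video_embs, top_k=5):
    # audio_query_emb: (dim,) embedding of the query sound
    # video_embs: (num_clips, dim) embeddings of candidate video clips
    a = F.normalize(audio_query_emb, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    scores = v @ a                       # cosine similarity to every clip
    return torch.topk(scores, k=top_k)   # highest-scoring clips are the retrieval results
```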

CAV-MAE Sync splits the audio into smaller windows and generates a separate representation for each one, so the model can associate a specific video frame with the audio that occurs at that exact moment.
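
One way to picture this windowing is sketched below: the audio spectrogram is cut into one window per video frame, so each frame gets its own temporally aligned audio representation. The module, its placeholder encoders, and all dimensions are illustrative assumptions, not the authors’ architecture.

```python
# Minimal sketch (not the authors' code) of per-frame audio windowing:
# each video frame is paired with the slice of audio that overlaps it in time.
import torch
import torch.nn as nn

class FineGrainedAligner(nn.Module):
    def __init__(self, embed_dim=768, num_frames=16):
        super().__init__()
        self.num_frames = num_frames
        # Placeholder linear encoders; the real model uses transformer encoders.
        self.audio_encoder = nn.Linear(128, embed_dim)   # per-window spectrogram features
        self.video_encoder = nn.Linear(1024, embed_dim)  # per-frame visual features

    def forward(self, audio_spec, frame_feats):
        # audio_spec: (batch, time_bins, 128) mel-spectrogram, time_bins >= num_frames
        # frame_feats: (batch, num_frames, 1024) precomputed frame features
        b, t, f = audio_spec.shape
        window = t // self.num_frames
        # Split the audio into one window per video frame and pool within each window.
        audio_windows = audio_spec[:, : window * self.num_frames].reshape(
            b, self.num_frames, window, f
        ).mean(dim=2)                                  # (batch, num_frames, 128)
        audio_emb = self.audio_encoder(audio_windows)  # (batch, num_frames, embed_dim)
        video_emb = self.video_encoder(frame_feats)    # (batch, num_frames, embed_dim)
        # Each audio window can now be matched against its temporally aligned frame.
        return audio_emb, video_emb
```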

The model also incorporates contrastive and reconstruction objectives, enhanced by new data representations called “global tokens” and “register tokens,” which improve the model’s learning ability.
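
As a rough illustration of how a contrastive objective and a reconstruction objective can be combined, and how extra learnable tokens can be prepended to a token sequence, consider the sketch below. The loss weights, temperature, and token counts are assumptions chosen for illustration; the paper’s exact formulation may differ.

```python
# Hedged sketch: a symmetric contrastive (InfoNCE) loss plus a masked-reconstruction
# loss, and learnable "global" / "register" tokens prepended to the patch sequence.
import torch
import torch.nn.functional as F

def combined_loss(audio_emb, video_emb, recon, target,
                  weight_contrastive=1.0, weight_reconstruction=1.0, temperature=0.07):
    # audio_emb, video_emb: (batch, dim) pooled clip-level embeddings
    # recon, target: (batch, num_patches, patch_dim) autoencoder outputs and targets
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    # Matching audio/video pairs sit on the diagonal of the similarity matrix.
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    reconstruction = F.mse_loss(recon, target)            # masked-patch reconstruction
    return weight_contrastive * contrastive + weight_reconstruction * reconstruction

def add_special_tokens(patch_tokens, global_token, register_tokens):
    # patch_tokens: (batch, seq, dim); global_token: (1, 1, dim); register_tokens: (1, r, dim)
    b = patch_tokens.size(0)
    return torch.cat([global_token.expand(b, -1, -1),
                      register_tokens.expand(b, -1, -1),
                      patch_tokens], dim=1)
```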

In the future, the researchers aim to incorporate models that generate better data representations into CAV-MAE Sync. They also plan to enable the system to handle text data, which would be a crucial step toward creating an audiovisual large language model.
