Aya Vision 2025: A Deep Dive into Cohere For AI’s Multilingual Vision-Language Model

Key Points

  • Aya Vision, developed by Cohere For AI, is an open-source vision-language model supporting 23 languages, excelling in tasks like image captioning and visual question answering.
  • It seems likely that Aya Vision outperforms models like Llama-3.2 90B Vision and Qwen2.5-VL 72B, with win rates of 49-63% on AyaVisionBench.
  • Research suggests its use of synthetic annotations enhances efficiency, using fewer resources while maintaining competitive performance.
  • Aya Vision is available on platforms like Hugging Face and WhatsApp, making it accessible for global communication.

Introduction

Aya Vision is a cutting-edge tool in the world of artificial intelligence, designed to handle both images and text across 23 languages. Developed by Cohere For AI, it’s part of a broader effort to make AI more inclusive and effective for people worldwide. This blog post dives into what makes Aya Vision special, how it works, and why it matters in 2025.

Features and Performance

Aya Vision can do a lot, from describing images to answering questions about them and even translating content. It comes in two sizes, 8B and 32B parameters, with the 32B model showing strong results against considerably larger models such as Llama-3.2 90B Vision: it achieves win rates of 49-63% on AyaVisionBench, a new benchmark created by Cohere For AI for evaluating multilingual vision-language models.

Accessibility and Use

You can try Aya Vision on platforms like Hugging Face, where you can download and experiment with it, or even chat with it on WhatsApp (WhatsApp Integration). This makes it easy for researchers and everyday users to see what it can do, fostering global communication.
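For readers who want to experiment locally, here is a minimal sketch of querying the model through the Hugging Face transformers library. It assumes a recent transformers release with Aya Vision support and the model id CohereForAI/aya-vision-8b; check the model card for the exact identifier, hardware requirements, and license terms before running anything.

    # Minimal sketch: querying Aya Vision 8B via the transformers
    # image-text-to-text pipeline. The model id and pipeline support are
    # assumptions; the Hugging Face model card is the authoritative reference.
    from transformers import pipeline

    pipe = pipeline(
        "image-text-to-text",
        model="CohereForAI/aya-vision-8b",
        device_map="auto",          # place weights on a GPU if one is available
    )

    # Chat-style message mixing an image and a question (here in Spanish,
    # since the model is multilingual).
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/street-sign.jpg"},
                {"type": "text", "text": "¿Qué dice el letrero de la imagen?"},
            ],
        }
    ]

    result = pipe(text=messages, max_new_tokens=128)
    print(result[0]["generated_text"])   # generated reply, appended to the conversation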

Unexpected Detail: Efficiency Through Synthetic Data

One interesting aspect is how Aya Vision is trained on synthetic annotations, that is, annotation data generated by AI models rather than written by humans. As noted in a TechCrunch article, this approach helps save resources while keeping performance high, which is a real advantage for researchers with limited computing power (TechCrunch Article).
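The post does not describe Cohere’s exact data pipeline, but the general shape of synthetic annotation can be sketched as follows. Every function below is a placeholder standing in for whatever teacher models, translators, and quality filters are actually used; the sketch only illustrates the idea of generating and filtering AI-written annotations instead of collecting human labels.

    # Illustrative sketch of a synthetic-annotation loop for multimodal training
    # data. The helper functions are placeholders, not Cohere's actual pipeline.

    def draft_caption(image_path):
        """Placeholder for a teacher VLM that drafts an English annotation."""
        return f"a synthetic caption describing {image_path}"

    def translate(text, target_language):
        """Placeholder for a machine-translation or rephrasing step."""
        return f"[{target_language}] {text}"

    def quality_score(image_path, caption):
        """Placeholder filter; a real pipeline might score image-text alignment."""
        return 1.0 if caption else 0.0

    def build_synthetic_annotations(image_paths, languages, min_quality=0.7):
        """Produce (image, language, caption) training triples with AI-written text."""
        examples = []
        for image_path in image_paths:
            draft = draft_caption(image_path)                      # 1. draft in English
            for lang in languages:
                caption = translate(draft, target_language=lang)   # 2. cover target languages
                if quality_score(image_path, caption) >= min_quality:  # 3. filter weak pairs
                    examples.append({"image": image_path, "lang": lang, "text": caption})
        return examples

    print(build_synthetic_annotations(["dog.jpg"], ["hi", "ar", "fr"])[0])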

Comprehensive Analysis of Aya Vision by Cohere For AI

Overview and Background

In the dynamic field of artificial intelligence, the demand for models that can process both visual and textual information across multiple languages is growing. Aya Vision, a state-of-the-art vision-language model (VLM) introduced by Cohere For AI on March 4, 2025, addresses this need with support for 23 languages. The model is part of Cohere’s broader Aya project aimed at advancing multilingual AI (Aya Page).

Features and Capabilities

Aya Vision is versatile, capable of tasks such as image captioning, visual question answering, text generation, and translating both text and images across its 23 supported languages. The model is available in two parameter sizes: 8B and 32B, each tailored for different levels of complexity (RoboFlow Analysis).

Technical Details and Training

Technically, Aya Vision uses the SigLIP2 patch14-384 vision encoder and relies on synthetic annotations for training efficiency. This lets it reach high performance with fewer resources, in line with a broader industry shift toward synthetic data: Gartner estimated in 2024 that about 60% of AI training data is synthetic (TechCrunch).
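To make the architecture concrete, the sketch below shows the standard way a vision encoder is wired into a language model in VLMs of this kind: patch embeddings from the encoder are projected into the LLM’s embedding space and concatenated with the text tokens. The patch count, hidden sizes, and two-layer projector are illustrative assumptions (typical of a patch14/384 SigLIP-family encoder), not Aya Vision’s published internals.

    # Generic VLM wiring: vision-encoder patches -> projector -> LLM token stream.
    # All dimensions are illustrative assumptions, not Aya Vision's actual config.
    import torch
    import torch.nn as nn

    class VisionToTextProjector(nn.Module):
        """Maps vision-encoder patch embeddings into the LLM embedding space."""
        def __init__(self, vision_dim=1152, llm_dim=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_embeddings):      # (batch, num_patches, vision_dim)
            return self.proj(patch_embeddings)    # (batch, num_patches, llm_dim)

    # Toy forward pass: ~729 patches from a 384x384 image become "image tokens"
    # that are concatenated with text-token embeddings before the language model.
    patches = torch.randn(1, 729, 1152)           # stand-in for vision-encoder output
    image_tokens = VisionToTextProjector()(patches)
    text_tokens = torch.randn(1, 32, 4096)        # stand-in for text prompt embeddings
    llm_input = torch.cat([image_tokens, text_tokens], dim=1)
    print(llm_input.shape)                        # torch.Size([1, 761, 4096])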

Performance and Benchmarking

The 32B model outperforms models like Llama-3.2 90B Vision and Qwen2.5-VL 72B, with win rates of 49-63% on AyaVisionBench and 52-72% on mWildVision (AyaVisionBench). AyaVisionBench evaluates VLMs across nine task categories in 23 languages.
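A win rate here is simply the fraction of head-to-head comparisons, judged prompt by prompt, in which one model’s answer is preferred over the other’s. The sketch below shows the arithmetic; the judging function is a trivial placeholder, whereas AyaVisionBench’s real evaluation uses a model-based judge over its nine task categories and 23 languages.

    # Illustrative win-rate arithmetic for pairwise model comparison.
    # judge_prefers_a is a toy placeholder, not the AyaVisionBench judge.

    def judge_prefers_a(answer_a, answer_b, prompt):
        """Toy judge that prefers the longer answer; real judges are models."""
        return len(answer_a) > len(answer_b)

    def win_rate(answers_a, answers_b, prompts):
        wins = sum(
            judge_prefers_a(a, b, p)
            for a, b, p in zip(answers_a, answers_b, prompts)
        )
        return 100.0 * wins / len(prompts)

    # Example: model A is preferred on 2 of 3 prompts -> a 66.7% win rate.
    prompts = ["Describe the image.", "What does the sign say?", "Caption this in Hindi."]
    answers_a = ["A detailed description...", "It says STOP.", "ok"]
    answers_b = ["short", "unclear", "A long but irrelevant answer..."]
    print(f"{win_rate(answers_a, answers_b, prompts):.1f}% win rate for model A")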

Applications and Accessibility

Aya Vision is integrated into platforms like WhatsApp and available on Hugging Face (Aya Vision 8B, Aya Vision 32B), fostering global use under a CC BY-NC 4.0 license (VentureBeat).

Community and Future Directions

The community is exploring creative applications, like AI podcasts (Kokoro Podcast Generator), with potential for expanding language support in the future.

Performance Comparison Table

Model                   AyaVisionBench Win Rate (%)   mWildVision Win Rate (%)
Aya Vision 32B          49-63                         52-72
Llama-3.2 90B Vision    (comparison baseline)         (comparison baseline)
Molmo 72B               (comparison baseline)         (comparison baseline)
Qwen2.5-VL 72B          (comparison baseline)         (comparison baseline)
Aya Vision 8B           Up to 79                      Up to 81

Win rates are for the Aya Vision models in pairwise comparisons against the baseline models listed.
