IISc and Hugging Face Team Up: Revolutionizing AI with Project Vaani in 2025

A Partnership for Inclusive AI

The Indian Institute of Science (IISc) and Hugging Face have partnered to widen access to Project Vaani, an initiative that records India’s linguistic diversity. Launched in 2022 with Google’s backing, Vaani aims to build an open-source, multimodal, multilingual dataset that reflects how India’s 1.4 billion people speak. In 2025, the collaboration with Hugging Face extends the project’s global reach, so that developers anywhere can build AI systems suited to India’s languages and cultures.

Both IISc and Hugging Face share a vision of democratizing AI through open science. By making Vaani accessible on Hugging Face’s platform, they’re breaking barriers, fostering innovation, and ensuring AI speaks the languages of India—from bustling cities to remote villages.

Vaani Unveiled: A Dataset Like No Other

Project Vaani stands out with its geo-centric approach, collecting spontaneous speech and images from 80 districts across India in Phase 1 alone. As of February 2025, it boasts over 16,000 hours of audio from 84,600 speakers, covering 54 languages, with 790 hours transcribed. This isn’t just data—it’s a living archive of dialects, accents, and real-life conversations, paired with 70,000 images for multimodal applications. From Tamil in the south to Assamese in the northeast, Vaani captures India’s linguistic soul.
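To put those headline figures in proportion, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above; since the audio total is stated as "over 16,000 hours", the results are rough approximations, and the function and key names are purely illustrative.

```python
# Headline figures quoted in the paragraph above (as of February 2025).
AUDIO_HOURS = 16_000        # stated as "over 16,000", so treat as a floor
TRANSCRIBED_HOURS = 790
SPEAKERS = 84_600
DISTRICTS = 80


def summarize(audio_hours, transcribed_hours, speakers, districts):
    """Derive a few simple ratios from the headline dataset figures."""
    return {
        "transcribed_pct": 100 * transcribed_hours / audio_hours,
        "minutes_per_speaker": 60 * audio_hours / speakers,
        "hours_per_district": audio_hours / districts,
        "speakers_per_district": speakers / districts,
    }


stats = summarize(AUDIO_HOURS, TRANSCRIBED_HOURS, SPEAKERS, DISTRICTS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
# transcribed_pct: 4.9
# minutes_per_speaker: 11.3
# hours_per_district: 200.0
# speakers_per_district: 1057.5
```

In other words, roughly 5% of the audio is transcribed so far, and each speaker contributes about 11 minutes on average, which is why the corpus is broad across voices rather than deep on any single one.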

Hugging Face hosts this treasure trove, offering subsets like transcribed audio for speech recognition and raw data for broader research. It’s a goldmine for building AI that understands India’s diversity, available to anyone with a Hugging Face account and an access token.
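For readers who want to try it, the following is a minimal sketch of what token-based access might look like with the Hugging Face `datasets` library. The repository id and the subset name passed in are assumptions, not details taken from this article, so check the dataset card for the actual values. The helper names `resolve_token` and `stream_vaani` are hypothetical.

```python
import os

# Assumption: the dataset is published under this repository id on the Hub.
VAANI_REPO_ID = "ARTPARK-IISc/Vaani"


def resolve_token(env_var="HF_TOKEN"):
    """Read the access token from the environment and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face access token first.")
    return token


def stream_vaani(config_name, split="train", repo_id=VAANI_REPO_ID):
    """Open one subset in streaming mode so nothing is downloaded up front.

    config_name is the subset to load; its valid values are listed on the
    dataset card and are not specified in this article.
    """
    # Imported lazily so resolve_token() works even without the library installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        config_name,
        split=split,
        streaming=True,
        token=resolve_token(),
    )
```

With a token exported as `HF_TOKEN` and access granted on the dataset page, calling `next(iter(stream_vaani(subset_name)))` would fetch a single record for inspection. Streaming is used here because the full corpus runs to thousands of hours of audio.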

Why This Matters for AI Development

India’s 22 constitutionally scheduled languages and hundreds of dialects pose a unique challenge, and an opportunity, for AI. Most language models lean heavily on English and a handful of other widely spoken languages, leaving Indic languages underrepresented. Vaani addresses that gap by providing a large dataset for training models in speech recognition, language modeling, and even speaker verification. With its vast speaker pool and real-world audio, it’s well suited to building AI that’s not just smart but inclusive.

The collaboration amplifies this impact. Hugging Face’s platform, with over 1 million models and datasets, ensures Vaani reaches a global audience, sparking research and applications that could transform digital access in India—from education tools to voice assistants.

Beyond Phase 1: The Road Ahead

Phase 1 is just the beginning. IISc and ARTPARK, with Google’s support, have expanded Vaani to Phase 2, covering all Indian states as of February 2025. The goal? Over 150,000 hours of speech, fully transcribed in local scripts, reflecting India’s urban-rural, age, and gender diversity. Hugging Face’s role will grow, hosting new subsets and encouraging community feedback via vaanicontact@gmail.com to refine and expand the project.

This partnership isn’t static—it’s a call to action. Developers, researchers, and innovators are urged to dive in, build with Vaani, and share insights, driving AI that truly serves India’s billion-plus population.

A Vision for the Future

As of March 16, 2025, the IISc-Hugging Face collaboration is a beacon for open-source AI. It’s not just about data—it’s about empowerment, bridging digital divides, and honoring India’s heritage through technology. Whether it’s enhancing multimodal large language models or crafting code-switching speech systems, Vaani and Hugging Face are paving the way for an AI future that’s as diverse as the world it serves.