EleutherAI Unveils Massive Open-Source AI Training Dataset, Challenging Copyrighted Data Dominance

EleutherAI, a prominent AI research organization, has announced the release of the Common Pile v0.1, a vast collection of openly licensed and public-domain text intended to serve as a powerful resource for training artificial intelligence models. This significant development aims to provide an alternative to the controversial practice of training AI on copyrighted material.

The Common Pile v0.1, the result of a two-year collaborative effort involving AI startups like Poolside and Hugging Face, along with academic institutions, weighs in at a staggering 8 terabytes. This dataset was instrumental in training EleutherAI’s new AI models, Comma v0.1-1T and Comma v0.1-2T. EleutherAI asserts that these models achieve performance levels comparable to those trained on unlicensed, copyrighted data, challenging the necessity of using copyrighted material for optimal AI development.

The release comes amid ongoing legal battles surrounding AI companies’ training practices. Companies like OpenAI face lawsuits over their reliance on web scraping, which often includes copyrighted material such as books and research journals, to build training datasets. While some AI firms have licensing agreements with content providers, many claim protection under the fair use doctrine.

EleutherAI argues that these legal challenges have significantly reduced transparency within the AI industry, hindering the broader research community’s ability to understand model functionality and potential flaws.

Stella Biderman, Executive Director of EleutherAI, stated in a blog post on Hugging Face, “[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in.” She further noted that researchers at some companies have cited these lawsuits as a reason for their inability to release research in data-centric areas.

The Common Pile v0.1, available for download on Hugging Face and GitHub, was created with legal consultation and incorporates sources like 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also utilized OpenAI’s Whisper, an open-source speech-to-text model, to transcribe audio content.

EleutherAI highlights Comma v0.1-1T and Comma v0.1-2T as proof that the Common Pile v0.1 is curated effectively enough to enable the development of models that rival proprietary alternatives. These models, with 7 billion parameters each, were trained on only a fraction of the Common Pile v0.1 and reportedly compete with models like Meta’s Llama AI on benchmarks including coding, image understanding, and math.

Biderman emphasized, “In general, we think that the common idea that unlicensed text drives performance is unjustified. As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 also serves as a corrective to EleutherAI's earlier work. The organization previously released The Pile, an open collection of training text that included copyrighted material, which drew controversy and legal scrutiny for companies that used it to train AI models.

EleutherAI has committed to more frequent releases of open datasets in the future, collaborating with research and infrastructure partners to expand the availability of high-quality, legally sound training data.

Updated: Biderman clarified on X that the University of Toronto played a key role in leading the research and development of the datasets and models, emphasizing the collaborative nature of the project.
