Harvard University has announced the release of a high-quality dataset comprising nearly one million public-domain books, offering a valuable resource for training AI models. Developed by Harvard’s Institutional Data Initiative (IDI) with funding from Microsoft and OpenAI, the dataset includes works from the Google Books project that are no longer under copyright.
This new collection is significantly larger than the Books3 dataset, which was used to train models like Meta’s Llama. The database spans a diverse range of genres, time periods, and languages, featuring classics by Shakespeare and Dante alongside specialized texts like Welsh dictionaries and Czech math guides. Greg Leppert, IDI’s executive director, said the project aims to democratize access to premium training data traditionally available only to major tech firms. “It’s undergone extensive review,” Leppert added.
Leppert likened the dataset to Linux, suggesting it could serve as a foundation for AI development while still requiring additional data for differentiation. Microsoft’s Burton Davis echoed this sentiment, describing the initiative as aligned with the company’s goal to create “accessible data pools” for AI innovation. Although Microsoft isn’t replacing its proprietary training data with public-domain content, Davis emphasized the potential of resources like Harvard’s database for fostering innovation.
Amid ongoing lawsuits over the use of copyrighted content for AI training, the future of such practices remains uncertain. Regardless, the teams behind projects like the IDI dataset anticipate growing demand for public-domain training materials that sidestep those legal risks. Harvard is also collaborating with the Boston Public Library to digitize public-domain newspaper articles, with plans for more partnerships.
Details of the dataset’s release have yet to be finalized, though Harvard has sought Google’s collaboration for distribution, and Google’s Kent Walker affirmed the company’s support for the project. Similar initiatives, such as France’s Common Corpus, highlight a broader shift toward public-domain resources. Ed Newton-Rex, an advocate for ethical AI, argued that datasets like Harvard’s refute claims that copyrighted materials are essential for training effective models. However, he cautioned that these resources must replace, not merely supplement, unauthorized data if they are to truly reshape AI development practices.