
Nvidia piracy-data claim fuels AI copyright fights, rules diverge by country

By Aoife Brennan

  • Nvidia faces growing copyright infringement allegations over use of pirated book data to train AI
  • Copyright lawsuits over generative AI training continue to pile up worldwide
  • Regulatory gaps widen by country as the U.S., Japan, and the EU move toward looser rules

Allegations have surfaced that Nvidia’s board approved a plan to use the world’s largest database of pirated books to train artificial intelligence (AI) models. As the generative AI market expands rapidly, copyright disputes over AI training data continue to accumulate. With legal standards and regulatory intensity varying widely across countries, market uncertainty is expected to persist for some time.

Did Nvidia use illegal data to train AI?

On the 20th (local time), the Japanese outlet Gigazine reported, citing an amended complaint filed in a class-action lawsuit against Nvidia, that Nvidia teams allegedly held direct talks with Anna’s Archive, a pirate book site that describes itself as “the largest shadow library in human history.” The complaint states that Nvidia sought to collect pirated books through the site to secure the large-scale text data needed for AI training.

The controversy traces back to lawsuits filed in 2024 by multiple authors who accused Nvidia of training its AI models on the pirated book dataset Books3. At the time, Nvidia argued that “books are merely probabilistic correlations to an AI model” and that using them for training falls under fair use. Plaintiffs pushed back, saying Nvidia deliberately infringed copyrights to secure data amid intense AI competition, and the amended complaint adds the allegation that Nvidia contacted Anna’s Archive to discuss ways to obtain data for AI preprocessing.

Anna’s Archive is said to have told Nvidia that providing high-speed access would require fees in the tens of thousands of dollars. Plaintiffs allege that Nvidia, despite knowing the data had been collected illegally, sought internal approval to obtain as much as 500TB of it. The complaint also claims Nvidia went beyond internal use, providing scripts and related tools that let customers automatically download The Pile, a large dataset that includes Books3.

Rising copyright disputes tied to AI

Copyright controversies involving AI companies have been steadily accumulating for some time. Major media groups including The New York Times, Dow Jones, and Ziff Davis have filed copyright lawsuits against OpenAI, arguing that ChatGPT was trained on their articles without permission and reproduces or summarizes their content. AI search engine startup Perplexity was sued by The New York Times on similar grounds last month, and Japan’s Yomiuri Shimbun and Asahi Shimbun are also reported to have brought lawsuits against Perplexity on the same basis.

Numerous disputes have also emerged in the arts. Image-generation startup Midjourney has been sued by major studios including Disney, Universal, and Warner Bros. for allegedly training its models on iconic characters without authorization and generating similar images. In the UK, Getty Images filed a copyright infringement lawsuit against Stability AI, claiming its images were used without consent to train Stable Diffusion. Anthropic, the developer of the generative AI model Claude, reached a settlement last September, agreeing to pay creators at least $1.5 billion in a copyright lawsuit over AI training data.

Germany’s music copyright collecting society GEMA has likewise sued OpenAI, alleging that the company used German song lyrics as training data without licensing agreements or royalty payments. OpenAI argued that training on lyrics amounts only to sequential analysis and probabilistic combinations, claiming the society misunderstood how ChatGPT works. The court, however, sided with GEMA, ruling that OpenAI’s unauthorized storage of the lyrics and their reproduction on demand constituted unlicensed copying and performance. It ordered OpenAI to refrain from storing or outputting the lyrics, to pay damages, and to disclose records of lyric usage and the revenues generated from them.

Moves toward a more permissive stance in major economies

In some countries, including the United States, court rulings have increasingly found that using copyrighted works for AI training without explicit consent can qualify as “fair use.” Fair use is a concept under U.S. copyright law that allows limited use of protected works without the creator’s permission, and it has become a central legal defense for technology companies. In one notable case, three authors including Andrea Bartz filed a class-action copyright lawsuit against Anthropic, the U.S. AI company behind the large language model Claude. The plaintiffs alleged that Anthropic had unlawfully collected millions of copyrighted works from pirated e-book sites and used them to train its AI models for commercial gain, infringing creators’ rights.

U.S. District Judge William Alsup of the San Francisco federal court, who presided over the case, addressed the issues separately. He ruled that Anthropic’s illegal downloading of more than 7 million books to build a centralized digital library constituted copyright infringement. However, he found that training AI models on copyrighted books without permission was “highly transformative” and therefore qualified as fair use. In the same month, Judge Vince Chhabria of the same court issued a similar ruling in a copyright case involving Meta. Thirteen authors claimed that Meta had unlawfully trained its Llama AI model on their works, harming their market value, but Chhabria ruled that the training was highly transformative and did not amount to copyright infringement. As precedents interpreting fair use in ways favorable to Big Tech accumulate, creators’ intellectual property protections are steadily being weakened.

This permissive trend extends beyond the United States. Japan revised its copyright law in 2018 to exempt the use of lawfully accessible works from infringement liability, provided such use does not unreasonably harm the interests of copyright holders. Singapore has introduced, alongside its fair use provision, a separate exception under which using works for computational data analysis and related preparatory activities is not treated as copyright infringement. The European Union, which has traditionally favored strict regulation of advanced industries, also allows commercial text and data mining of lawfully accessible works. In November last year, the EU went further, proposing through its Digital Omnibus Package to ease certain restrictions on the use of sensitive data in limited cases, such as correcting AI bias, while clarifying rules on personal data processing.

Aoife Brennan is a contributing writer for The Economy, with a focus on education, youth, and societal change. Based in Limerick, she holds a degree in political communication from Queen’s University Belfast. Aoife’s work draws connections between cultural narratives and public discourse in Europe and Asia.