A federal judge in San Francisco ruled that Anthropic’s use of copyrighted books to train its AI system constitutes fair use under US copyright law. 

Judge William Alsup found that Anthropic’s AI training was “exceedingly transformative”, as the company used the texts to analyse writing, extract uncopyrightable information, and develop new technology, which aligns with the purpose of copyright law to foster creativity.

The judge found that training a large language model (LLM) using text is “transformative” because the purpose is not to reproduce or distribute the books, but to enable the model to learn statistical relationships between words and generate new text based on that understanding.

The training process was compared to a person reading books to learn how to write, not to copy them verbatim.

Moreover, there was no evidence that Anthropic’s model reproduced the books or created outputs that would directly substitute for the original works. “No output to the public was even alleged to be infringing,” the court noted.

According to the order, Anthropic “downloaded for free millions of copyrighted books in digital form from pirate sites on the internet”, including sources like LibGen and Pirate Library Mirror (PiLiMi). The company also “purchased copyrighted books”, removed bindings, scanned them, and added them to a searchable, digital archive intended to retain everything forever.

However, the ruling also distinguished between legally acquired books and pirated copies. Alsup ruled that Anthropic’s downloading and storing of over seven million pirated books in a “central library” infringed the authors’ copyrights and was not protected as fair use. 

Alsup wrote that while copies used to train models may qualify as fair use, almost any unauthorised copying to build the central library, especially from pirated sources, would have been too much.

The judge ordered a trial, scheduled for December, to determine damages for this infringement, which could involve substantial statutory penalties.

The case originated from a class action lawsuit filed by authors who alleged that Anthropic used unauthorised copies of their books to train its Claude LLM without permission or compensation. 

While the court recognised the legality of training AI on lawfully obtained works, it criticised Anthropic for relying on pirated materials, noting that acquiring books through piracy is not reasonably necessary for fair use and undermines copyright protections.

The ruling is seen as a significant but complex precedent for the AI industry. It affirms that AI training on copyrighted works can be fair use if done with legally obtained materials, but also signals that companies must avoid piracy to limit legal risks. The decision is expected to influence ongoing and future copyright disputes involving AI, though appeals and further litigation are likely.

OpenAI is facing several legal challenges over alleged unauthorised use of copyrighted material to train its LLMs. A key lawsuit in New York involves The Authors Guild and well-known writers such as George RR Martin and Jodi Picoult, who argue that OpenAI used their works without permission, jeopardising their income from original writing. 

In a separate case, The New York Times has taken legal action against both OpenAI and Microsoft, accusing them of using millions of its articles to develop AI systems that now act as direct competitors in delivering news and information.

Meta has also come under legal scrutiny, with authors like Richard Kadrey and Sarah Silverman claiming the company used vast collections of copyrighted content, sourced from pirate libraries such as LibGen and Sci-Hub, to train its models.

Meanwhile, Stability AI is being sued by Getty Images, which alleges that the company copied millions of its photos and metadata to train its AI image tools. The complaint includes claims that the system produced images featuring Getty’s watermark. Separately, Midjourney is facing legal pressure from media giants, including Disney and Universal, for allegedly generating AI images that replicate their copyrighted material.

The post Anthropic Proves in Court Using Copyright Content for AI Training is Fair appeared first on Analytics India Magazine.