Two American authors have filed a lawsuit against OpenAI, the company behind ChatGPT, claiming that it used their works to train its artificial intelligence without obtaining copyright permission.
In the lawsuit, Massachusetts-based authors Paul Tremblay and Mona Awad allege that OpenAI copied their books without their consent to train ChatGPT, in violation of their copyright. Tremblay’s work includes “A Head Full of Ghosts,” while Awad is known for “13 Ways of Looking at a Fat Girl” and “Bunny.”
The two authors claim that ChatGPT generates highly accurate summaries of their published books, which they say indicates that their works were included in the data used to train ChatGPT.
Chatbots are trained on large amounts of text. OpenAI has not disclosed the specific data used to train ChatGPT, but the company says that it generally crawls web data, including archives and Wikipedia. Books are an ideal choice for training artificial intelligence because they often contain “high-quality, well-edited long articles” that capture the essence of human thought.
According to the lawsuit, OpenAI’s training data includes more than 300,000 books, some of them drawn from controversial “shadow libraries” whose copyright status is unclear.
Proving how and where ChatGPT collected this information, and whether the authors suffered financial losses, may be difficult, because ChatGPT is also trained on a large amount of web data, including internet users’ discussions of these books.
The lawsuit was filed on behalf of copyright owners across the United States and seeks an undisclosed amount of compensation. OpenAI representatives have not yet commented on the matter.
Andres Guadamuz, an expert in intellectual property law at the University of Sussex, said this is the first copyright lawsuit against ChatGPT. He added that the lawsuit will explore the “legal boundaries” of generative artificial intelligence.
Just a few days earlier, OpenAI was also sued in California for allegedly stealing and misusing large amounts of private data from the internet to train ChatGPT without permission.