Social media giant Reddit has filed a lawsuit against artificial intelligence company Perplexity, alleging that it illegally collected user posts to train AI models. The suit marks the latest clash over data rights between content owners and the AI industry.
The complaint, filed Wednesday in New York federal court, also names three other defendants: Lithuanian data scraper Oxylabs, “former Russian botnet” AWMProxy, and Texas startup SerpApi.
Reddit claimed that the three smaller entities were able to extract copyrighted content “by concealing their identities, hiding their locations, and disguising their web scrapers as ordinary people.”
Perplexity, which operates an AI-powered search engine, has denied the allegations and accused Reddit of “extortion” and opposing an open internet, while SerpApi told CNBC it “strongly disagrees” with Reddit’s claims and intends to defend itself in court.
The lawsuit is one of many filed by content owners accusing AI companies of using copyrighted material without permission to train large language models. Reddit in particular has been at the forefront of that fight, filing a similar lawsuit, still ongoing, against AI startup Anthropic in June. CNBC was unable to reach Oxylabs and AWMProxy for comment.
In a statement shared with CNBC, Ben Lee, Reddit’s chief legal officer, said AI companies are “engaged in an arms race for quality human content” and that pressure is fueling “an industrial-scale ‘data laundering’ economy.”
Scrapers bypass technical protections to steal data and sell it to clients who want training materials. Reddit is a prime target because it has one of the largest and most dynamic collections of human conversations ever created.
Reddit, which hosts more than 100,000 interest-based communities known as “subreddits,” said in the lawsuit that user posts became the most commonly cited source for AI-generated answers on Perplexity.
It added that after the company sent Perplexity a cease-and-desist letter, Perplexity increased its volume of citations to Reddit “40 times.”
AI researchers have previously noted that Reddit’s large volume of moderated conversations could help AI chatbots generate more natural responses.
In the age of artificial intelligence, Reddit has sought to leverage its large pool of data and has granted access to it only through AI-related licensing agreements. The social media company has entered into such agreements with OpenAI and Alphabet’s Google.
In response to the lawsuit, Perplexity said in a post on the Reddit platform that it does not train its AI models on Reddit content, but merely summarizes and cites public discussions on Reddit. As a result, it said, it was “impossible” to conclude a license agreement.
“After we explained this to you a year ago, Reddit insisted we pay anyway, even though we had legitimate access to Reddit’s data. Bowing to strong-arm tactics is not how we do business,” the statement said, going on to describe the lawsuit as “a show of force in Reddit’s training data negotiations with Google and OpenAI.”
“Perplexity believes this is a sad example of what happens when public data becomes a large part of a public company’s business model,” it added, noting that data licensing has become an increasingly important revenue source for Reddit.
Jen Wong, Reddit’s chief operating officer, told trade publication Adweek in February that AI licensing deals with Google and OpenAI account for nearly 10% of Reddit’s revenue.