Common Crawl Accused Of Giving Paywalled Content To AI Companies

If you’ve ever wondered how AI companies like Google, Anthropic, OpenAI, and Meta get their training data from paywalled publishers such as the New York Times, Wired, or the Washington Post, we may finally have an answer.

In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes. According to the report, Common Crawl, whose database spans multiple petabytes, has effectively opened a backdoor that allows AI companies to train their models on paywalled content from major news outlets. In a blog post published today, Common Crawl strongly denies the accusations.

The foundation’s website claims its data is collected from freely available webpages. But its executive director, Richard Skrenta, told The Atlantic he believes AI models should be able to access everything on the internet. “The robots are people too,” he said.

AI chatbots like ChatGPT and Google Gemini have sparked a crisis for the journalism industry. They scrape information from publishers and serve it directly to readers, taking clicks and visitors away from those publishers. This phenomenon has been called the traffic apocalypse and the AI armageddon. (Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

As stated in the Atlantic report, some news publishers have become aware of Common Crawl’s activities, and some have blocked the foundation’s scraper by adding an instruction to their website’s code. However, that only protects future content, not anything that’s already been scraped.
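
For readers curious what such a block looks like in practice, here is a minimal sketch. It assumes the instruction is a robots.txt rule, a common way to turn away crawlers, although this article does not name the mechanism. CCBot is the crawler name given in Common Crawl’s blog post; the example.com URL, the other bot name, and the helper function are hypothetical and purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that turns away CCBot and allows everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # hypothetical page
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", url))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

A rule like this only asks a crawler to stay away, and it only applies to future visits. It does nothing about pages that are already sitting in an archive.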

Multiple publishers have requested that Common Crawl remove their content from its archives. The foundation has said it is complying, albeit slowly, because of the sheer volume of data. One organization shared with The Atlantic multiple emails from Common Crawl stating that the removal process was “50 percent, 70 percent, and then 80 percent complete.” Yet Reisner found that none of those takedown requests seem to have been fulfilled, and that Common Crawl’s archives haven’t been modified since 2016.

Skrenta told The Atlantic that the file format used for storing the archives is “meant to be immutable,” meaning content can’t be deleted once it’s added. However, Reisner reports that the site’s public search tool, the only non-technical way to browse Common Crawl’s archives, returns misleading results for certain domains — masking the scope of what has been scraped and stored.

Mashable reached out to Common Crawl, and a team member pointed us to a public blog post from Skrenta. In it, Skrenta denied claims that the organization misled publishers, stating that its web crawler does not bypass paywalls. He also emphasized that Common Crawl is financially independent and “not doing AI’s dirty work.”

“The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has ‘lied to publishers’ about our activities,” the blog post says. It further states, “Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go ‘behind paywalls,’ do not log in to any websites, and do not employ any method designed to evade access restrictions.”

However, as Reisner reports, Common Crawl has previously received donations from OpenAI, Anthropic, and other AI-focused companies. It also lists NVIDIA as a “collaborator” on its website. Beyond collecting raw text, Reisner writes, the foundation also helps assemble and distribute AI training datasets — even hosting them for broader use.

Whatever the case, the fight over how the AI industry uses copyrighted material is far from over. OpenAI, for example, remains at the center of several lawsuits from major publishers, including the New York Times and Mashable’s parent company, Ziff Davis.
