World of Software
Common Crawl accused of giving paywalled content to AI companies

News Room
Last updated: 2025/11/10 at 12:03 AM
Published: 10 November 2025

If you’ve ever wondered how AI companies like Google, Anthropic, OpenAI, and Meta get their training data from paywalled publishers such as the New York Times, Wired, or the Washington Post, we may finally have an answer.

In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes. According to the report, Common Crawl, whose database spans multiple petabytes, has effectively opened a backdoor that allows AI companies to train their models on paywalled content from major news outlets. In a blog post published today, Common Crawl strongly denies the accusations.

The foundation’s website claims its data is collected from freely available webpages. But its executive director, Richard Skrenta, told The Atlantic he believes AI models should be able to access everything on the internet. “The robots are people too,” Skrenta told The Atlantic.

AI chatbots like ChatGPT and Google Gemini have sparked a crisis for the journalism industry. AI chatbots scrape information from publishers and share this information directly with readers, taking clicks and visitors away from those publishers. This phenomenon has been called the traffic apocalypse and the AI armageddon. (Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

As stated in the Atlantic report, some news publishers have become aware of Common Crawl’s activities, and some have blocked the foundation’s scraper by adding an instruction to their website’s code. However, that only protects future content, not anything that’s already been scraped.
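
The "instruction" publishers add is presumably a robots.txt rule aimed at Common Crawl's crawler, which identifies itself as CCBot. The sketch below, using Python's standard `urllib.robotparser`, shows how such a rule would be read by a crawler that honors it. The robots.txt contents, the `publisher.example` URL, the `ExampleBot` agent, and the `can_crawl` helper are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (illustrative rules only):
# block CCBot from the whole site, allow every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on the example publisher's site.
ARTICLE_URL = "https://publisher.example/news/some-article"

ccbot_allowed = can_crawl(ROBOTS_TXT, "CCBot", ARTICLE_URL)
other_allowed = can_crawl(ROBOTS_TXT, "ExampleBot", ARTICLE_URL)

print("CCBot allowed:", ccbot_allowed)
print("ExampleBot allowed:", other_allowed)
```

A rule like this is only a request that well-behaved crawlers choose to follow, and it applies to future crawls. It does nothing about pages that were already collected.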

Multiple publishers have requested that Common Crawl remove their content from its archives. The foundation has said it is complying, albeit slowly, due to the sheer volume of data. One organization shared with The Atlantic multiple emails from Common Crawl reporting that the removal process was “50 percent, 70 percent, and then 80 percent complete.” Yet Reisner found that none of those takedown requests seem to have been fulfilled, and that Common Crawl’s archives haven’t been modified since 2016.

Skrenta told The Atlantic that the file format used for storing the archives is “meant to be immutable,” meaning content can’t be deleted once it’s added. However, Reisner reports that the site’s public search tool, the only non-technical way to browse Common Crawl’s archives, returns misleading results for certain domains — masking the scope of what has been scraped and stored.

Mashable reached out to Common Crawl, and a team member pointed us to a public blog post from Skrenta. In it, Skrenta denied claims that the organization misled publishers, stating that its web crawler does not bypass paywalls. He also emphasized that Common Crawl is financially independent and “not doing AI’s dirty work.”

“The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has ‘lied to publishers’ about our activities,” the blog post says. It further states, “Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go ‘behind paywalls,’ do not log in to any websites, and do not employ any method designed to evade access restrictions.”

However, as Reisner reports, Common Crawl has previously received donations from OpenAI, Anthropic, and other AI-focused companies. It also lists NVIDIA as a “collaborator” on its website. Beyond collecting raw text, Reisner writes, the foundation also helps assemble and distribute AI training datasets — even hosting them for broader use.

Whatever the case, the fight over how the AI industry uses copyrighted material is far from over. OpenAI, for example, remains at the center of several lawsuits from major publishers, including the New York Times and Mashable’s parent company, Ziff Davis.
