World of Software

Common Crawl accused of giving paywalled content to AI companies

News Room
Published 10 November 2025
Last updated: 2025/11/10 at 12:03 AM

If you’ve ever wondered how AI companies like Google, Anthropic, OpenAI, and Meta get their training data from paywalled publishers such as the New York Times, Wired, or the Washington Post, we may finally have an answer.

In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes. According to the report, Common Crawl, whose database spans multiple petabytes, has effectively opened a backdoor that allows AI companies to train their models on paywalled content from major news outlets. In a blog post published today, Common Crawl strongly denies the accusations.

The foundation’s website claims its data is collected from freely available webpages. But its executive director, Richard Skrenta, told The Atlantic he believes AI models should be able to access everything on the internet. “The robots are people too,” Skrenta told The Atlantic.

AI chatbots like ChatGPT and Google Gemini have sparked a crisis for the journalism industry. They scrape information from publishers and share it directly with readers, taking clicks and visitors away from those publishers. This phenomenon has been called the traffic apocalypse and the AI armageddon. (Disclosure: Ziff Davis, Mashable’s parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

As stated in the Atlantic report, some news publishers have become aware of Common Crawl’s activities, and some have blocked the foundation’s scraper by adding an instruction to their website’s code. However, that only protects future content, not anything that’s already been scraped.
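
The report does not spell out what that instruction looks like. A common way for a site to opt out of a crawler is a rule in its robots.txt file, and the sketch below assumes that mechanism. It uses Python's standard `urllib.robotparser` to check whether a given robots.txt would turn away a crawler that identifies itself as CCBot. The sample file contents, the helper name `is_crawler_allowed`, and the example.com URL are all hypothetical and are not taken from any publisher named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from a real publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL used only to exercise the rules above.
ARTICLE_URL = "https://example.com/news/story"

ccbot_allowed = is_crawler_allowed(SAMPLE_ROBOTS_TXT, "CCBot", ARTICLE_URL)
other_allowed = is_crawler_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

A rule like this only asks a crawler to stay away on future visits, and it relies on the crawler choosing to honor it. It does nothing about copies already sitting in an archive.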

Multiple publishers have requested that Common Crawl remove their content from its archives. The foundation has stated that it is complying, albeit slowly, due to the sheer volume of data. One organization shared multiple emails from Common Crawl with The Atlantic stating that the removal process was “50 percent, 70 percent, and then 80 percent complete.” Yet Reisner found that none of those takedown requests seem to have been fulfilled, and that Common Crawl’s archives haven’t been modified since 2016.

Skrenta told The Atlantic that the file format used for storing the archives is “meant to be immutable,” meaning content can’t be deleted once it’s added. However, Reisner reports that the site’s public search tool, the only non-technical way to browse Common Crawl’s archives, returns misleading results for certain domains — masking the scope of what has been scraped and stored.

Mashable reached out to Common Crawl, and a team member pointed us to a public blog post from Skrenta. In it, Skrenta denied claims that the organization misled publishers, stating that its web crawler does not bypass paywalls. He also emphasized that Common Crawl is financially independent and “not doing AI’s dirty work.”

“The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has ‘lied to publishers’ about our activities,” the blog post says. It further states, “Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go ‘behind paywalls,’ do not log in to any websites, and do not employ any method designed to evade access restrictions.”

However, as Reisner reports, Common Crawl has previously received donations from OpenAI, Anthropic, and other AI-focused companies. It also lists NVIDIA as a “collaborator” on its website. Beyond collecting raw text, Reisner writes, the foundation also helps assemble and distribute AI training datasets — even hosting them for broader use.

Whatever the case, the fight over how the AI industry uses copyrighted material is far from over. OpenAI, for example, remains at the center of several lawsuits from major publishers, including the New York Times and Mashable’s parent company, Ziff Davis.
