Reddit Inc. said today it has decided to block the Internet Archive from indexing its popular web forums in order to prevent sneaky artificial intelligence firms from scraping its content for training purposes.
Reddit reportedly found evidence that AI companies were scraping its content via the Internet Archive’s platform after it had restricted them from scraping its own website directly. The decision means that the organization’s popular Wayback Machine service will no longer be able to archive Reddit pages, threads, profiles or comments. It will be limited to what’s shown on Reddit’s homepage.
According to a report in The Verge, this means that, going forward, the archive will only be able to show which posts and news headlines were popular on any given day. Previously, the Wayback Machine was able to archive every single page, documenting everything that was posted onto the “front page of the internet,” as Reddit proclaims itself to be.
Reddit did not say which AI companies were using the Wayback Machine to get around its prohibitions on them scraping its content. A spokesperson for the company told The Verge that it has “become aware of instances where AI companies violate platform policies… and scrape data from the Wayback Machine.”
The company seems to think that the Internet Archive should be taking steps to prevent this scraping, so there’s hope that the decision won’t be a permanent one. However, the report also highlights a concern raised by Reddit that the Wayback Machine tends to archive users’ posts and comments that are later deleted, which the company says is problematic for user privacy.
“Until they’re able to defend their site and comply with platform policies, we’re limiting some of their access to Reddit data to protect redditors,” the company said.
Although Reddit raises the issue of user privacy, it’s likely that its primary motivation for blocking the scrapers is money. AI companies are expressly prohibited from crawling its website, unless they’re willing to pay to access that data. Several companies have taken Reddit up on that offer, notably Google LLC and OpenAI.
Reddit has never revealed how much its deal with OpenAI is worth, but the agreement with Google is reportedly worth around $60 million. Reddit has also stated previously that it hopes to generate as much as $200 million from such licensing agreements over the next three years.
One company that doesn’t seem prepared to pay up is Anthropic PBC. In June, Reddit filed a lawsuit against the AI company, alleging that it continued to scrape Reddit’s content even after claiming it had stopped doing so.
The Internet Archive isn’t the first organization to be blocked by Reddit over scraping concerns. In June 2024, the social media firm said it had blocked Microsoft Corp.’s Bing and smaller search engines, such as DuckDuckGo, Mojeek and Qwant, in order to prevent its content being scraped through their archives.
It’s not immediately clear whether the Internet Archive will try to take steps to prevent its archives from being scraped so that it can get Reddit’s restrictions lifted. In a statement, Wayback Machine Director Mark Graham said his team is engaged in “ongoing discussions about this matter.”