Perplexity has long been accused of deliberately bypassing anti-scraping measures to retrieve web content. While the company has historically dismissed these accusations as disingenuous or as misunderstandings, a new report suggests that the practice is not only still happening but may actually be getting worse.
Perplexity’s main counter-argument: semantics
The issue with Perplexity’s web crawling practices first came to light in June 2024, when Wired and other media outlets accused the company of ignoring the Robots Exclusion Protocol and pulling content from their websites.
At the time, Perplexity CEO Aravind Srinivas said the culprit was an unnamed third-party web crawling vendor, and that there was “a basic misunderstanding of the way this works.”
It wasn’t long before other publications started accusing Perplexity of plagiarism and unethical web scraping, with The New York Times and the BBC even issuing legal threats. At the time, Perplexity said the BBC was being “manipulative and opportunistic”, and had a “fundamental misunderstanding of technology, the internet and intellectual property law”.
Since then, Perplexity has repeatedly rejected this line of accusation, disputing what counts as crawling and scraping in specific use cases, as Wired has reported.
In other words, if a user manually provides a URL to an AI, Perplexity says its AI isn’t acting as a web crawler but rather a tool to assist the user in retrieving and processing information they requested. But to Wired and many other publishers, that’s a distinction without a difference because visiting a URL and pulling the information from it to summarize the text sure looks a whole lot like scraping if it’s done thousands of times a day.
Likewise, Srinivas has promised in the past that the company would make it easier to go to the original source of the content surfaced by their answer engine. However, this does not address the fact that the problem is in the sourcing of information, rather than just how it’s presented.
Cloudflare says Perplexity is going out of its way to go after data it is explicitly being told not to crawl
Today, Cloudflare published a report claiming that even when a server denies all automated access and includes specific rules blocking Perplexity’s public crawlers, Perplexity crawls the site anyway.
According to Cloudflare:
“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked. Both their declared and undeclared crawlers were attempting to access the content for scraping contrary to the web crawling norms as outlined in RFC 9309. This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare. In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks. This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.”
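To make the mechanics concrete, here is a minimal sketch of why a robots.txt rule aimed at a declared crawler depends on that crawler identifying itself honestly. It uses Python's standard `urllib.robotparser`. The robots.txt contents, the `PerplexityBot` and `Perplexity-User` tokens, the example URL, and the Chrome user-agent string are all assumptions chosen for illustration, not values taken from Cloudflare's test setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only the declared crawler tokens.
# The token names are assumed for illustration.
ROBOTS_TARGETED = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
"""

# Hypothetical robots.txt that denies every automated client.
ROBOTS_DENY_ALL = """\
User-agent: *
Disallow: /
"""

# Illustrative browser-style user-agent string (not from the report).
CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

URL = "https://example.com/article"  # placeholder URL


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, rules in [("targeted", ROBOTS_TARGETED), ("deny-all", ROBOTS_DENY_ALL)]:
        for agent in ["PerplexityBot", CHROME_UA]:
            print(label, agent.split("/")[0], is_allowed(rules, agent, URL))
```

Under the targeted rules, a client announcing itself as a generic browser is not matched by any rule, which is why the file alone cannot stop a crawler that changes its user-agent. Under the deny-all rules, the answer is "no" for every client, but robots.txt only states that answer; it is up to the crawler to ask and to respect it.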
In a statement to The Verge, Perplexity called the blog post a “publicity stunt”, and said that “there are a lot of misunderstandings in the blog post.”
To be fair, the accusation of unduly scraping or pulling web content to present it as part of an AI-generated answer is definitely not exclusive to Perplexity. In the past, OpenAI’s crawling practices were likened to DDoS attacks. The same goes for Anthropic.
It’s also worth remembering that the Robots Exclusion Protocol isn’t a law, but rather a widely followed convention. Still, Cloudflare’s investigation specifically called out Perplexity, which also happens to be the company reportedly under Apple’s consideration for an acquisition. So here we are.
Does Apple really need this headache?
There is absolutely nothing stopping Apple from acquiring Perplexity. In fact, I currently believe it is more likely than not that Apple will acquire it. To be perfectly honest, I’m half-expecting the announcement to come out before I’m done writing this piece.
And Apple should buy a company like Perplexity.
But given Apple’s stance on privacy and on doing what is right, should it really acquire a company with such a loaded background and, frankly, attitude?
It is perfectly possible that Apple believes that, under its culture, its leadership, and its ethical web crawling practices, it can render the inbound tech free of the supposed sins of the past. But this wouldn’t erase the fact that Perplexity only got to where it is because it did what it reportedly did.
Of course, if Apple decides to acquire Perplexity, that will (hopefully) mean that the company did its due diligence, and didn’t find anything legally compromising.
But it might also mean Apple feels pressured enough to compromise, however slightly, on its core principles to catch up. And if that turns out to be the case, it would be more disappointing than its current lag in AI.