A dispute has recently flared up between Cloudflare, one of the largest network infrastructure providers, and Perplexity, the AI search startup.
At its core is growing unrest over how AI companies collect and use online data, what an open web means, and whether these companies could significantly shift the landscape of web standards.
What Are Cloudflare's Assertions?
Cloudflare, which operates one of the world's largest CDNs, recently observed what it describes as stealth crawling by Perplexity and de-listed Perplexity's verified web crawler bot. The company claims that Perplexity has been covertly collecting data from websites that explicitly prohibit its bots, violating the directives in those sites' robots.txt files.
The robots.txt file tells crawlers which parts of a website they may access and which they may not. It implements the Robots Exclusion Protocol, which the IETF formalized as an internet standard (RFC 9309) in 2022.
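To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler checks robots.txt before fetching a page, using Python's standard library. The robots.txt content, the bot name "ExampleBot", and the URLs are all hypothetical.

```python
# Minimal sketch: checking whether a crawler may fetch a URL under
# the Robots Exclusion Protocol, using Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one bot is barred from /private/,
# everyone else may crawl everything.
robots_txt = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler consults the parser before each request.
print(parser.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
```

Note that nothing in the protocol enforces this check; compliance is entirely voluntary on the crawler's side, which is exactly why the dispute below exists.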
Cloudflare launched a "pay-per-crawl" service in June that allows sites to charge AI companies for crawling their content.
The company's CEO, Matthew Prince, said, "Cloudflare is giving content creators and publishers more control over how their content is accessed, despite the fact that some mischaracterize user-driven AI assistants as malicious bots."
He further stated that unregulated AI crawling poses an "existential threat" to content creators, noting that 2.5 million sites have chosen to block AI training since July.
In this scenario, Cloudflare states that when Perplexity's declared user agents (PerplexityBot and Perplexity-User) are blocked or hit WAF rules, the service switches to undeclared agents and rotates IP addresses to evade detection and gain access.
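The allegation matters because robots.txt blocking only works against a crawler that identifies itself truthfully. A site wanting to opt out of Perplexity's declared crawlers would publish directives like the following (a hypothetical example, using the bot names from Cloudflare's account above):

```text
# Block Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

If a crawler then presents a generic browser user agent from a rotating pool of IP addresses, these directives no longer match it, which is why network-level defenses such as WAF rules enter the picture at all.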
How Has Perplexity Responded?
Notably, Perplexity has not directly denied using these techniques; rather, it argues the issue stems from a misunderstanding of how AI assistants operate compared to traditional web crawlers.
Perplexity maintains that traditional search engines crawl hundreds of millions of pages to build static indexes, regardless of whether any user has requested them. By contrast, Perplexity's "user-driven" agents fetch content only in response to a specific user request.
Perplexity explained that, unlike traditional web crawlers that systematically index millions of pages, it only pulls and summarizes the content needed to answer a specific query. When Cloudflare stated that Perplexity ignores robots.txt, Pubcon founder Brett Tabke noted that this behavior had already been observable in server logs, and that robots.txt had never been a significant roadblock.
They contend that because this differs significantly from indiscriminate crawling, it should not be subject to the same robots.txt requirements.
The company adds that these requests are transient, made on behalf of an individual to retrieve information, and not used to train AI models. Perplexity also believes Cloudflare is misattributing traffic from BrowserBase, a cloud-based browser service, to Perplexity.
The Industry Context
A conflict is emerging as AI chatbots replace traditional search engines for retrieving information. AI is everywhere, and companies are ramping up to stay ahead of the competition. Google, for instance, has already rolled out AI Overviews, which provide summaries before displaying links to websites.
Although users get a quick answer to their query, this ultimately reduces traffic to publishers. The Cloudflare-Perplexity dispute thus raises a larger, unresolved question: as AI services seek fresh, high-quality data, how should they respect publishers' rights to control and monetize their work?
The Traditional Web Ecosystem Falls Behind
The traditional web ecosystem, in place for decades, sent users to publishers' sites, where creators could earn money through ads or subscriptions. AI-powered answer engines disrupt this model: by offering direct summaries, they keep users from visiting the primary source.
"This is why Cloudflare's bot blockers are concerning," said Tabke, pointing out that publishers relying on Google Search cannot opt out of having their content used for Google's AI training or summaries without losing visibility in search results.
Ultimately, as AI chatbots become the standard way people look for information, they raise many questions despite the benefits they bring.