Some people are defending Perplexity after Cloudflare ‘named and shamed’ it

Cloudflare recently accused AI search engine Perplexity of stealthily scraping websites, bypassing explicit directives designed to block AI web crawlers. This incident has sparked a debate, with many rallying to Perplexity’s defense, arguing that its actions, while controversial, are acceptable and questioning how AI agents should be treated compared to human users.

The core of the controversy lies in Perplexity allegedly accessing websites despite specific `robots.txt` rules meant to prevent AI crawling. Cloudflare, a prominent provider of web security services, detailed how it set up a new website on a domain specifically configured to block Perplexity’s known AI bots. Despite these measures, Perplexity provided information about the site when queried. Cloudflare researchers reported that Perplexity fell back to a generic browser user agent impersonating Google Chrome on macOS to circumvent the blocks.
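For context, a site owner opting out of Perplexity’s declared crawlers would typically add directives like the following to `robots.txt`. The user-agent names below are the ones Perplexity publicly documents; the exact rules Cloudflare used in its test were not published in full:

```
# Disallow Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

Cloudflare’s point is that such directives only bind crawlers that identify themselves honestly: a bot presenting a generic browser user agent sails right past them, which is why the test domain also blocked the declared bots at the network level.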

Cloudflare CEO Matthew Prince publicly shared these findings on X (formerly Twitter), stating, “Some supposedly ‘reputable’ AI companies act more like North Korean hackers. Time to name, shame, and hard block them.” This strong stance positioned Perplexity’s behavior as a breach of ethical web crawling practices.

However, the narrative quickly gained counterpoints. Many users on platforms like X and Hacker News defended Perplexity. Their primary argument is that if a human user requests content from a website, that content should be provided. They question why an AI agent accessing the same information on behalf of a user should be treated differently from a human using a browser like Firefox. As one user on Hacker News put it, “why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser?”

A spokesperson for Perplexity initially denied that the bots were company-owned and suggested Cloudflare’s report was a sales pitch. Subsequently, Perplexity published a blog post defending its actions, attributing the behavior to a third-party service it occasionally uses. The company argued that the distinction between automated crawling and user-driven fetching is crucial, stating, “This controversy reveals that Cloudflare’s systems are fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats.”

Cloudflare contrasted Perplexity’s approach with that of OpenAI, which it said respects `robots.txt` and does not attempt to evade blocking directives. OpenAI’s ChatGPT Agent, Cloudflare noted, signs its HTTP requests using Web Bot Auth, a proposed open standard supported by Cloudflare and under development at the Internet Engineering Task Force (IETF) that aims to give AI agents a cryptographic way to identify their web requests.
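Web Bot Auth builds on the IETF’s HTTP Message Signatures standard (RFC 9421). The sketch below shows roughly what producing such a signed request could look like in Python; the `Signature-Input` and `Signature` header names come from RFC 9421, but the covered components, key ID, and key-directory URL are illustrative assumptions rather than the draft’s exact profile:

```python
# Sketch of signing a request in the spirit of Web Bot Auth, which builds
# on HTTP Message Signatures (RFC 9421). Simplified for illustration: the
# covered components, key ID, and directory URL are assumptions, not the
# exact profile the draft specifies.
import base64
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

key = Ed25519PrivateKey.generate()  # a real bot would use a stable, published key

created = int(time.time())
params = f'("@authority");created={created};keyid="bot-key-1";alg="ed25519"'

# RFC 9421 signature base: each covered component on its own line,
# followed by the @signature-params pseudo-component.
signature_base = (
    f'"@authority": example.com\n'
    f'"@signature-params": {params}'
)
signature = base64.b64encode(key.sign(signature_base.encode())).decode()

headers = {
    "Signature-Input": f"sig1={params}",
    "Signature": f"sig1=:{signature}:",
    # The Web Bot Auth draft adds a Signature-Agent header pointing at the
    # operator's key directory so sites can verify the signature
    # (this URL is a placeholder).
    "Signature-Agent": '"https://agent.example/.well-known/http-message-signatures-directory"',
}
print(headers)
```

A receiving site would fetch the operator’s published public key from the advertised directory and verify the signature, giving it a cryptographic identity check instead of a spoofable user-agent string.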

The debate arrives amid a significant shift in internet traffic. Bots, particularly those feeding AI models, have become a substantial burden, especially for smaller websites. An Imperva report found that automated traffic now outstrips human activity online, accounting for more than half of all web traffic, a shift driven largely by AI. Malicious bots, including those performing scraping and unauthorized login attempts, make up 37% of all internet traffic.

Historically, websites have relied on CAPTCHAs and services like Cloudflare to block malicious bots while admitting legitimate crawlers such as Googlebot via `robots.txt`. This system facilitated indexing and drove traffic back to sites. Increasingly, however, large language models (LLMs) consume that content without sending visitors on. Gartner predicts a 25% drop in search engine volume by 2026, largely due to AI chatbots and other virtual agents.
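For illustration, this is the kind of check a compliant crawler performs before fetching a page, here using Python’s standard-library `urllib.robotparser` (the URL and bot name are placeholders):

```python
# Minimal example of the check a well-behaved crawler runs before
# fetching a page. The site URL and user-agent string are placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch("ExampleBot", "https://example.com/some/page"):
    print("robots.txt permits this fetch")
else:
    print("robots.txt disallows this fetch; a compliant bot stops here")
```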

The dilemma intensifies as users increasingly adopt AI agents for tasks like travel booking and shopping. If users delegate these activities to AI agents, blocking these agents could inadvertently harm website business interests. User opinions reflect this tension: some advocate for AI agents to access public content freely on their behalf, while others emphasize the right of website owners to control access and drive direct traffic for potential ad revenue.

One user on X summarized the challenge: “This is why I can’t see ‘agentic browsing’ really working — much harder problem than people think. Most website owners will just block.” This ongoing discussion highlights the evolving landscape of AI interaction with the open web and the complex questions surrounding data access, privacy, and website autonomy.
