As the tug-of-war between AI companies and website owners intensifies, one player’s approach is raising particularly sharp concerns. Perplexity, a growing AI-powered answer engine, has been observed engaging in stealth crawling—deliberately evading standard mechanisms used by websites to control bot access.
Traditionally, crawlers are expected to identify themselves through user agents, respect directives in robots.txt files, and adhere to rate limits and content access policies. These rules, formalized in standards like RFC 9309, ensure a baseline of trust and predictability across the open web. However, according to a Cloudflare investigation, Perplexity appears to be actively circumventing these rules.
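For concreteness, this is roughly the check a compliant crawler performs before fetching a page, sketched here in Python with the standard library's urllib.robotparser; the bot token and URLs are placeholders, not any particular vendor's code.

```python
from urllib import robotparser

# Placeholder bot token and site; a real crawler would substitute its own
# declared user-agent token and the pages it intends to fetch.
BOT_NAME = "ExampleBot"
SITE = "https://example.com"

parser = robotparser.RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

url = f"{SITE}/articles/some-page"
if parser.can_fetch(BOT_NAME, url):
    # RFC 9309 does not define Crawl-delay, but many sites use it as a
    # de facto rate limit, so polite crawlers honor it when present.
    delay = parser.crawl_delay(BOT_NAME) or 1
    print(f"Allowed to fetch {url}; pausing {delay}s between requests")
else:
    print(f"robots.txt disallows {url} for {BOT_NAME}; skipping")
```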
Cloudflare reports that when blocked via robots.txt or Web Application Firewall (WAF) rules, Perplexity's infrastructure switches tactics, deploying crawlers that impersonate normal browser traffic. These stealth crawlers use generic user agents such as:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
These requests reportedly originate from a rotating pool of IP addresses and ASNs not affiliated with Perplexity’s declared infrastructure, making them difficult to trace and block. Cloudflare estimates that while Perplexity’s declared bots account for 20–25 million daily requests, an additional 3–6 million requests come from these undeclared sources.
The seriousness of this tactic becomes clear in Cloudflare's controlled tests. After deploying freshly registered domains that were isolated from public indexing and that clearly disallowed bot access via robots.txt, Cloudflare queried Perplexity about them. The system still returned detailed answers, strongly implying unauthorized access by stealth crawlers. Where the blocks did hold, answers became vaguer, suggesting that Perplexity compensates by sourcing content from third-party aggregators.
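Cloudflare has not published the exact files used on those test domains, but a robots.txt that clearly disallows bot access would look roughly like the one this short sketch writes out; the named tokens below follow Perplexity's own published crawler names, and the trailing wildcard rule by itself would already deny every compliant crawler.

```python
# Sketch of the kind of robots.txt Cloudflare describes for its test
# domains. The named tokens follow Perplexity's published crawler names;
# the wildcard rule denies all other compliant bots.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Disallow: /
"""

with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(ROBOTS_TXT)
```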
This behavior diverges starkly from what has come to be accepted as good-citizen crawling. Ethical crawlers, such as those operated by OpenAI, are observed to respect both robots.txt directives and network-level blocks. OpenAI's bots, for example, halt crawling when disallowed by robots.txt or served a block page, and they do not attempt to bypass these controls with alternate user agents or IPs.
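The same good-citizen contract applies at the network level: when a server answers with a block page, a compliant crawler backs off instead of retrying under a different identity. A minimal sketch of that halt logic follows; the user-agent string and URL handling are placeholders, not any vendor's actual crawler code.

```python
import requests

# Placeholder identity; a real crawler would send its own declared
# user-agent string pointing at its bot documentation.
HEADERS = {"User-Agent": "ExampleBot/1.0 (+https://example.org/bot)"}

def fetch_politely(url: str) -> str | None:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    # 401/403 (block pages) and 429 (rate limiting) are treated as hard
    # signals to stop, not as obstacles to route around by switching
    # user agents or IP addresses.
    if resp.status_code in (401, 403, 429):
        print(f"Blocked at {url} ({resp.status_code}); halting crawl")
        return None
    resp.raise_for_status()
    return resp.text
```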
As the AI arms race intensifies, so does the tension between content protection and large-scale data harvesting. In response to widespread concerns, Cloudflare has expanded its managed bot protection features, which over 2.5 million websites now use to selectively block AI crawling. Its signature-based detection now covers stealth crawlers like those attributed to Perplexity, and these protections are available even to free-tier users.
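Cloudflare has not disclosed its signatures, but the general shape of such heuristics can be sketched: flag traffic whose user agent claims a mainstream browser while other signals disagree. The ASN set and header check below are illustrative assumptions, not Cloudflare's actual rules.

```python
# Greatly simplified illustration of signature-based detection; real
# systems combine many more signals plus machine-learned scoring.
# The ASN set uses documentation-range numbers and is purely illustrative.
DATACENTER_ASNS = {64500, 64501, 64502}

def looks_like_stealth_crawler(user_agent: str, asn: int,
                               sec_ch_ua: str | None) -> bool:
    claims_browser = "Mozilla/" in user_agent and "Chrome/" in user_agent
    # Modern Chrome sends Sec-CH-UA client-hint headers; a script that
    # copies only the User-Agent string typically does not.
    missing_client_hints = claims_browser and not sec_ch_ua
    from_datacenter = asn in DATACENTER_ASNS
    return claims_browser and (missing_client_hints or from_datacenter)
```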
The implications go beyond a single company. Stealth crawling threatens the foundational norms that allow website owners to retain agency over how their content is used. The future of a cooperative and open web may depend on enforcement mechanisms, transparency standards, and the technical arms race between evasion and detection.
Efforts are already underway at bodies like the IETF to expand the scope of crawler control standards. In the meantime, Cloudflare’s stance is clear: “Follow the rules, or face the block.”