Robots.txt After Formalization: Compliance, Defection, and the Limits of Voluntary Protocol

In early 2025, researchers deployed successive versions of robots.txt across 36 institutional websites and watched 130 self-declared bots for 40 days. The data from one bot, ByteSpider — operated by ByteDance and classified by the study as an AI data scraper — looked like this: 0% endpoint access compliance, 0% disallow compliance, 39.8% crawl-delay compliance. The study’s table notes that ByteSpider “does not promise to respect robots.txt.”

The crawl-delay figure is the strange one. Crawl-delay is an informal convention that asks bots to pause between requests; it reduces server load and has been widely adopted by well-behaved crawlers for years. It is not mentioned in RFC 9309, the September 2022 IETF document that formalized robots.txt as a Standards Track protocol with explicit normative requirements after 28 years as an industry practice with no standards body behind it. Crawl-delay is not MUST. It is not SHOULD. It is not there.

ByteSpider partially respects a convention the formal standard omits while showing zero measured compliance with the rules the formal standard makes mandatory. That inversion is the governance question.

The formal standard

RFC 9309 is a Standards Track document — the IETF’s designation for specifications that represent community consensus and have cleared the Internet Engineering Steering Group’s review. It was published in September 2022 and lists four co-authors: Martijn Koster, who wrote the original 1994 robots.txt proposal, and Gary Illyes, Henner Zeller, and Lizzi Sassman, all three of Google LLC.

The abstract says the document “specifies and extends” Koster’s 1994 protocol. The additions are specific: formal ABNF syntax for the file format, UTF-8 encoding requirements, a minimum parsing limit of 500 kibibytes, explicit redirect-following rules (crawlers SHOULD follow at least five consecutive redirects, after which they MAY treat the robots.txt file as unavailable, a status that permits crawling rather than forbidding it), and the most-specific-match rule (the most specific path match, meaning the one with the most octets, MUST be used). What does not appear in the standard, anywhere, is crawl-delay. It is absent from the normative requirements, the ABNF grammar, and the examples.
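The most-specific-match rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `is_allowed` and the `(directive, pattern)` tuple format are assumptions made here, and the sketch treats patterns as plain prefixes, ignoring the `*` and `$` special characters and the percent-encoding normalization that RFC 9309 also defines. It does follow two details from the RFC: specificity is measured in octets, and when an allow rule and a disallow rule are equally specific, allow wins.

```python
from typing import Iterable, Tuple


def is_allowed(rules: Iterable[Tuple[str, str]], path: str) -> bool:
    """Apply a longest-match rule to one user-agent group (sketch).

    rules: (directive, pattern) pairs, directive is "allow" or "disallow".
    Patterns are treated as literal path prefixes in this sketch.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed

    for directive, pattern in rules:
        if not pattern:
            continue  # empty patterns are skipped in this sketch
        if path.startswith(pattern):
            octets = len(pattern.encode("utf-8"))  # specificity in octets
            allow = directive.lower() == "allow"
            # A longer match wins; on a tie, prefer the allow rule.
            if octets > best_len or (octets == best_len and allow):
                best_len, best_allow = octets, allow

    return best_allow
```

With the rules `Disallow: /` and `Allow: /page-data`, a request for `/page-data/x` is allowed because the allow pattern is the longer match, while `/about` falls through to the disallow.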

The MUST requirements that do appear cover the core protocol precisely. Crawlers MUST follow parseable rules if robots.txt is successfully downloaded. If robots.txt is unreachable due to server or network errors, crawlers MUST assume complete disallow. The file MUST be UTF-8 encoded. These requirements have no enforcement mechanism. RFC 9309 says so directly: “These rules are not a form of access authorization.”
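How a cooperating crawler is expected to react to the outcome of fetching robots.txt can be written as a small decision function. The sketch below is an illustration under assumptions: the names `Policy` and `policy_for_fetch` are invented here, `status` is taken to be the final HTTP status after any redirects have been followed, and `None` stands in for a network failure. The 4xx branch reflects RFC 9309’s separate “unavailable” case, in which a crawler MAY access any resource.

```python
from enum import Enum
from typing import Optional


class Policy(Enum):
    FOLLOW_RULES = "follow the parseable rules"
    ALLOW_ALL = "may access any resource"
    DISALLOW_ALL = "must assume complete disallow"


def policy_for_fetch(status: Optional[int]) -> Policy:
    """Map the outcome of a robots.txt fetch to a crawl policy (sketch).

    status: final HTTP status code, or None for a network error.
    """
    # Unreachable: server errors or network failures.
    if status is None or 500 <= status <= 599:
        return Policy.DISALLOW_ALL
    # Successful download: the parseable rules apply.
    if 200 <= status <= 299:
        return Policy.FOLLOW_RULES
    # Unavailable: the file does not exist or cannot be served.
    if 400 <= status <= 499:
        return Policy.ALLOW_ALL
    raise ValueError(f"status not handled in this sketch: {status}")
```

Every branch describes what a crawler does to itself. Nothing in the function, or in the protocol, acts on a crawler that never runs this logic.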

That sentence positions RFC 9309 as a formalization of a voluntary norm, not a technical control. This was the design of the 1994 document too. Koster’s June 1994 consensus text, the working document the web adopted, described the standard as “not an official standard backed by a standards body” and “not enforced by anybody.” What changed between the 1994 text and RFC 9309 is the precision of the definitions. The compliance model remained voluntary.

What the study measured

Kim et al. (arXiv:2505.21733) ran a controlled experiment across 36 institutional websites, tracking 130 self-declared bots with known user agents. They collected baseline data in January 2025 — a version with restrictions on three paths — and then deployed three successive versions from February 12 to March 29, each for two weeks: a version adding a 30-second crawl-delay; a version restricting most bots to a single endpoint (/page-data); and a version completely disallowing most bots, with eight SEO crawlers exempted.
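To make the three experimental versions concrete, here is a hedged sketch of what files of that shape might look like, checked with Python’s standard-library `urllib.robotparser`. The file contents are illustrative guesses, not the study’s actual files; `ExampleSEOBot` and `SomeBot` are hypothetical user agents, and only the 30-second delay and the `/page-data` endpoint come from the description above. The standard-library parser is used purely for convenience and should not be assumed to implement every RFC 9309 matching rule.

```python
from urllib.robotparser import RobotFileParser

# Illustrative contents only; the study's real files are not reproduced here.
CRAWL_DELAY_VERSION = """\
User-agent: *
Crawl-delay: 30
"""

ENDPOINT_VERSION = """\
User-agent: *
Allow: /page-data
Disallow: /
"""

# ExampleSEOBot is a hypothetical stand-in for an exempted SEO crawler.
DISALLOW_VERSION = """\
User-agent: ExampleSEOBot
Allow: /

User-agent: *
Disallow: /
"""


def load(text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    print(load(CRAWL_DELAY_VERSION).crawl_delay("SomeBot"))
    print(load(ENDPOINT_VERSION).can_fetch("SomeBot", "/about"))
    print(load(DISALLOW_VERSION).can_fetch("ExampleSEOBot", "/about"))
```

The parser only reports what the file asks for. Whether a given bot honors the answer is exactly what the study set out to measure.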

Compliance was measured across three dimensions. Crawl-delay compliance tracked whether bots actually paused the required time between requests. Endpoint access compliance measured what fraction of a bot’s page accesses went to permitted paths rather than restricted ones. Disallow compliance measured whether bots avoided restricted paths under the complete-disallow version.
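Two of these measures can be approximated from a bot’s request log. The sketch below is one plausible reading of the descriptions above, not the paper’s exact method: the function names, the use of gaps between consecutive requests, and prefix matching on paths are all assumptions made for illustration. Disallow compliance is left out because its precise definition is not given here.

```python
from typing import Optional, Sequence


def crawl_delay_compliance(
    timestamps: Sequence[float], delay_seconds: float
) -> Optional[float]:
    """Fraction of gaps between consecutive requests that meet the delay.

    timestamps: request times in seconds for a single bot.
    Returns None when there are fewer than two requests.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None
    return sum(gap >= delay_seconds for gap in gaps) / len(gaps)


def endpoint_access_compliance(
    paths: Sequence[str], permitted_prefixes: Sequence[str]
) -> Optional[float]:
    """Fraction of requested paths that fall under a permitted prefix.

    Returns None when the bot made no requests.
    """
    if not paths:
        return None
    permitted = tuple(permitted_prefixes)  # str.startswith accepts a tuple
    return sum(path.startswith(permitted) for path in paths) / len(paths)
```

Under this reading, a bot that requests `/page-data/a`, `/about`, `/page-data/b`, and `/page-data/c` while only `/page-data` is permitted scores 0.75 on endpoint access.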

The study’s primary finding is directional and holds across categories: stricter directives receive lower compliance. But the breakdown by crawler type is not uniform.

The study distinguishes AI data scrapers — which include ByteSpider, ClaudeBot, and GPTBot — from AI search crawlers, which include AppleBot and Amazonbot. The two categories show a counterintuitive split. AI data scrapers average 56.0% crawl-delay compliance, 35.2% endpoint access compliance, and 76.6% disallow compliance. AI search crawlers average 89.5% crawl-delay compliance, 62.3% endpoint access compliance, and 34.8% disallow compliance. The categories invert across directive types: search crawlers lead on crawl-delay and endpoint access; data scrapers lead on complete-disallow compliance.

ByteSpider is an outlier within its own category. Both its endpoint access compliance and disallow compliance are 0%, against an AI data scraper disallow average of 76.6%. Its crawl-delay compliance is 39.8% — partially respecting a convention not in the standard while ignoring the standard’s provisions almost entirely.

The study found that ByteSpider’s shift toward noncompliance under the endpoint restriction was statistically significant: z = −5.04, p = 4.74×10⁻⁷. This is not a flat line. When the endpoint restriction was deployed, ByteSpider’s compliance moved in the wrong direction by a measurable amount. Whether this means ByteSpider reads robots.txt before deciding to ignore it, or whether the behavioral signature has some other explanation, is not established in the sources. But a significant negative shift is hard to reconcile with simple indifference to the file.
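The relationship between the reported z-score and p-value can be checked with a short calculation. The sketch assumes a two-sided test against a standard normal distribution, which is an assumption made here rather than something stated above; the function name `two_sided_p` is likewise illustrative.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under a standard normal null.

    Uses the identity P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    print(two_sided_p(-5.04))
```

Under that assumption, z = −5.04 yields a p-value on the order of 10⁻⁷, close to the reported figure; a small difference would be expected if the published z-score is rounded.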

A scope note applies here: 36 sites, one institution’s web logs, 40 days. The paper acknowledges these limits. Its findings on specific crawlers should be read as evidence of a pattern rather than a comprehensive compliance census of the wider web. It is, per the paper’s own description, the first controlled study of this kind.

The alignment that made robots.txt work

Between 1994 and roughly 2020, robots.txt compliance among major crawlers was high enough that the protocol functioned. Googlebot, Bingbot, and their predecessors generally followed disallow directives. Nobody enforced anything. The compliance happened anyway.

Search crawlers depend on content. Their business model requires open access to pages that content owners will tolerate being crawled — the exchange that motivated compliance was indexing traffic returned to the content owner. A crawler that violates robots.txt at scale faces technical countermeasures from operators who notice, and reputational costs in an industry where content relationships are the product. Compliance was aligned with self-interest for actors whose revenue depended on maintaining those relationships.

AI data scrapers occupy a different position. They want the content for training data; the exchange that motivated compliance for search crawlers does not exist for them. ByteSpider’s use case does not require an ongoing relationship with content owners. Being blocked doesn’t cost it the way being blocked costs Googlebot. The equilibrium that made robots.txt work for 28 years was a feature of the original actors’ incentive structure, not a feature of the protocol.

RFC 9309 was published in September 2022. Kim et al.’s study was conducted in early 2025, roughly 30 months later. Whether compliance was better or worse before formalization cannot be established from available evidence: no comparable controlled baseline study exists for the pre-2022 period. Any argument that RFC 9309 failed to improve compliance requires a comparison that the literature does not currently provide.

Three enforcement candidates

RFC 9309 contains no enforcement provision. The three candidates that would create one are search engine penalization, legal enforcement, and technical controls at the server level. Each has a testable form.

Search engine penalization. If major search engines — Google, Bing, Apple — signaled that they would downrank or delist AI products from companies whose crawlers violated robots.txt at scale, those companies would face costs in markets where search visibility matters. The mechanism relies on search engines being willing to use that leverage and on the non-compliant crawlers having products that depend on the search relationship.

ByteSpider is associated with ByteDance, which does not operate a general-purpose search product with meaningful Western market share. For crawlers that do have such products, or crawlers operated by companies that also run search services in markets dominated by Google or Bing, this mechanism has purchase. For ByteSpider specifically, it likely does not.

Testable prediction: If major search engines announced a robots.txt compliance requirement for operators of affiliated crawlers, compliance among AI search crawlers — crawlers operated by or alongside search businesses with content relationships at stake — would likely increase. Compliance among AI data scrapers with no search product, and no equivalent relationship to protect, would likely not change.

Legal enforcement. The Computer Fraud and Abuse Act, in U.S. jurisdiction, addresses unauthorized computer access. The most widely discussed CFAA scraping case is hiQ Labs, Inc. v. LinkedIn Corp., which reached the Ninth Circuit; its key question was whether scraping publicly accessible data constitutes unauthorized access. Whether robots.txt alone establishes authorization under the CFAA, and whether RFC 9309’s explicit disclaimer (“These rules are not a form of access authorization”) bears on that analysis, is an open legal question the available sources do not resolve. GDPR scraping enforcement in EU jurisdictions addresses personal data and operates on a different jurisdictional basis.

If CFAA authorization does not flow from robots.txt, adding formal normative requirements to robots.txt does not, by itself, create an enforcement path through CFAA. The protocol’s own language works against that reading.

Testable prediction: If a CFAA enforcement action were successfully brought against a crawler specifically for robots.txt non-compliance on a resource where technical access was conditioned on crawl-delay compliance — not an open public resource — the hiQ precedent would not directly apply, and compliance costs for crawlers with U.S. legal exposure might increase. ByteDance’s exposure in U.S. courts involves other considerations that are outside the scope of this piece.

Technical controls. RFC 9309 positions robots.txt as distinct from access authorization by design. The practical conclusion of that design choice is already available to website operators: rate limiting, IP blocking, CAPTCHAs, authenticated access, and crawler-specific filters are server-side controls that enforce restrictions regardless of what any crawler does with robots.txt. They do not require the crawler’s cooperation. They do not depend on the crawler reading a file.
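As one illustration of a control that does not depend on cooperation, a server can enforce a minimum interval between requests from the same client, which is what crawl-delay only asks for. This is a minimal in-memory sketch: the class name `MinIntervalLimiter`, the choice of client key (for example an IP address or user agent), and the injectable clock are all assumptions, and a production deployment would more likely live in a reverse proxy or CDN.

```python
import time
from typing import Callable, Dict, Hashable


class MinIntervalLimiter:
    """Reject requests that arrive too soon after the last accepted one."""

    def __init__(
        self,
        min_interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval = min_interval_seconds
        self.clock = clock
        self._last_allowed: Dict[Hashable, float] = {}

    def allow(self, client_key: Hashable) -> bool:
        """Return True if the request may proceed, False if it is too early."""
        now = self.clock()
        last = self._last_allowed.get(client_key)
        if last is not None and now - last < self.min_interval:
            return False  # enforced by the server, whatever the crawler intends
        self._last_allowed[client_key] = now
        return True
```

A handler would call `allow(...)` before serving a page and return an error status such as 429 when it yields `False`. The check runs on every request, which is the per-request overhead that robots.txt avoids.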

The tradeoff is cost. Robots.txt imposes no per-request overhead and works automatically for compliant crawlers. Technical controls impose overhead on every request and require ongoing maintenance as crawlers modify their user agents or network footprints. The study documented 18 bots associated with multiple ASNs in patterns suggesting user-agent spoofing — the problem of attributing a request to the right crawler is real, and technical controls are not immune to it.
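The attribution problem can be illustrated by grouping log records by claimed user agent and counting the distinct networks each one arrives from. This is a rough sketch, not the study’s method, which is not described here: the function name, the `(user_agent, asn)` record shape, and the threshold are assumptions, and a legitimate crawler can also operate from more than one ASN, so the output is a list of candidates for review rather than proof of spoofing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def agents_with_multiple_asns(
    records: Iterable[Tuple[str, int]], min_asns: int = 2
) -> Dict[str, List[int]]:
    """Map each claimed user agent to its ASNs when it uses several.

    records: (user_agent, asn) pairs taken from access logs that have
    already been enriched with an IP-to-ASN lookup (assumed to exist).
    """
    asns_by_agent = defaultdict(set)
    for user_agent, asn in records:
        asns_by_agent[user_agent].add(asn)

    return {
        agent: sorted(asns)
        for agent, asns in asns_by_agent.items()
        if len(asns) >= min_asns
    }
```

A server-side filter keyed on user agent alone would treat all of those sources as the same crawler, which is why the controls described above are not immune to misattribution.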

Testable prediction: If adoption of server-side technical controls for identified AI crawlers increased substantially across major content platforms, robots.txt compliance rates would become an irrelevant metric for those platforms. This mechanism does not require the protocol to change, but it changes the protocol’s role from governance layer to informational courtesy signal.

The paper’s conclusion states that “relying on robots.txt to prevent unwanted scraping is risky” and calls for “a legally enforceable standard” or “a novel technical tool.” What legally enforceable standards would look like for a protocol that explicitly disclaims being access authorization — and how they would interact with the CFAA precedent on publicly accessible data — is not addressed in the available sources. That gap is the governance question RFC 9309’s formalization left unresolved.