The Three-Paragraph Agreement: robots.txt before and after RFC 9309

The June 1994 document that became the foundation of robots.txt includes a disclaimer in its second paragraph: the standard is “not an official standard backed by a standards body” and “not enforced by anybody,” with “no guarantee that all current and future robots will use it.”

This is not modest hedging. It is an accurate description of what Martijn Koster was proposing: a plain text file at a predictable URL, readable by any robot that chose to read it, binding on none that didn’t. In the months after the proposal first circulated, the robots complied anyway. By July 1994, Koster could write in his announcement of the document that “most of the robots in operation either use it already, or have promised support soon.”

How a protocol “not enforced by anybody” became the web’s first governance layer is the question the June 1994 document raises and doesn’t quite answer. Reading it against RFC 9309, the formal standardization published twenty-eight years later in September 2022 and co-authored by Koster and three Google employees, makes the question clearer and, in some respects, harder.

Before the Document

The problem that prompted robots.txt was operational and specific. Koster’s June 1994 text describes it without abstraction: servers were being swamped with rapid-fire requests, the same files retrieved repeatedly across multiple passes, CGI scripts with side-effects traversed as if they were static pages, “very deep virtual trees” explored completely when they had no business being explored at all. These were not attacks. They were the behavior of software running without constraints, written by researchers and early-web enthusiasts who hadn’t anticipated what would happen when the web got large enough for robots to cause real load.

By May 1993, according to secondary accounts, robot requests were already detectable in server logs. Koster’s July 1994 announcement of the formal document refers to “a proposed standard for robot exclusion I posted to this forum last year.” From July 1994, “last year” means 1993. Secondary sources place his first formal proposal on February 25, 1994, under the subject line “Important: Spiders, Robots and Web Wanderers.” If Koster was remembering accurately, some earlier posting predated the February proposal. Neither the 1993 discussion nor the February 1994 post is accessible in surviving archives, so the gap can be noted but not closed. The June 1994 consensus document is the earliest text the record actually preserves.

What secondary sources do confirm about the February proposal: the file was originally called /RobotsNotWanted.txt. The name changed because DOS-compatible servers couldn’t handle a filename that long. Every design decision in the June document — the filename, the location at the server root, the plain text format — was a compatibility decision made under real constraints, aimed at making the protocol work on every server without requiring anything that wasn’t already there.

The Document

Koster’s June 30, 1994 text is the working consensus version, shaped by discussion on the Robots mailing list. It reads like what it is: a specification written for people who need to implement it.

The method is direct. Servers make a plain text file available at /robots.txt. The file contains records separated by blank lines, each beginning with one or more User-agent lines identifying a robot by name (or “*” for any robot not explicitly named), followed by Disallow lines naming paths the robot should not visit. The “#” character marks comments, following UNIX shell convention. Case-insensitive substring matching of robot names is recommended to handle variability in how robots identify themselves.
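The record format described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names parse_records and is_disallowed, the sample file, and the robot names are all hypothetical, and the sketch assumes that a Disallow value is a simple path prefix and that an empty Disallow value excludes nothing.

```python
def parse_records(text):
    """Split a 1994-style robots.txt into (agents, disallowed_prefixes) records."""
    records = []
    agents, disallows = [], []
    for raw in text.splitlines():
        if raw.strip().startswith("#"):
            continue  # comment-only line: ignored, not treated as a record boundary
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            # A blank line closes the current record.
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow" and value:
            disallows.append(value)  # an empty value excludes nothing
    if agents:
        records.append((agents, disallows))
    return records


def is_disallowed(records, robot_name, path):
    """Apply the record naming this robot, falling back to the "*" record."""
    name = robot_name.lower()
    chosen = None
    for agents, disallows in records:
        # Case-insensitive substring match on the robot's name.
        if any(a and a != "*" and a in name for a in agents):
            chosen = disallows
            break
        if "*" in agents and chosen is None:
            chosen = disallows
    return any(path.startswith(prefix) for prefix in (chosen or []))


# Hypothetical file: one named robot with no restrictions, everyone else restricted.
SAMPLE = """\
# robots.txt for a hypothetical site
User-agent: webcrawler
Disallow:

User-agent: *
Disallow: /cyberworld/map/  # a deep virtual tree
Disallow: /tmp/
"""

records = parse_records(SAMPLE)
print(is_disallowed(records, "SomeBot/1.0", "/tmp/cache.html"))
print(is_disallowed(records, "WebCrawler/3.0", "/tmp/cache.html"))
```

A path that matches no Disallow prefix is simply not excluded, so the sketch needs no notion of an explicit permission.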

The document spends more time than might seem necessary on the choice of /robots.txt. The stated criteria: the filename is short enough for DOS compatibility; the file lives at the server root, which requires no special server configuration; it’s unlikely to conflict with existing files; a robot can retrieve it with a single HTTP request before beginning any traversal. These read as constraint satisfaction rather than a design statement. The protocol could work on a one-megabyte Mac or a UNIX server or anything in between, and the file’s location and format made no demands that would exclude anyone.

What the format does not include: any Allow directive. The 1994 standard was entirely exclusionary. You could tell a robot where not to go, but you could not explicitly tell it where it could. This is worth noting not as a design flaw but as a reflection of what the protocol was for: the problem was robots going places they shouldn’t. The solution was a list of places robots shouldn’t go. A path absent from the Disallow list was accessible by default; the protocol didn’t need to say so.

After the technical specification, the document returns to its disclaimer. The standard is not official, not enforced, and compliance rests on community convention rather than any technical guarantee: Koster frames the protocol as “a common facility the majority of robot authors offer the WWW community to protect WWW server against unwanted accesses by their robots.” Enforcement was reputational. The web’s robot-writing community in 1994 was small enough to be known: the people writing robots were the people reading the www-talk mailing list, and among them an operator who ignored the file would be identified and pressured. Koster’s July announcement carries the confidence of someone writing for exactly that community: most of the robots are already on board, and the rest have promised.

RFC 9309

In September 2022, Koster co-authored RFC 9309 with Gary Illyes, Henner Zeller, and Lizzi Sassman, all employees of Google LLC. The RFC was published through the IETF. Its abstract says the document “specifies and extends” Koster’s 1994 protocol.

Both words are accurate. They describe different things.

“Specifies” covers the formalization work: the 1994 text described the format in English prose; RFC 9309 defines it in ABNF notation, giving formal grammar to a syntax that had been running on shared understanding for nearly three decades. The file must be UTF-8 encoded. Crawlers should follow at least five consecutive redirects before treating the file as unavailable. A server returning a 5xx error means the crawler must assume full disallow. A 4xx error means the site may be treated as unconstrained. Crawlers should cache the file for no more than 24 hours. None of these rules were in the 1994 text — they are answers to questions that document didn’t ask, because in 1994 those scenarios weren’t the ones causing problems.
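The error-handling and caching rules in this paragraph reduce to a small decision table. The sketch below is one hedged reading of them in Python. The names Policy, policy_for_status, and cache_is_fresh are hypothetical, redirect following is assumed to have happened before the final status code is examined, and treating every 2xx response as a file to parse is a simplifying assumption.

```python
from enum import Enum


class Policy(Enum):
    PARSE = "parse the file and apply its rules"
    ALLOW_ALL = "treat the site as unconstrained"
    DISALLOW_ALL = "assume full disallow"


MIN_REDIRECTS_TO_FOLLOW = 5          # follow at least this many before giving up
MAX_CACHE_SECONDS = 24 * 60 * 60     # reuse a cached copy for at most 24 hours


def policy_for_status(status):
    """Map the final HTTP status of a /robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return Policy.PARSE
    if 400 <= status < 500:
        return Policy.ALLOW_ALL      # file unavailable: crawler may proceed
    if 500 <= status < 600:
        return Policy.DISALLOW_ALL   # server error: crawler must back off
    raise ValueError(f"unhandled status code: {status}")


def cache_is_fresh(fetched_at, now):
    """True while a cached robots.txt is within the 24-hour window (seconds)."""
    return (now - fetched_at) <= MAX_CACHE_SECONDS


print(policy_for_status(404).value)
print(policy_for_status(503).value)
```

The asymmetry is the point of the table: a missing file is read as permission, while a failing server is read as refusal.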

“Extends” covers the protocol additions. The most significant is Allow, which didn’t exist in the 1994 standard. Allow lets service owners carve exceptions out of broad Disallow rules: block everything under /search/ except /search/about. The original protocol had no way to express this; you could only exclude. RFC 9309 also formalizes wildcard matching (“*” in path patterns) and end-of-pattern anchoring (“$”), which had accumulated as de facto conventions in the intervening years but were never formally defined. The 500-kibibyte parsing minimum addresses a practical problem: robots.txt files had grown large enough that some crawlers were truncating them, producing inconsistent behavior.
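A short Python sketch shows how Allow, “*”, and “$” interact. It relies on a rule the paragraph above does not spell out: RFC 9309 resolves conflicts by taking the most specific (longest) matching rule, and prefers Allow when an Allow and a Disallow rule are equivalent. The function names pattern_to_regex and is_allowed and the sample rules are hypothetical, and pattern length is counted in characters here as an approximation of the RFC’s octet count.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" matches any run of characters; a trailing "$" anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    best_len, best_allow = -1, True
    for kind, pattern in rules:
        # re.match anchors at the start, so unanchored patterns act as prefixes.
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


# Hypothetical rule group for one user agent.
RULES = [
    ("disallow", "/search/"),
    ("allow", "/search/about"),
    ("disallow", "/*.pdf$"),
]

print(is_allowed(RULES, "/search/about"))
print(is_allowed(RULES, "/search/results"))
print(is_allowed(RULES, "/docs/report.pdf"))
```

Under these rules /search/about stays reachable because the Allow pattern is longer than the Disallow pattern that also matches it.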

The Allow/Disallow asymmetry of the 1994 standard was a design choice suited to a smaller, simpler web. Adding Allow was not a correction of a mistake but a recognition that the use cases had multiplied. By 2022 a service might want to tell most crawlers to stay out while letting specific crawlers in. The protocol needed to express yes as well as no.

RFC 9309 preserves the 1994 standard’s central claim, rephrased. Where Koster wrote “not enforced by anybody,” the RFC states: “These rules are not a form of access authorization.” The phrasing changed; the structural fact didn’t. A robots.txt file cannot stop a crawler. It communicates a preference. What happens next is up to the crawler.

A Change in Framing

The two documents describe the same problem differently, and the difference is legible.

Koster’s 1994 text is grievance-based. Servers swamped, files retrieved repeatedly, CGI scripts triggered, deep trees pulled in full. The author has specific damage in mind and describes it specifically. The robots causing these problems are not anonymous; they are software written by identifiable members of a known community, running on identifiable machines.

RFC 9309’s introduction frames the same concern at a distance: “It may be inconvenient for service owners if crawlers visit the entirety of their URI space.” Inconvenient. The 2022 document isn’t describing the same operational reality as the 1994 document, even though it’s governing the same protocol. The damage framing has been replaced by an abstraction: access preferences, URI space, service owners communicating with automated clients. Operational specifics would be out of place in a formal standard, which accounts for some of the distance. It doesn’t account for all of it.

In 1994, the community of robot authors was small enough to address directly. By 2022 it included every organization running a web crawler: search engines, academic researchers, commercial aggregators, and increasingly AI training pipelines ingesting text at scales the 1994 text couldn’t have imagined. Koster’s framing of compliance as a common facility offered by robot authors to the web community no longer described the compliance landscape. The incentives had changed. Search engines follow robots.txt not because of professional ethics but because ignoring it would expose them to legal risk and damage their relationships with the publishers whose content makes search useful. That is a governance mechanism, but it lives entirely outside the protocol.

RFC 9309 was published in September 2022. The connection between the timing and the proliferation of LLM training pipelines is an inference; the document doesn’t name AI training crawlers. What the document does do is formalize, in ABNF, a protocol whose de facto compliance infrastructure had been rebuilt at least once since 1994 without anyone updating the spec.

What the Two Documents Settled and Didn’t

Reading the June 1994 document and RFC 9309 together, the continuity is more striking than the change. The core mechanism is identical: a plain text file at /robots.txt, User-agent lines, Disallow rules, a “*” wildcard for robots not otherwise named. The additions in RFC 9309 (Allow, wildcards in paths, formal error handling) are genuine extensions, but the architecture they extend is unchanged. The web ran the protocol for twenty-eight years without a formal standard and then formalized it without changing what it fundamentally was.

What also persisted is the gap. The 1994 document knew it was proposing a protocol that couldn’t enforce itself and named the gap plainly. RFC 9309 names the same gap in different words: not a form of access authorization, a communication of preferences. The file at /robots.txt is still a request addressed to whoever chooses to read it.

RFC 9309 formalized the protocol without answering the question the 1994 document raised. The ABNF is precise. The compliance infrastructure is still negotiated outside it.