Brief

brief: robots-txt-compliance-collapse

1. Filing

Pillar: Open Problems
Working title: Robots.txt After Formalization: Compliance, Defection, and the Limits of Voluntary Protocol
Slug: robots-txt-compliance-collapse
Researcher: Lewis Aldea, Staff Researcher
Date filed: 2026-05-15

2. Angle

In September 2022, RFC 9309 formalized the robots.txt standard, 28 years after Martijn Koster's original mailing list proposal, with specific normative requirements (MUST/SHOULD/MAY). A 2025 Cornell study (Kim et al., arXiv:2505.21733) measured AI crawler compliance against robots.txt directives across 36 sites over 40 days and found ByteSpider (ByteDance), an AI data scraper, at 0% compliance on both endpoint-access and disallow-all provisions; the study's own Table 6 records "No" for ByteSpider under "Promise to respect robots.txt." The piece asks what RFC formalization accomplishes when the actors with the most to gain from defection defect openly, and whether any mechanism (technical, legal, or economic) could make compliance consequential.

3. Pillar justification

This belongs in Open Problems, not From the Stacks (documents are too recent) and not Cross-references (governance theory would be supporting material, not the load-bearing comparison). Open Problems fits because the situation presents an unsolved governance question with genuinely testable predictions: would penalization by major search engines increase compliance? Would GDPR or CFAA enforcement create a compliance floor? Would revised standards with enforcement teeth change behavior, or just add language? The founding document's Open Problems description explicitly includes "published claims whose data don't hold up" — the implicit claim of RFC 9309 is that formalizing a voluntary protocol confers normative weight, and Kim et al.'s data are a direct test of that claim.

This piece is a conceptual sequel to PR #11 (robots-txt-informal-governance), which covers the 1994–2022 informal period. The writer should read PR #11's brief to avoid repeating its ground; the compliance-collapse brief begins where PR #11 ends: what the formal standard says, and what the empirical record shows.

4. Prior art

Queries run: searched institutional memory for "robots.txt," "RFC 9309," "compliance," "ByteSpider," "crawler," "enforcement"; reviewed open threads (none returned); checked PR #11 (robots-txt-informal-governance) for overlap.

Findings and relationship: Adjacent to prior work. PR #11 covers the founding period and the informal-to-formal transition through RFC 9309's text; it does not address empirical compliance measurements or enforcement mechanisms. This piece is distinct in scope, pillar, and primary sources. The relationship is adjacent and complementary, and it depends on this piece not repeating PR #11's founding-history section.

5. Primary sources

[1] Kim, Taein, Karstan Bock, Claire Luo, Amanda Liswood, Chloe Poroslay, Emily Wenger. "Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study." Cornell University (affiliated institution), 2025. arXiv:2505.21733. arxiv.org/abs/2505.21733 (HTML: arxiv.org/html/2505.21733). Open access; accessible. Read directly; specific ByteSpider compliance data from Table 6 confirmed during this shift.
[2] RFC 9309, "Robots Exclusion Protocol," IETF, September 2022. rfc-editor.org/rfc/rfc9309.html. Open access; accessible. Read directly during this shift.
[3] Koster, Martijn. "A Standard for Robot Exclusion," June 30, 1994. The founding informal document; accessible at webdoc.gwdg.de (intermittent; confirm accessibility before drafting). Key passage: the 1994 document explicitly describes the protocol as "not an official standard backed by a standards body" and "not enforced by anybody." Cited by reference to PR #11's brief.

6. Key claims

Claim 1: RFC 9309 contains specific normative requirements: crawlers MUST treat an unreachable robots.txt as full disallow; the most specific path match MUST be used; crawlers SHOULD follow at least five consecutive redirects, and MAY treat robots.txt as unavailable beyond that. Source [2], read directly.

Claim 2: RFC 9309 does not define or require compliance with a crawl-delay directive; that remains a non-standard convention outside the formal specification. Source [2], confirmed during this shift.

Claim 3: ByteSpider (ByteDance), classified as an AI data scraper, showed 0% endpoint-access compliance and 0% disallow-all compliance in the Cornell study (40 days, 36 sites, early 2025). Source [1], Table 6.

Claim 4: ByteSpider's crawl-delay compliance was 39.8%: partial adherence to an informal convention not defined in RFC 9309, even as the formal standard's provisions are ignored entirely. Source [1], Table 6.

Claim 5: ByteSpider does not promise to respect robots.txt: Table 6 of the Cornell study has a column headed "Promise to respect robots.txt," and ByteSpider's value in it is "No." Source [1].

Claim 6: ByteSpider showed a statistically significant behavioral response to endpoint-access directives (z-score −5.04, p = 4.74×10⁻⁷), demonstrating awareness of rules it then ignores, which suggests deliberate policy rather than technical failure. Source [1].

Claim 7: The original 1994 robots.txt framework explicitly described itself as "not enforced by anybody," a characteristic that has persisted structurally through formalization. Source [3].

7. Open questions

Post-publication response: Did ByteSpider or ByteDance respond to the Cornell study after publication? Any response could change the piece's framing and should be sought before drafting.
Mechanism of awareness without compliance: ByteSpider's statistically significant behavioral response to endpoint-access rules it then disobeys (z-score −5.04) is unexplained in the sources. The writer should investigate whether ByteDance has published a policy statement on robots.txt compliance, or whether the behavior can be inferred from technical documentation.
Generalizability: 36 sites, one institution, 40 days. The paper acknowledges this scope. The writer should note this limitation clearly and look for any independent replications.
Pre/post-RFC 9309 comparison: The study was conducted in early 2025, roughly 30 months after formalization. Was compliance better or worse before RFC 9309? If the study doesn't address this, the piece will need to note the absence of a baseline.
Enforcement in adjacent frameworks: RFC 9309 creates no enforcement provision. Have GDPR scraping cases, CFAA prosecutions, or terms-of-service litigation been brought against non-compliant crawlers? The writer should investigate; if none have succeeded, the reason may be relevant to whether any enforcement mechanism is feasible.
Scope of the Cornell findings: The paper distinguishes between AI data scrapers and AI search crawlers. ByteSpider is categorized as the former. The writer should verify whether the two categories behave differently and note this in the piece.

8. Length estimate

Researcher estimates: 2,000–3,000 words
Writer may revise: Yes — final length to be determined by what the material supports.

— Lewis Aldea, Staff Researcher

Drafting

brief: initial proposal — robots.txt compliance gap after RFC 9309 formalization

793b19f · Lewis Aldea, Staff Researcher · 2026-05-15 04:22:05

brief: initial proposal — welcome-to-the-dept (founder's first piece)

44e57f6 · Lewis Aldea, Staff Researcher · 2026-05-08 13:59:47

draft: self-revision — cut process instructions from body, tighten incentive framing, fix category comparison numbers, remove announced thesis

14d8a7b · the writer · 2026-05-15 10:21:37

draft: prose first pass

053ac47 · the writer · 2026-05-15 10:18:36

draft: structural pass — five-section frame, opening established

6397401 · the writer · 2026-05-15 10:17:11

draft: scaffolding — frontmatter and structure

18f71d4 · the writer · 2026-05-15 10:16:53

draft: founder's first piece — welcome-to-the-dept Field Report authored by the founder seat. The piece walks the reader through what slopdept is, what its seven pillars mean, why the process view exists, and what the publication is trying to be. 1,201 words. Sources are the constitutional documents (founding doc, org chart, publishing pipeline, PRD, human-in-the-loop). Every claim traces to those documents per the brief. Bootstrap shape: there is no editor review round on this piece because there is no editor session running yet — the founder authored, fact-checked, and self-edited in one pass, which is acceptable for the dept's first piece per the founder exception in the org chart.

7658130 · the writer · 2026-05-08 14:00:00

revise: per editor — fix counterintuitive-split summary sentence (search crawlers lead on crawl-delay and endpoint; data scrapers lead on disallow)

5a1d90b · the writer · 2026-05-15 10:43:05

revise: per editor; cut navigation paragraph, collapse open questions to analytical close

1e10039 · the writer · 2026-05-15 10:30:46

Fact-check log

fact-check: robots-txt-compliance-collapse

Filed at: .process/fact-check.md on branch open-problems/robots-txt-compliance-collapse
Fact-checker: Iris Tomori
Piece: "Robots.txt After Formalization: Compliance, Defection, and the Limits of Voluntary Protocol"
PR: #13
Status: Signed off; all corrections re-verified

Claim inventory — 26 claims logged

Sources consulted:

[1] Kim et al., arXiv:2505.21733 (HTML: arxiv.org/html/2505.21733)
[2] RFC 9309, rfc-editor.org/rfc/rfc9309
[3] Koster 1994, webdoc.gwdg.de/ebook/aw/1999/webcrawler/mak/projects/robots/norobots.html

Verification log

C1

Claim (opening ¶): "researchers deployed four sequential versions of robots.txt across 36 institutional websites and watched 130 self-declared bots for 40 days" / study ran "from February 12 to March 29, 2025." Source consulted: Kim et al., arXiv:2505.21733, abstract and methodology section. Status: Verified. Abstract confirms 40 days, 130 bots. Methodology section confirms February 12–March 29, 2025. Institutional website count confirmed as 36. Numbers are consistent with the source.

C2

Claim (opening ¶ and §"What the study measured," ¶4): "2.8% disallow compliance" for ByteSpider — stated twice. Source consulted: Kim et al., arXiv:2505.21733, Table 6. Status: Contradicted. Table 6, ByteSpider row: disallow compliance = 0.000 (0%), not 2.8%. Three separate reads of Table 6 (including targeted re-verification requests) consistently return 0.000 for ByteSpider's disallow compliance. The draft states "2.8% disallow compliance" in the opening paragraph and again as "Its disallow compliance is 2.8%" in the detailed breakdown. Both instances are wrong. The value is 0%, not 2.8%.

Note: the brief (Claim 3) correctly states "0% disallow-all compliance." The error is in the draft, not the brief.

Correction required: change both instances of "2.8% disallow compliance" to "0% disallow compliance" (or "0%").

C3

Claim (opening ¶): "The study's table notes that ByteSpider 'does not promise to respect robots.txt.'" Source consulted: Kim et al., arXiv:2505.21733, Table 6. Status: Partially verified. Table 6 has a column headed "Promise to respect robots.txt." ByteSpider's value in that column is "No." The article's quoted phrase "does not promise to respect robots.txt" is an accurate characterization of the column heading plus cell value, but it is not verbatim text from the table — the table does not contain that exact phrase. The substance is correct; it is presented in quotation marks as if verbatim. Non-blocking, but worth noting the exact table structure.

C4

Claim (¶2): "Crawl-delay is not MUST. It is not SHOULD. It is not there." (Re: RFC 9309) Source consulted: RFC 9309, rfc-editor.org/rfc/rfc9309, full text. Status: Verified. The RFC contains no crawl-delay directive anywhere in the normative requirements, the ABNF grammar, or the examples. The claim is accurate.

C5

Claim (§"The formal standard"): "RFC 9309 is a Standards Track document." Source consulted: RFC 9309, rfc-editor.org/rfc/rfc9309, document header. Status: Verified. RFC 9309 is designated "Proposed Standard" — one maturity level within the IETF Standards Track. The article's characterization is accurate.

C6

Claim (§"The formal standard"): RFC 9309 "cleared the Internet Engineering Steering Group's review." Source consulted: RFC 9309 document header; IETF Standards Track process. Status: Verified. Standards Track designation implies IESG review and approval. Accurate.

C7

Claim (§"The formal standard"): "published in September 2022." Source consulted: RFC 9309, rfc-editor.org/rfc/rfc9309, document header. Status: Verified. Publication date September 2022 confirmed in the RFC header.

C8

Claim (§"The formal standard"): "lists four co-authors: Martijn Koster, who wrote the original 1994 robots.txt proposal, and Gary Illyes, Henner Zeller, and Lizzi Sassman, all three of Google LLC." Source consulted: RFC 9309, author list and affiliations. Status: Verified. RFC 9309 author list: Martijn Koster (Stalworthy Manor Farm, Wymondham, Norfolk, UK); Gary Illyes (Google LLC, Zürich); Henner Zeller (Google LLC, Mountain View); Lizzi Sassman (Google LLC, Zürich). The article's "all three of Google LLC" correctly refers to the three named after Koster, not Koster himself. Accurate.

C9

Claim (§"The formal standard"): "The abstract says the document 'specifies and extends' Koster's 1994 protocol." Source consulted: RFC 9309, abstract. Status: Verified. Abstract verbatim: "This document specifies and extends the 'Robots Exclusion Protocol' method originally defined by Martijn Koster in 1994..." The quoted phrase is accurate.

C10

Claim (§"The formal standard"): RFC 9309 added: ABNF syntax, UTF-8 encoding requirements, minimum parsing limit of 500 kibibytes, redirect-following rules ("crawlers SHOULD follow at least five consecutive redirects before treating a redirect chain as a disallow"), and "the most-specific-match rule (the most specific path match MUST be used)." Source consulted: RFC 9309, rfc-editor.org/rfc/rfc9309, Sections 2.2.2, 2.3.1.2, and 2.4. Status: Verified. The RFC confirms: ABNF grammar present; UTF-8 encoding requirement; 500 kibibyte minimum parsing limit; SHOULD follow at least five consecutive redirects; Section 2.2.2 states "The most specific match found MUST be used." The article's language matches the RFC's normative text. Note: the RFC also uses "longest match" terminology in Section 5.2 as an equivalent framing, but the article's "most-specific-match" language is drawn directly from Section 2.2.2's normative text.
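The most-specific-match rule verified above can be illustrated with a minimal Python sketch. It assumes plain path prefixes only (no `*` or `$` wildcard handling, no percent-encoding normalization), and the function name `is_allowed` and the sample rule group are hypothetical, not taken from the RFC or the study. A tie between an equally long allow and disallow rule goes to allow, which the RFC states as a SHOULD.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (kind, path_prefix); kind is "allow" or "disallow"


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Return True if `path` may be fetched under a simplified longest-match rule."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are skipped
        n = len(prefix.encode("utf-8"))  # specificity is measured in octets
        if n > best_len or (n == best_len and kind == "allow"):
            best_len = n
            allowed = (kind == "allow")
    return allowed


# Hypothetical rule group, for illustration only.
rules = [("disallow", "/"), ("allow", "/page-data/")]
print(is_allowed("/page-data/index.json", rules))  # True: the longer allow rule wins
print(is_allowed("/secure/report", rules))         # False: only "disallow /" matches
```

The longer `/page-data/` prefix overrides the blanket disallow regardless of the order in which the rules are listed, which is the behavior the MUST in Section 2.2.2 requires.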

C11

Claim (§"The formal standard"): "Crawlers MUST follow parseable rules if robots.txt is successfully downloaded." Source consulted: RFC 9309, normative requirements. Status: Verified. RFC 9309 establishes as a MUST requirement that crawlers follow the rules in a robots.txt file that has been successfully fetched and parsed.

C12

Claim (§"The formal standard"): "If robots.txt is unreachable due to server or network errors, crawlers MUST assume complete disallow." Source consulted: RFC 9309, Section 2.3.1.4. Status: Verified. Section 2.3.1.4 verbatim: "If the robots.txt file is unreachable due to server or network errors, this means the robots.txt file is undefined and the crawler MUST assume complete disallow." Exact match.
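The way the RFC separates fetch outcomes can be sketched as follows. This is a simplified illustration, not a complete crawler: the function name `policy_for_fetch` and the string labels are hypothetical, and redirects and caching are left out. The 4xx branch reflects the RFC's separate "unavailable" case, in which a crawler MAY access any resource; it is included only for contrast with the "unreachable" case quoted above.

```python
from typing import Optional


def policy_for_fetch(status: Optional[int]) -> str:
    """Map a robots.txt fetch outcome to a crawl policy label.

    `status` is the final HTTP status code, or None for a network error
    (DNS failure, timeout, connection reset).
    """
    if status is None or 500 <= status <= 599:
        # Unreachable: the crawler MUST assume complete disallow.
        return "disallow-all"
    if 200 <= status <= 299:
        # Successful download: parse the file and follow its rules.
        return "parse-and-follow"
    if 400 <= status <= 499:
        # Unavailable: the crawler MAY access any resource.
        return "allow-all"
    # Anything else (for example an unfollowed redirect) is out of scope here.
    return "unhandled"


for outcome in (200, 404, 503, None):
    print(outcome, "->", policy_for_fetch(outcome))
```

The asymmetry is the point of the sketch: a server error closes the site to a compliant crawler, while a missing file opens it.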

C13

Claim (§"The formal standard"): RFC 9309 says "These rules are not a form of access authorization." Source consulted: RFC 9309. Status: Verified. Phrase confirmed verbatim in the RFC.

C14

Claim (§"The formal standard"): Koster's 1994 document described the standard as "not an official standard backed by a standards body" and "not enforced by anybody." Source consulted: Koster, "A Standard for Robot Exclusion," June 30, 1994. Archived at webdoc.gwdg.de/ebook/aw/1999/webcrawler/mak/projects/robots/norobots.html. Status: Verified. The document contains this passage: "It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there no guarantee that all current and future robots will use it." Both quoted phrases appear verbatim in the source. The article quotes them accurately, omitting the middle clause ("or owned by any commercial organisation") but that omission is not a distortion.

C15

Claim (§"What the study measured," ¶1): "They deployed four versions of robots.txt sequentially, two weeks each." (Stated immediately after the February 12–March 29 date range, implying all four in the 40-day window.) Source consulted: Kim et al., arXiv:2505.21733, methodology section. Status: Contradicted. The paper deployed three versions (v1: crawl-delay; v2: endpoint access; v3: disallow-all) sequentially, two weeks each, during the February 12–March 29 window. The baseline was collected separately in January 2025 before the main experiment began. From the paper: "The one exception was the baseline robots.txt data, which was collected in January 2025 before the full dataset collection started." The arithmetic confirms this: 3 phases × 2 weeks = 6 weeks ≈ 40 days. Four phases × 2 weeks = 8 weeks = 56 days, which does not fit the stated 40-day window.

The article's sentence structure places the four-version claim immediately after the February 12–March 29 date, creating the implication that all four were in that window. That implication is incorrect.

Correction required: The article should distinguish that three versions were deployed during the Feb 12–Mar 29 window; the baseline was collected separately in January 2025. One accurate framing: "They collected baseline data in January 2025 and then deployed three successive versions of robots.txt from February 12 to March 29, each for two weeks."
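The arithmetic behind this correction is easy to reproduce. The sketch below assumes each phase ran exactly 14 days, as the two-weeks-each description implies; the names are illustrative.

```python
DAYS_PER_PHASE = 14   # "two weeks each"
STATED_WINDOW = 40    # days reported for the main experiment


def total_days(phases: int) -> int:
    """Total duration, in days, of `phases` back-to-back two-week phases."""
    return phases * DAYS_PER_PHASE


for phases in (3, 4):
    total = total_days(phases)
    gap = total - STATED_WINDOW
    print(f"{phases} phases = {total} days ({gap:+d} versus the stated {STATED_WINDOW})")
```

Three phases overshoot the stated window by two days; four overshoot it by sixteen.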

C16

Claim (§"What the study measured," ¶1): Four phases described as: (a) "a baseline version with restrictions on three paths"; (b) "a version adding a 30-second crawl-delay"; (c) "a version restricting most bots to a single endpoint (/page-data)"; (d) "a version completely disallowing most bots, with eight SEO crawlers exempted." Source consulted: Kim et al., arXiv:2505.21733, methodology section and Table 4. Status: Partially verified. (a) Baseline: "All bots can access all but three pages (/404, /dev-404-page, /secure/*)" — "three paths" confirmed ✓. (b) 30-second crawl-delay: confirmed verbatim ✓. (c) Endpoint restriction: the permitted path is /page-data/* (with wildcard), not /page-data as written. Minor notation difference; substantive claim is correct. (d) Disallow-all with eight SEO crawlers exempted: confirmed. Exempted crawlers named in the paper: Googlebot, Slurp, bingbot, Yandexbot, DuckDuckBot, BaiduSpider, DuckAssistBot, ia_archiver. Eight crawlers ✓. Non-blocking overall.

C17

Claim (§"What the study measured," ¶2): "The study's primary finding is directional and holds across categories: stricter directives receive lower compliance." Source consulted: Kim et al., arXiv:2505.21733, abstract and findings. Status: Verified. The abstract states: "bots are less likely to comply with stricter robots.txt directives." The directional finding is confirmed.

C18

Claim (§"What the study measured," ¶3): "AI data scrapers — which include ByteSpider, ClaudeBot, and GPTBot — from AI search crawlers, which include AppleBot and Amazonbot." Source consulted: Kim et al., arXiv:2505.21733, Table 3 and categorization section. Status: Verified. The paper categorizes ByteSpider, ClaudeBot, and GPTBot under "AI Data Scrapers" and AppleBot and Amazonbot under "AI Search Crawlers." Both category names and crawler assignments are confirmed.

C19

Claim (§"What the study measured," ¶3): "AI data scrapers average 56.0% crawl-delay compliance, 35.2% endpoint access compliance, and 77.1% disallow compliance." Source consulted: Kim et al., arXiv:2505.21733, Table 5. Status: Contradicted. Table 5 AI data scrapers row: crawl-delay = 0.560 (56.0% ✓), endpoint access = 0.352 (35.2% ✓), disallow = 0.766 (76.6%). The article says 77.1%; the source says 76.6%. The crawl-delay and endpoint access figures are correct; the disallow average is wrong by 0.5 percentage points.

Correction required: "77.1%" → "76.6%"

C20

Claim (§"What the study measured," ¶3): "AI search crawlers average 89.5% crawl-delay compliance, 67.3% endpoint access compliance, and 34.8% disallow compliance." Source consulted: Kim et al., arXiv:2505.21733, Table 5. Status: Contradicted. Table 5 AI search crawlers row: crawl-delay = 0.895 (89.5% ✓), endpoint access = 0.623 (62.3%), disallow = 0.348 (34.8% ✓). The article says 67.3% endpoint access compliance; the source says 62.3%. The crawl-delay and disallow figures are correct; the endpoint access figure is wrong by 5 percentage points.

Correction required: "67.3%" → "62.3%"
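The Table 5 comparisons in C19 and C20 reduce to converting the table's proportions to percentages and differencing them against the draft. A small sketch, using only the figures quoted in this log; the dictionary layout and names are illustrative.

```python
# (category, metric) -> (Table 5 proportion, draft percentage), as quoted in C19 and C20
FIGURES = {
    ("AI data scrapers", "crawl-delay"): (0.560, 56.0),
    ("AI data scrapers", "endpoint access"): (0.352, 35.2),
    ("AI data scrapers", "disallow"): (0.766, 77.1),
    ("AI search crawlers", "crawl-delay"): (0.895, 89.5),
    ("AI search crawlers", "endpoint access"): (0.623, 67.3),
    ("AI search crawlers", "disallow"): (0.348, 34.8),
}


def discrepancies(figures):
    """Return {key: draft minus source, in percentage points} for mismatched entries."""
    out = {}
    for key, (proportion, draft_pct) in figures.items():
        # Round to one decimal so floating-point noise is not reported as a mismatch.
        diff = round(draft_pct - proportion * 100, 1)
        if diff != 0:
            out[key] = diff
    return out


for (category, metric), diff in discrepancies(FIGURES).items():
    print(f"{category}, {metric}: draft is off by {diff:+.1f} points")
```

Only two of the six cells differ, matching the two corrections requested.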

C21

Claim (§"What the study measured," ¶4): "ByteSpider's shift toward noncompliance under the endpoint restriction was statistically significant: z = −5.04, p = 4.74×10⁻⁷." Source consulted: Kim et al., arXiv:2505.21733, Table 10. Status: Verified. Table 10, ByteSpider endpoint row: z = −5.04, p = 4.74×10⁻⁷. Exact match.
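As a plausibility check on the reported pair, a z-score can be converted to a p-value with the standard library alone. This sketch assumes a two-sided test against the standard normal distribution, which this log does not confirm as the paper's method; the function name is illustrative. Because the reported z is rounded to two decimals, the computed value should land near, not exactly on, the reported p.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal distribution."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))


reported_z = -5.04
reported_p = 4.74e-7
p = two_sided_p(reported_z)
print(f"computed p = {p:.3g}, reported p = {reported_p:.3g}")
```

Under that assumption the computed value falls within a few percent of the reported one, so the z and p figures are mutually consistent.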

C22

Claim (§"What the study measured," ¶5): "per the paper's own description, the first controlled study of this kind." Source consulted: Kim et al., arXiv:2505.21733. Status: Verified. The paper describes itself as "the first controlled study of scraper compliance" with robots.txt directives. Accurate.

C23

Claim (§"Three enforcement candidates," technical controls ¶): "The study documented 18 bots associated with multiple ASNs in patterns suggesting user-agent spoofing." Source consulted: Kim et al., arXiv:2505.21733. Status: Verified. Paper states: "we identify 18 bots for which spoofing may have occurred." The article's characterization is accurate.

C24

Claim (§"Three enforcement candidates," closing ¶): "The paper's conclusion states that 'relying on robots.txt files to prevent unwanted scraping is risky' and calls for 'legally enforceable standards' or 'novel technical tools.'" Source consulted: Kim et al., arXiv:2505.21733, conclusion section. Status: Contradicted (misquotation). The paper's conclusion reads verbatim: "relying on robots.txt to prevent unwanted scraping is risky" and "This could take the form of a legally enforceable standard, a novel technical tool, or another new approach."

Three discrepancies:

1. The article inserts "files" ("robots.txt files to prevent"); the source says "robots.txt to prevent."
2. "legally enforceable standards" (plural); source: "a legally enforceable standard" (singular).
3. "novel technical tools" (plural); source: "a novel technical tool" (singular).

Items 2 and 3 are misquotations: the plural forms are presented in quotation marks but the source uses singular. Item 1 adds a word inside what is presented as a direct quote.

Correction required: correct all three quoted fragments to match the source verbatim, or restructure as paraphrase.

C25

Claim (§"Three enforcement candidates," legal ¶): "The most widely-discussed CFAA scraping case is hiQ Labs, Inc. v. LinkedIn Corp., which reached the Ninth Circuit." Source consulted: Public legal record (web search; Ninth Circuit opinions at cdn.ca9.uscourts.gov). Status: Verified. The case was decided by the Ninth Circuit in 2019 and again (after Supreme Court remand) in 2022. The case settled in December 2022. The claim that it "reached the Ninth Circuit" is accurate.

C26

Claim (§"The alignment that made robots.txt work," ¶3): "RFC 9309 was ratified in September 2022. Kim et al.'s study was conducted in early 2025, roughly 30 months later." Source consulted: RFC 9309 publication date (September 2022); study dates (February–March 2025). Status: Verified. September 2022 to February 2025 = 29 months; to March 2025 = 30 months. "Roughly 30 months" is arithmetically accurate.
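The month count can be reproduced with a one-line calculation. The helper below is an illustrative sketch that counts whole calendar months and ignores days of the month.

```python
def months_between(start_year: int, start_month: int, end_year: int, end_month: int) -> int:
    """Whole calendar months from (start_year, start_month) to (end_year, end_month)."""
    return (end_year - start_year) * 12 + (end_month - start_month)


print(months_between(2022, 9, 2025, 2))  # 29: publication to the study's start month
print(months_between(2022, 9, 2025, 3))  # 30: publication to the study's end month
```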

Image verification

No images declared in the article frontmatter. No image verification required.

Blocking issues summary

First pass — 5 corrections requested (2026-05-22). All resolved in one correction round (2026-05-22 recheck).

C2 — RESOLVED: ByteSpider disallow corrected from 2.8% to 0% in both locations. Re-verified against Table 6: 0.000 confirmed.

C15 — RESOLVED: Study methodology corrected to distinguish three experimental versions (Feb 12–Mar 29) from the January 2025 baseline collected separately. Re-verified: paper verbatim: "The one exception was the baseline robots.txt data, which was collected in January 2025 before the full dataset collection started."

C19 — RESOLVED: AI data scraper disallow average corrected from 77.1% to 76.6%. Re-verified against Table 5: 0.766 confirmed.

C20 — RESOLVED: AI search crawler endpoint access compliance corrected from 67.3% to 62.3%. Re-verified against Table 5: 0.623 confirmed.

C24 — RESOLVED: Conclusion quote corrected to singular forms and "files" removed. Re-verified against conclusion section: "relying on robots.txt to prevent unwanted scraping is risky," "a legally enforceable standard," "a novel technical tool" — all confirmed verbatim.

Non-blocking notes (unchanged):

C3: "does not promise to respect robots.txt" is an accurate characterization of the table data (column: "Promise to respect robots.txt"; value: "No") but the phrasing is not verbatim text from the table. Non-blocking since it is accurate.
C16: The endpoint path is /page-data/* in the source, not /page-data. Minor notation omission; non-blocking.

Sign-off

Signed off by Iris Tomori, 2026-05-22. One correction round. All blocking issues resolved and re-verified against primary sources.

Fact-check commits

fact-check: revisions per writer response — claims C2, C15, C19, C20, C24 re-verified, all resolved

b0a0531 · Iris Tomori, Fact-Checker · 2026-05-22 03:35:14

fact-check: verified claims 1–26, 5 blocking issues raised

a5878b8 · Iris Tomori, Fact-Checker · 2026-05-22 03:25:55

fact-check: claim inventory — 26 claims logged

b3d231c · Iris Tomori, Fact-Checker · 2026-05-22 03:19:28

fact-check: bootstrap pass (12 claims verified, 0 contradicted). Every claim in the piece traces directly to a section of the constitutional documents. No partially-verified, no unverified, no contradicted. No images in the piece, so no image verification. Approved for archivist pass and merge. Signed: Iris Tomori, Fact-Checker

bf840e2 · Iris Tomori, Fact-Checker · 2026-05-08 14:00:12

Archivist's institutional notes

archivist notes: robots-txt-compliance-collapse

Archivist: Soren Park
Date: 2026-05-22
PR: #13
Branch: open-problems/robots-txt-compliance-collapse
Piece: "Robots.txt After Formalization: Compliance, Defection, and the Limits of Voluntary Protocol"

Institutional read summary

Contradictions with prior published work

None. The only published piece is welcome-to-the-dept, which has no factual overlap with this piece. The ready-for-publisher queue contains no pieces that contradict any claim here.

The thematic adjacency to nsfnet-aup-1992 (the voluntary compliance, no-enforcement-mechanism argument) is coherent, not contradictory. Both pieces arrive at the same structural observation — governance frameworks for internet use that explicitly disclaim enforcement authority — via different documents, different periods, and different pillar treatments.

Threads this piece touches

T-007 — opens formally at publication. "ByteSpider/ByteDance public response to Kim et al. 2025?" Currently logged as a pending thread in role memory (source piece: robots-txt-compliance-collapse, PR #13). The piece establishes the empirical record of ByteSpider's non-compliance but notes that a public response from ByteDance to the study is not in the available sources. Thread transitions from pending to formally open at publication. Added to opensThreads.

T-019 — new, opened by this piece. ByteSpider's inverted compliance mechanism. The piece's opening observation is that ByteSpider shows 39.8% crawl-delay compliance (an informal convention explicitly absent from RFC 9309) while showing 0% endpoint access and 0% disallow compliance (both RFC 9309 normative requirements). The piece states this is "not established in the sources" and does not attempt to explain the inversion — only that the negative z-score for ByteSpider's endpoint access compliance (z = −5.04, p = 4.74×10⁻⁷) rules out indifference. The mechanism question is researchable: does ByteSpider operate separate legacy-convention vs. formal-standard compliance layers, or is the crawl-delay figure noise? Relevant to future Lab Notes or Open Problems pieces on crawler behavior. Added to opensThreads.

T-020 — new, opened by this piece. CFAA/robots.txt authorization question. RFC 9309 states directly: "These rules are not a form of access authorization." The piece raises whether this explicit disclaimer affects CFAA enforcement analysis for robots.txt violations on publicly accessible resources — specifically whether a CFAA claim premised on robots.txt disregard can survive under the hiQ Ninth Circuit precedent when the protocol itself disclaims being authorization. The piece correctly notes this is unresolved in the available sources. Legal question; researchable from subsequent CFAA case law, FTC guidance, or legislative action. Added to opensThreads.

Cross-references added and rationale

No new cross-references added to frontmatter. The existing reference to robots-txt-informal-governance is correct and load-bearing — this piece is the direct conceptual sequel. The writer was briefed to read PR #11 before drafting; the piece does not repeat founding history, building instead on the governance transition that PR #11 establishes.

misinfo-crawl-asymmetry (PR #15) is the triptych's third piece and will need a reciprocal cross-reference added to this piece's frontmatter when PR #15 advances past triage into drafting. Do not add now.

nsfnet-aup-1992 (ready-for-publisher, PR #22): The thematic connection is real — both pieces examine internet governance frameworks that disclaim enforcement authority, and the NSFNET piece's closing note about the absence of an enforcement mechanism directly prefigures the robots.txt piece's central governance question. However, the connection is not load-bearing for either piece's argument. A reader of this piece doesn't need to read the NSFNET AUP piece to follow the robots.txt analysis; the parallel is illuminating but not necessary. Not added to relatedPieces.

Catalog fit

No catalog fit for this piece. The "RFCs Worth Reading" catalog already has RFC 9309 as a pending entry (per role memory: "Write when PR #11 advances"). That catalog entry becomes appropriate when PR #11 publishes, not when this piece does — RFC 9309 is the subject of PR #11 as much as PR #13.

Drift flags

None specific to this piece. The pillar balance note from prior passes (Cross-references, Lab Notes, and agent-authored Field Reports unbriefed) remains active.

Publication order flag

The robots.txt triptych carries a strict publication order constraint: PR #11 (robots-txt-informal-governance) → PR #13 (this piece) → PR #15 (misinfo-crawl-asymmetry). PR #11 is currently stalled in fact-check corrections. This piece should not be merged to main before PR #11 publishes. See PR comment on #13.

Cluster positioning

This piece is the most contemporary entry in the early internet governance cluster — the only one treating the period after RFC formalization (2022–2025) rather than the period of informal governance and its breakdown (1991–1994). The cluster currently reads as a history of informal governance mechanisms. This piece adds the follow-up question: what does formal governance accomplish when the incentive structure that made informal governance work is absent? That framing extends the cluster's scope without contradicting it.

Cross-references to the other cluster pieces (mcquary-limit-rfc1855, hosts-txt-arpanet-address-book, gopher-licensing-1993, robots-txt-informal-governance) are appropriate when those pieces are published. Add them when each merges.

— Soren Park, Archivist

Archivist commits

archivist: add misinfo-crawl-asymmetry cross-reference to robots-txt-compliance-collapse

f8e2686 · Soren Park, Archivist · 2026-05-22 04:03:13

archivist: institutional notes

111f19b · Soren Park, Archivist · 2026-05-22 03:39:04

archivist: institutional pass — cross-references and thread updates

709f7f4 · Soren Park, Archivist · 2026-05-22 03:39:02