Brief

brief: open-problems/misinfo-crawl-asymmetry

1. Filing

Pillar: Open Problems
Working title: The Asymmetric Gate: Propaganda Networks, robots.txt, and What AI Models Learn
Slug: misinfo-crawl-asymmetry
Researcher: Lewis Aldea, Staff Researcher
Date filed: 2026-05-16

2. Angle

Three findings converge. A 2026 ACM study found that reputable news sites block AI crawlers in robots.txt at more than six times the rate of misinformation sites (60.0% vs. 9.1%). A DFRLab investigation found that state-adjacent propaganda networks configure robots.txt in ways that invite crawler ingestion, and documented Llama 3.1 405B Base reproducing a state-media article almost verbatim. Anthropic research shows that 250 malicious documents are sufficient to backdoor a large language model across the tested range of 600M to 13B parameters. Common Crawl, the primary training data source for major open language models, performs no post-crawl filtering based on robots.txt status, so the asymmetry flows directly from web governance to model weights. The piece generates four testable predictions that would distinguish the "quality filtering saves us" hypothesis from the "it doesn't" hypothesis, and asks whether the current training pipeline has any mechanism that would even notice this specific kind of skew.

3. Pillar justification

This is Open Problems, not Cross-references. Cross-references draws load-bearing analogies between independent fields; here, the relevant domains (web standards governance, information warfare, ML training pipelines) are actual causal links in the same mechanism, not parallel phenomena being compared for insight. The piece's central contribution is identifying a structural vulnerability and generating predictions that remain genuinely unresolved — specifically, whether quality filtering pipelines happen to exclude the affected content or whether they are blind to it. Open Problems explicitly welcomes hypothesis generation labeled as prediction, not finding; that is the piece's output format. The piece relates to PR #11 (history of robots.txt) and PR #13 (AI crawler defection from robots.txt) but is neither history nor a defection audit — it is a question about what flows downstream from the asymmetry and what would have to be true for that flow to be benign.

4. Prior art

Queries run:

Searched institutional memory (searchInstitutionalMemory) for "misinformation AI training data robots.txt crawl asymmetry" — 0 results.
Queried open threads (queryThreads, status: open) — 0 threads returned.
Checked candidate log: "misinfo-crawl-asymmetry" appears in role memory as a logged candidate from 2026-05-14 and 2026-05-15. Conditions blocking promotion (testable predictions, LLM training data source confirmation) resolved this shift.
Reviewed existing brief PRs: PR #11 (robots-txt-informal-governance), PR #13 (robots-txt-compliance-collapse), PR #14 (fabricated-citations-2026). None address differential robots.txt blocking by site credibility type or training data downstream effects.

Findings and relationship: Net new piece. Topically adjacent to PR #11 (robots.txt origin history) and PR #13 (crawler compliance). The writer should cross-reference both in the published piece. This brief's angle is distinct: those pieces describe the mechanics and failures of robots.txt governance; this one traces what flows from a specific governance asymmetry into model weights.

5. Primary sources

[1] Steinacker-Olsztyn, Nicolas, Devashish Gosain, and Ha Dao. "Is Misinformation More Open? A Study of robots.txt Gatekeeping on the Web." Proceedings of the ACM Web Conference 2026. arXiv:2510.10315. Submitted October 2025; ACM DL DOI: 10.1145/3774904.3792625. Open access; read directly. Key data: 60.0% of reputable news sites (MBFC-rated high-credibility) disallow at least one AI crawler in robots.txt; 9.1% of misinformation sites do. Reputable sites average 15.5 blocked user-agent strings; misinformation sites average fewer than one.

[2] Digital Forensic Research Lab (DFRLab). "Pravda in the pipeline: Early evidence of state-adjacent propaganda in AI training data." April 8, 2026. dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Read directly. Methodology: text-completion testing of Llama 3.1 405B Base using opening sentences from known Pravda Network articles. Key findings: approximately 40,000 English-language Pravda Network articles archived in Common Crawl by November 2025 (up from 37 in November 2024); the network used strategic robots.txt and sitemap configurations to maximize crawler access. RT (Russia) January 2023 article reproduced "almost verbatim." Glassbridge (China) less successfully ingested due to JavaScript rendering. Model tested: Llama 3.1 405B Base (December 2023 knowledge cutoff).

[3] Anthropic (with UK AI Security Institute and Alan Turing Institute). "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples." arXiv:2510.07192, October 2025. Also: anthropic.com/research/small-samples-poison. Open access. Key finding: 250 malicious documents (approximately 420,000 tokens, approximately 0.00016% of training data) suffice to backdoor LLMs from 600M to 13B parameters. Near-constant regardless of model size. Largest data poisoning study to date.

[4] Common Crawl Foundation. "Frequently Asked Questions." commoncrawl.org/faq. Read directly. CCBot (Common Crawl's crawler) honors robots.txt at crawl time. No post-crawl filtering of the published dataset based on robots.txt status is applied. No robots.txt-filtered dataset variant is provided. Data released under open-data principles with minimal restrictions.

[5] Meta AI. "The Llama 3 Herd of Models." arXiv:2407.21783, July 2024. Partially accessible via ar5iv; Common Crawl identified as primary training source from paper methods (confirmed across multiple secondary reads). Describes quality filtering (heuristic, FastText classifiers, RoBERTa-based quality scoring, deduplication) applied to web data. Does not address robots.txt compliance or webmaster exclusion signals as filtering criteria. Domain-level exclusions documented for PII and adult content; no credibility-based domain exclusion mentioned.

6. Key claims

Claim 1: Reputable news sites block at least one AI crawler in robots.txt at 6.6 times the rate of misinformation sites (60.0% vs. 9.1%), and average 15.5 blocked AI user-agent strings versus fewer than one for misinformation sites. — Source [1]
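The posture difference in Claim 1 can be illustrated with a minimal sketch built on Python's standard urllib.robotparser. The agent list and both robots.txt bodies below are illustrative assumptions, not data from source [1]; the study's own agent inventory is not reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample of AI crawler user-agent tokens (assumption, not the study's list).
AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "Google-Extended"]

def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Return the agents that this robots.txt body forbids from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]

# Illustrative restrictive posture: named AI crawlers are disallowed, everyone else allowed.
RESTRICTIVE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative permissive posture: no restrictions, plus a sitemap pointer.
PERMISSIVE = """\
User-agent: *
Allow: /
Sitemap: https://example.org/sitemap.xml
"""

print(blocked_agents(RESTRICTIVE))  # agents blocked by the restrictive file
print(blocked_agents(PERMISSIVE))   # agents blocked by the permissive file
```

Counting the returned list per site is one plausible way to arrive at a "blocked user-agent strings" figure; whether source [1] counts agents this way is an assumption.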

Claim 2: The Pravda Network specifically configured robots.txt and XML sitemaps to maximize AI crawler access — the structural inverse of what reputable news sites do — and grew its Common Crawl footprint from 37 English-language articles (November 2024) to approximately 40,000 (November 2025). — Source [2]

Claim 3: Common Crawl, which the Llama 3 paper identifies as its primary training data source at 15 trillion tokens, applies no post-crawl filtering based on robots.txt status; researchers training on Common Crawl receive the full archive without any robots.txt-aware exclusions. — Sources [4], [5]

Claim 4: Llama 3.1 405B Base reproduced state-media content (a January 2023 RT article) "almost verbatim" via text completion, establishing that the robots.txt-facilitated ingestion pathway has a measurable effect on weights in at least one major open model. (Source [2])

Claim 5: 250 malicious documents representing 0.00016% of training tokens suffice to backdoor LLMs regardless of model size — which means the Pravda Network's 40,000-article Common Crawl footprint is above this threshold by a factor of roughly 160, even before any question of optimization. — Sources [3], [2] combined
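The bounding arithmetic behind Claim 5 can be checked with a short sketch. All inputs are the figures quoted in sources [2] and [3] above; the implied corpus size is derived from them here and is not a number stated in this brief.

```python
# Figures quoted in sources [2] and [3].
POISON_DOCS = 250
POISON_TOKENS = 420_000
POISON_FRACTION = 0.00016 / 100   # 0.00016% expressed as a fraction
PRAVDA_ARTICLES = 40_000

# How far the Pravda footprint exceeds the 250-document threshold.
threshold_multiple = PRAVDA_ARTICLES / POISON_DOCS

# Average size of one poison document, in tokens.
tokens_per_poison_doc = POISON_TOKENS / POISON_DOCS

# Training-corpus size implied by the token count and the quoted percentage (derived).
implied_corpus_tokens = POISON_TOKENS / POISON_FRACTION

print(f"threshold multiple: {threshold_multiple:.0f}x")
print(f"tokens per poison document: {tokens_per_poison_doc:.0f}")
print(f"implied corpus size: {implied_corpus_tokens:.3g} tokens")
```

The multiple compares document counts only. It says nothing about whether organically crawled articles behave like deliberately constructed poison samples, which is the methodology gap noted in section 7.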

7. Open questions

Central unresolved question: Does Meta's quality filtering pipeline (heuristic, classifier, deduplication) happen to exclude Pravda Network content, or does linguistically-adequate propaganda survive into training weights? The DFRLab text-completion result establishes that the content is in the weights of Llama 3.1 405B Base — but does not establish whether filtering reduced the concentration or missed it entirely. This is the pivot on which the piece's predictions rest. The writer should not resolve this; it is the genuine open question.

Methodology gap (poisoning vs. organic contamination): The Anthropic 250-document threshold was established using intentional backdoor-injection (an adversary optimizing document placement). The DFRLab finding involves organic training data contamination, not deliberate poisoning. Whether the 40,000 Pravda articles function analogously to 250 strategically-placed poisoning documents is not established; the piece should present the comparison as a bounding argument, not a direct equivalence.

Researcher gap: Whether model developers apply any robots.txt-aware filtering beyond Common Crawl's crawl-time compliance when selecting training data from Common Crawl archives is undocumented. The Llama 3 paper is silent on this. The writer should search for public documentation from Meta, Mistral, EleutherAI, or Falcon on this point; the absence of documentation is itself a finding worth reporting.

MBFC as ground truth: Source [1] uses MBFC (Media Bias/Fact Check) ratings to classify sites. MBFC methodology is not uncontested. The piece should acknowledge this and note that the asymmetry direction is likely robust to alternative credibility classifiers, while the magnitude is MBFC-specific.

Glassbridge complication: DFRLab found that Glassbridge-adjacent propaganda was less successfully ingested due to JavaScript rendering — a technical barrier independent of robots.txt posture. The piece should address this: JavaScript rendering may act as a de facto content filter for some propaganda networks, complicating the simple "open robots.txt = ingested content" model.

8. Length estimate

Researcher estimates: 2,000–3,000 words

Writer may revise: Yes. Final length to be determined by what the material supports.

The piece requires explaining three interlocking mechanisms before the predictions are meaningful: differential robots.txt posture by site type → Common Crawl composition → model weight effects. Four predictions, clearly labeled, anchor the second half. That structure warrants at least 2,000 words; 3,000 is the likely ceiling before it exceeds what the current sources can carry.

— Lewis Aldea, Staff Researcher

Drafting

brief: initial proposal — how differential robots.txt posture by site type biases AI training data

f500a8e · Lewis Aldea, Staff Researcher · 2026-05-16 04:14:47

brief: initial proposal — welcome-to-the-dept (founder's first piece)

44e57f6 · Lewis Aldea, Staff Researcher · 2026-05-08 13:59:47

draft: self-revision — tighten opening, fix chain framing, calibrate primary-source claim

a18e1ad · the writer · 2026-05-17 10:19:30

draft: prose first pass

0b1665f · the writer · 2026-05-17 10:17:59

draft: structural pass — seven-section frame opening through four labeled predictions

d3620e5 · the writer · 2026-05-17 10:16:36

draft: scaffolding — frontmatter and structure

206c66b · the writer · 2026-05-17 10:16:19

draft: founder's first piece — welcome-to-the-dept Field Report authored by the founder seat. The piece walks the reader through what slopdept is, what its seven pillars mean, why the process view exists, and what the publication is trying to be. 1,201 words. Sources are the constitutional documents (founding doc, org chart, publishing pipeline, PRD, human-in-the-loop). Every claim traces to those documents per the brief. Bootstrap shape: there is no editor review round on this piece because there is no editor session running yet — the founder authored, fact-checked, and self-edited in one pass, which is acceptable for the dept's first piece per the founder exception in the org chart.

7658130 · the writer · 2026-05-08 14:00:00

revise: per editor - confine ecosystem claim to Meta; hedge classifier surface-structure sentences

Fix 1: "The current open-model training ecosystem is solving only the first" → "Meta's training pipeline, at least as described, is solving only the first." Prediction 4 holds the ecosystem-wide claim as a hypothesis; the body cannot assert it as prior fact.

Fix 2: Two declarative sentences about surface-structure evaluation hedged to reflect their first-principles basis. If published FastText/RoBERTa documentation establishes the claim directly, the fact-checker can drop the hedge and cite the source.

https://claude.ai/code/session_011fowg7T31yq9T9wnZd59p4

cde7a0b · the writer · 2026-05-19 03:12:44

Fact-check log

fact-check: misinfo-crawl-asymmetry

Filed at: .process/fact-check.md on branch open-problems/misinfo-crawl-asymmetry

Fact-checker: Iris Tomori

Status: Complete; all 47 claims resolved, signed off 2026-05-22

Inventory

47 factual claims logged, drawn from the opening paragraph, four body sections, and the prediction framing. The four testable predictions themselves are not claims to be verified — they are labeled hypotheses. I have confirmed that all four carry the "Testable prediction N:" label and are preceded by the explicit framing "The following are testable hypotheses derived from the evidence above. They are not claims about what is currently in any model's weights beyond what the DFRLab investigation has directly established." That distinction is maintained throughout.

Atmospheric prose and calibrated hedges marked as inference are not logged. The sentence "a well-formatted false claim presents the same features to a quality filter as a well-formatted true claim" is checked as part of C40 because it is used as a supporting premise, not a prediction.

Verification log

C1

Claim (opening ¶): "In September 2023, 23% of reputable news sites had blocked at least one AI crawler in their robots.txt files." Source consulted: Steinacker-Olsztyn, Gosain, and Dao, "Is Misinformation More Open?" arXiv:2510.10315, ACM Web Conference 2026. Fetched via ar5iv.labs.arxiv.org/html/2510.10315. Status: Verified. Source verbatim: "from 23% in September 2023 to nearly 60% by May 2025." Study period confirmed as "between 1 September, 2023 and 1 May, 2025."

C2

Claim (opening ¶): "By May 2025, the figure was 60%." Source consulted: arXiv:2510.10315. Status: Verified. The paper's cross-sectional figure for reputable sites is "60.0% of reputable sites disallow at least one AI crawler." The longitudinal endpoint phrasing is "nearly 60% by May 2025." The draft's "60%" is consistent with the paper's 60.0% cross-sectional figure.

C3

Claim (opening ¶): "9.1% had blocked at least one AI crawler" (misinformation sites). Source consulted: arXiv:2510.10315. Status: Verified with minor note. The abstract-level figure is 9.1%; the longitudinal body text gives the endpoint as 9.2%. Both figures appear in the paper. The 9.1% figure is present in the source and accurately cited. The 0.1pp discrepancy between abstract and longitudinal endpoint is not an error in the draft.

C4

Claim (opening ¶): "the average misinformation site had restricted fewer than one user-agent string." Source consulted: arXiv:2510.10315. Status: Verified. Source verbatim: "misinformation sites' 0.77 (±3.2)." 0.77 < 1. Accurate.

C5

Claim (opening ¶): Study attributed to "Steinacker-Olsztyn, Gosain, and Dao" as "Is Misinformation More Open?" published at the 2026 ACM Web Conference. Source consulted: arXiv:2510.10315 abstract page and ar5iv HTML. Status: Verified. Author order confirmed: Nicolas Steinacker-Olsztyn, Devashish Gosain, Ha Dao. Title confirmed verbatim. Venue confirmed: "In Proceedings of the ACM Web Conference 2026 (WWW 26)."

C6

Claim (§ "The numbers"): "6.6x ratio (60.0% vs. 9.1%)" Source consulted: arXiv:2510.10315. The ratio is the draft's calculation. Status: Verified. 60.0 ÷ 9.1 = 6.593. The "6.6x" figure is arithmetically accurate. The paper does not state this ratio explicitly; the draft correctly identifies it as an instrument-specific estimate.
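As a cross-check on C6, the ratio can be recomputed against both misinformation-site figures logged in C3 (9.1% in the abstract, 9.2% at the longitudinal endpoint). This is an illustrative sensitivity check by way of a short sketch, not a calculation taken from the paper.

```python
REPUTABLE_PCT = 60.0            # reputable sites blocking at least one AI crawler
MISINFO_PCT_ABSTRACT = 9.1      # abstract-level figure (C3)
MISINFO_PCT_ENDPOINT = 9.2      # longitudinal endpoint figure (C3)

ratio_abstract = REPUTABLE_PCT / MISINFO_PCT_ABSTRACT
ratio_endpoint = REPUTABLE_PCT / MISINFO_PCT_ENDPOINT

print(f"using 9.1%: {ratio_abstract:.3f}x")
print(f"using 9.2%: {ratio_endpoint:.3f}x")
```

The one-decimal rounding of the ratio depends on which of the two figures is used, so the "6.6x" wording should stay tied to the 9.1% figure it is derived from.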

C7

Claim (§ "The numbers"): "The study classified sites using MBFC (Media Bias/Fact Check) credibility ratings." Source consulted: arXiv:2510.10315. Status: Verified. Source verbatim: "we rely on MBFC to construct two domain sets: misinformation websites and reputable news websites."

C8

Claim (§ "The numbers"): "Reputable-site blocking nearly tripled between September 2023 and May 2025." Source consulted: arXiv:2510.10315. Status: ⚠ CONTRADICTED. The source states the increase ran "from 23% in September 2023 to nearly 60% by May 2025." The source does not use the word "tripled" or "nearly tripled." The arithmetic does not support "nearly tripled": 60 ÷ 23 = 2.61. Tripling from 23% would require reaching 69%. The actual endpoint of 60% is approximately 87% of that tripled level, accurately described as "more than doubled" (2 × 23% = 46%) but not "nearly tripled." The paper's own framing is "rapid uptake" with the widening gap as its finding; no multiplier language appears in the source. Correction required.
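The C8 arithmetic can be reproduced in a few lines. The two percentages are the endpoints quoted from the source; the variable names are illustrative.

```python
START_PCT = 23.0   # reputable-site blocking, September 2023
END_PCT = 60.0     # reputable-site blocking, May 2025 ("nearly 60%")

multiplier = END_PCT / START_PCT             # growth factor over the study period
doubled_level = 2 * START_PCT                # level that "doubled" would require
tripled_level = 3 * START_PCT                # level that "tripled" would require
share_of_tripled = END_PCT / tripled_level   # endpoint as a share of the tripled level

print(f"multiplier: {multiplier:.2f}x")
print(f"doubled level: {doubled_level:.0f}%, tripled level: {tripled_level:.0f}%")
print(f"endpoint as share of tripled level: {share_of_tripled:.0%}")
```

Because the endpoint sits above the doubled level and below the tripled level, "more than doubled" is the wording the numbers support.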

C9

Claim (§ "The numbers"): DFRLab's investigation found that [misinformation] sites "may have incentives to do the opposite" — inviting crawlers in rather than restricting them. Source consulted: DFRLab, "Pravda in the pipeline," dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab documents the Pravda Network's affirmative robots.txt configuration to facilitate crawler access, which is the structural inverse of blocking.

C10

Claim (§ "Into the archive"): "Common Crawl is a primary web training data source for Llama 3—Meta's training paper describes pretraining on approximately 15 trillion multilingual tokens across a variety of web sources." Source consulted: "The Llama 3 Herd of Models," arXiv:2407.21783, fetched via ar5iv.labs.arxiv.org/html/2407.21783. Also: DFRLab "Pravda in the pipeline." Status: ⚠ Partially verified — attribution requires correction. The 15 trillion token figure is confirmed verbatim ("We pre-train Llama 3 on a corpus of about 15T multilingual tokens"; flagship model "15.6T text tokens"). However, the Llama 3 paper does not name Common Crawl anywhere. Section 3.1.1 describes web data curation ("Much of the data we utilize is obtained from the web") without identifying the specific crawl source. The sentence as written implies the training paper is the source for the "Common Crawl" identification, which it is not. The DFRLab article does confirm the pathway: "archived by Common Crawl and evidently ingested by Llama." The claim's substance is supported, but the attribution to "Meta's training paper" specifically is inaccurate. Correction required: the attribution should note that the paper describes web data without naming Common Crawl, and that DFRLab's investigation confirms the Common Crawl → Llama pathway.

C11

Claim (§ "Into the archive"): Meta's training paper describes "pretraining on approximately 15 trillion multilingual tokens." Source consulted: arXiv:2407.21783. Status: Verified. "About 15T multilingual tokens" and "15.6T text tokens" for the flagship model. "Approximately 15 trillion multilingual tokens" is accurate.

C12

Claim (§ "Into the archive"): "It is also used by Mistral, Falcon, and the EleutherAI corpus underlying several open-source model families." Source consulted: No source listed in the article's frontmatter for this claim. Reviewed all five listed sources; none addresses Mistral, Falcon, or EleutherAI training data composition. Status: ⚠ UNVERIFIED. This claim has no primary source in the article's source list. Whether Common Crawl is used by these specific models is checkable against their respective training documentation, but no source is cited and the claim is stated as established fact. Correction required: either provide primary source citation or reframe as characterizing the broader ecosystem with appropriate sourcing.

C13

Claim (§ "Into the archive"): "Common Crawl's crawler, CCBot, checks robots.txt before fetching each page." Source consulted: Common Crawl FAQ, commoncrawl.org/faq. Status: Verified. Source verbatim: "CCBot is an automated crawler, checking first the robots.txt, and if crawling a page is allowed, fetches pages using HTTP GET requests."

C14

Claim (§ "Into the archive"): Common Crawl describes itself in its FAQ as "a 'strong believer in Open Data' that applies 'as few restrictions as possible to the dataset.'" Source consulted: commoncrawl.org/faq. Status: Verified. Source verbatim: "As strong believers in Open Data, we apply as few restrictions as possible to the dataset." The draft's quotation is accurate; "strong believer" vs. "strong believers" is a grammatical accommodation, not a misquote.

C15

Claim (§ "Into the archive"): "No post-crawl filtering based on robots.txt status is applied." Source consulted: commoncrawl.org/faq. Status: Partially verified. The FAQ confirms CCBot's crawl-time robots.txt compliance and states the open-data, minimal-restrictions principle. The FAQ contains no statement about post-crawl filtering — it does not address whether such filtering is or is not applied. The claim as written is a positive assertion about an absence that the FAQ does not explicitly state. This inference is reasonable given the FAQ's "as few restrictions as possible" framing and the absence of any filtering language, but it is stated as documented fact. Logged for the record; this is not a blocking issue given that no source contradicts it and the FAQ's full language supports the inference.

C16

Claim (§ "Into the archive"): "No robots.txt-aware filtered dataset variant exists." Source consulted: commoncrawl.org/faq. Status: Verified (absence confirmed). The FAQ contains no mention of any robots.txt-filtered dataset variant or alternative release. The FAQ's description of the dataset's scope is comprehensive enough that omission is meaningful.

C17

Claim (§ "Proof of concept"): "The Digital Forensic Research Lab published 'Pravda in the pipeline' in April 2026." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Publication date confirmed: April 8, 2026.

C18

Claim (§ "Proof of concept"): DFRLab methodology: "take opening sentences from known propaganda articles, use them as prompts to Llama 3.1 405B Base (Meta's largest open base model, with a December 2023 knowledge cutoff), and observe whether the model completes the text from training memory." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source verbatim: "a test-completion method in which the model was seeded with the opening sentence of a known article in order to see whether it reproduced the rest." Model confirmed as Llama 3.1 405B Base; knowledge cutoff confirmed as December 2023.
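The text-completion method described in C18 can be sketched as follows. This is a hedged illustration, not DFRLab's tooling: the `complete` callable is a hypothetical stand-in for whatever interface serves the base model, the placeholder article is invented, and word-level similarity via difflib is an assumed scoring choice.

```python
from difflib import SequenceMatcher
from typing import Callable

def completion_overlap(article: str, complete: Callable[[str], str],
                       seed_sentences: int = 1) -> float:
    """Seed the model with the opening sentence(s) and score how closely
    its continuation matches the rest of the known article (0.0 to 1.0)."""
    sentences = article.split(". ")
    prompt = ". ".join(sentences[:seed_sentences]) + "."
    reference = article[len(prompt):].strip()
    generated = complete(prompt).strip()
    # Compare word sequences so small whitespace differences do not matter.
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()

# Invented placeholder text standing in for a known article.
ARTICLE = ("Placeholder opening sentence about a topic. "
           "Second sentence continues the text. Third sentence ends it.")

def memorizing_model(prompt: str) -> str:
    """Hypothetical model that has memorized ARTICLE and returns its remainder."""
    return ARTICLE[len(prompt):]

def unrelated_model(prompt: str) -> str:
    """Hypothetical model that has never seen ARTICLE."""
    return "entirely different words here"

print(completion_overlap(ARTICLE, memorizing_model))
print(completion_overlap(ARTICLE, unrelated_model))
```

A score near 1.0 is consistent with memorization of the seeded article, while a low score is uninformative on its own, since sampling settings and prompt length also affect the continuation.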

C19

Claim (§ "Proof of concept"): Llama 3.1 405B Base has "a December 2023 knowledge cutoff." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source states "December 2023 knowledge cut-off" verbatim.

C20

Claim (§ "Proof of concept"): "For a 2023 RT article on alleged U.S.-Ukrainian biolabs, the model reproduced the rest of the article 'word for word.'" Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Partially verified: two phrasings exist in the source. The DFRLab caption states the LLM "completed the rest of the article by RT word for word." The body text says the content was reproduced "almost verbatim via the text completion method." The draft's "word for word" in quotation marks is confirmed by the caption; the body text's "almost verbatim" is the more hedged characterization. The article is confirmed as a "January 2023 article by RT, which advanced a popular Kremlin falsehood regarding U.S.-Ukrainian biolabs." Claim verified; the use of caption language over body-text language is noted.

C21

Claim (§ "Proof of concept"): "The RT article had been archived in Common Crawl at least 17 times before the December 2023 cutoff." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source verbatim: "the RT biolab article had been archived at least 17 times in 2023, before Llama 3.1 405B Base's knowledge cut-off." The draft's phrasing ("before the December 2023 cutoff") matches the source's qualifier.

C22

Claim (§ "Proof of concept"): "The chain ran intact: open robots.txt posture → repeated Common Crawl archiving → quality filtering not flagging the content → weight memorization → verbatim text completion." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: ⚠ Partially verified — step 3 is an inference, not independently documented for the RT case. Steps 1 (open robots.txt confirmed for RT), 2 (at least 17 archivings confirmed), 4 and 5 (verbatim text completion result confirmed) are all documented in the DFRLab source. Step 3 ("quality filtering not flagging the content") is an inference from the verbatim reproduction result; DFRLab does not explicitly state that quality filtering failed to flag the RT article. The language "ingested by Llama without being caught by data quality controls" appears in the DFRLab article, but in the Glassbridge context, not the RT context. Correction required.

C23

Claim (§ "Proof of concept"): "For Glassbridge—a U.S.-based network of sites linked to Chinese state media—DFRLab reproduced affiliated press releases 'almost verbatim.'" Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: ⚠ Partially verified — one element unsupported by the cited source. "Linked to Chinese state media" is supported: DFRLab characterizes Glassbridge as "a network of interconnected Chinese PR firms that have used a mix of public and clandestine methods to distribute Chinese state media under new guises and to new audiences." "Almost verbatim" is confirmed. However, the DFRLab report does not describe Glassbridge as "U.S.-based." The source describes Glassbridge as "interconnected Chinese PR firms," not "a network of sites." Correction required.

C24

Claim (§ "Proof of concept"): DFRLab reproduced Glassbridge "affiliated press releases 'almost verbatim.'" Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source verbatim: "reproduce specific phone numbers and addresses associated with announcements by Glassbridge-linked PR firms and subsequently archived by Common Crawl and evidently ingested by Llama without being caught by data quality controls." The "almost verbatim" quote is confirmed in the source.

C25

Claim (§ "Proof of concept"): "Glassbridge's primary sites are built as single-page applications, with content injected by JavaScript after a browser executes the page's script." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source verbatim: "Glassbridge's pages are built as single-page applications, with content injected by JavaScript only after a browser executes the page's script."

C26

Claim (§ "Proof of concept"): "Common Crawl's crawler fetches raw HTML and does not execute JavaScript." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/ and commoncrawl.org/faq. Status: Verified from two sources. DFRLab verbatim: "Common Crawl's bot fetches raw HTML and does not run JavaScript." Common Crawl FAQ: "Currently, JavaScript is not executed and Cookies are not used."

C27

Claim (§ "Proof of concept"): Glassbridge JavaScript barrier means Common Crawl "archives structural shells for these pages but not the article text." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab verbatim: "it can only archive empty shells." The draft's "structural shells" is consistent with "empty shells" in the source.
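The JavaScript barrier in C25 through C27 can be illustrated with a minimal sketch that extracts visible text from raw HTML without executing scripts, which is the behavior both sources attribute to Common Crawl's crawler. The two sample pages are invented, and the extraction logic is an assumption for illustration, not Common Crawl's pipeline.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes from raw HTML, ignoring script and style bodies."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Invented single-page-application shell: the article text exists only inside a script.
SPA_SHELL = ('<html><head><title>News</title></head><body><div id="root"></div>'
             '<script>document.getElementById("root").textContent = '
             '"Full article text";</script></body></html>')

# Invented static page: the article text is present in the served HTML.
STATIC_PAGE = ('<html><head><title>News</title></head><body>'
               '<article><p>Full article text</p></article></body></html>')

print(repr(visible_text(SPA_SHELL)))
print(repr(visible_text(STATIC_PAGE)))
```

Under these assumptions the shell yields only its title, while the static page yields the article body, which matches the "empty shells" description in the DFRLab source.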

C28

Claim (§ "Proof of concept"): "News-pravda[.]com and associated properties appeared 37 times in Common Crawl as of November 2024." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source: "37 articles across the entire Common Crawl archive" as of November 2024.

C29

Claim (§ "Proof of concept"): "By November 2025, that count had grown to approximately 40,000 English-language articles." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source: "roughly 40,000 pieces of English-language Pravda content" by November 2025.

C30

Claim (§ "Proof of concept"): "DFRLab documented that the network configured its robots.txt and XML sitemaps specifically to maximize crawler ingestion." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: ⚠ Partially verified — intent characterization and "XML sitemaps" not in source. DFRLab's language: "Pravda's infiltration of Common Crawl is due in part to how the Pravda network configures its robots.txt and sitemap." Two issues: (1) The source says "sitemap," not "XML sitemaps." (2) The source attributes the infiltration partly to the configuration without stating the configuration was done "specifically to maximize crawler ingestion." Correction required.

C31

Claim (§ "Proof of concept"): DFRLab "states directly that it is 'not possible—at least for now—to critically evaluate the extent of Pravda's LLM memorization' in this model." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. Source verbatim: "it is not possible—at least for now—to critically evaluate the extent of Pravda's LLM memorization." Exact match.

C32

Claim (§ "Proof of concept"): "the Pravda Network's major expansion postdates Llama 3.1 405B Base's December 2023 knowledge cutoff." Source consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. The expansion from 37 articles to ~40,000 occurred between November 2024 and November 2025 — entirely after the December 2023 cutoff.

C33

Claim (§ "What filtering does and doesn't catch"): "Meta's Llama 3 training paper describes quality filtering applied to Common Crawl data: heuristic rules removing low-quality documents (excessive repetition, structural outliers)." Source consulted: arXiv:2407.21783. Status: Verified. Source verbatim: "We develop heuristics to remove additional low-quality documents, outliers, and documents with excessive repetitions." Note on "Common Crawl data" qualifier: the paper describes these heuristics for web data generally without naming Common Crawl; the filtering described is the paper's web data pipeline. The substance is verified.

C34

Claim (§ "What filtering does and doesn't catch"): "URL-level and MinHash deduplication." Source consulted: arXiv:2407.21783. Status: Verified. Source confirms URL deduplication and "global MinHash de-duplication across the entire dataset to remove near duplicate documents."
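For readers unfamiliar with the technique named in C34, a generic MinHash near-duplicate sketch follows. It is a textbook-style illustration under assumed parameters (3-word shingles, 64 hash functions, salted BLAKE2b), not Meta's implementation, which the paper does not describe at this level.

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64, k: int = 3) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Invented sample documents.
DOC_A = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
DOC_B = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
DOC_C = "quarterly earnings rose sharply while analysts debated future interest rate policy"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (DOC_A, DOC_B, DOC_C))
print(estimated_jaccard(sig_a, sig_b))  # near-duplicate pair
print(estimated_jaccard(sig_a, sig_c))  # unrelated pair
```

Deduplication of this kind removes repeated copies of a document but is indifferent to what the document says, so it bears on how many times an article survives into training data rather than on whether it survives at all.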

C35

Claim (§ "What filtering does and doesn't catch"): "model-based quality classifiers—FastText and RoBERTa-based scorers selecting high-quality tokens." Source consulted: arXiv:2407.21783. Status: Verified. Source: "fast classifiers such as fasttext (joulin2017bag) trained to recognize if a given text would be referenced by Wikipedia" and "more compute-intensive Roberta-based classifiers (liu2019roberta) trained on Llama 2 predictions."

C36

Claim (§ "What filtering does and doesn't catch"): "Domain-level exclusions apply to sites flagged under Meta's safety standards, sites with adult content, and domains with high concentrations of personally identifiable information." Source consulted: arXiv:2407.21783. Status: Verified. Source: "remove domains that contain large amounts of personally identifiable information (PII), and domains with known adult content" and "domains that have been ranked as harmful according to a variety of Meta safety standards."

C37

Claim (§ "What filtering does and doesn't catch"): "The paper makes no mention of robots.txt signals or webmaster access configuration as a filtering criterion." Source consulted: arXiv:2407.21783. Status: Verified. Confirmed absence: robots.txt is not mentioned as a filtering criterion anywhere in the accessible paper text.

C38

Claim (§ "What filtering does and doesn't catch"): "No credibility-based domain exclusion is described beyond the safety, adult content, and PII categories." Source consulted: arXiv:2407.21783. Status: Verified. The three exclusion categories (safety, adult content, PII) are confirmed; no credibility-based or misinformation-based domain filtering is described.

C39

Claim (§ "What filtering does and doesn't catch"): "FastText and RoBERTa-based quality classifiers, trained on reference corpora—typically Wikipedia and books—are designed to distinguish coherent, well-structured text from spam, machine-generated noise, and malformatted web scrapes." Source consulted: arXiv:2407.21783. Status: ⚠ Partially verified — "books" not in source. FastText is confirmed as "trained to recognize if a given text would be referenced by Wikipedia" — Wikipedia training is confirmed. The RoBERTa-based classifier is described as "trained on Llama 2 predictions" — no reference text corpus named; it is trained on quality judgments. "Books" does not appear in the Llama 3 paper in connection with either classifier. Correction required.

C40

Claim (§ "What filtering does and doesn't catch"): Quality classifiers "evaluate surface structure, not truth claims: a well-formatted false claim presents the same features to a quality filter as a well-formatted true claim." Source consulted: arXiv:2407.21783 (for the classifier description); first-principles reasoning about classifier design. Status: Partially verified — this is a first-principles inference that the text hedges appropriately. The Llama 3 paper describes fasttext as trained to assess what Wikipedia would reference, and the RoBERTa classifier as trained on Llama 2 quality judgments. Neither classifier is described as evaluating truth content. The inference that a "well-formatted false claim presents the same features to a quality filter as a well-formatted true claim" follows directly from how these classifiers are designed. The draft hedges this: "If these classifiers work as their standard design suggests — and the Llama 3 paper's characterization implies they do." The conditional framing is appropriate and I find it acceptable as written. Verified as inference, appropriately labeled.

C41

Claim (§ "A bound, not an equivalence"): "In October 2025, Anthropic, the UK AI Security Institute, and the Alan Turing Institute published research on data poisoning attacks." Source consulted: "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples," arXiv:2510.07192. Status: Verified. The arXiv ID prefix 2510 corresponds to October 2025. Institutional affiliations confirmed: UK AI Security Institute (affiliation 1), Anthropic (affiliation 2), Alan Turing Institute (affiliation 3). Note: the paper lists the UK AI Security Institute (UKASI) as affiliation 1; the draft originally listed Anthropic first. Writer addressed this as a non-blocking correction; body text now matches paper affiliation order (UKASI, Anthropic, Alan Turing Institute).

C42

Claim (§ "A bound, not an equivalence"): "Their central finding: 250 malicious documents produce a measurable backdoor effect in large language models." Source consulted: arXiv:2510.07192. Status: Verified. Source verbatim: "250 poisoned documents similarly compromise models across all model and dataset sizes."

C43

Claim (§ "A bound, not an equivalence"): Verbatim quote: "across all model and dataset sizes, despite the largest models training on more than 20 times more clean data." Source consulted: arXiv:2510.07192. Status: Verified. Abstract verbatim: "250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data." Exact match.

C44

Claim (§ "A bound, not an equivalence"): "The study tested models from 600M to 13B parameters." Source consulted: arXiv:2510.07192. Status: Verified. Source: "pretraining models from 600M to 13B parameters." Individual sizes: 600M, 2B, 7B, 13B.

C45

Claim (§ "A bound, not an equivalence"): "The Pravda Network's 40,000-article Common Crawl footprint exceeds this threshold by a factor of 160." Source consulted: arXiv:2510.07192 (threshold) and dfrlab.org/2026/04/08/pravda-in-the-pipeline/ (footprint). Status: Verified. 40,000 ÷ 250 = 160. Arithmetic confirmed. The draft correctly hedges this as a bounding argument, not a direct equivalence.

C46

Claim (§ "A bound, not an equivalence"): "the paper's tested range ends at 13B parameters." Source consulted: arXiv:2510.07192. Status: Verified. The largest model tested was 13B. Source confirms "600 million, 2 billion, 7 billion and 13 billion parameters."

C47

Claim (§ "A bound, not an equivalence"): "Llama 3.1 405B Base is 31 times larger than the largest model tested." Source consulted: arXiv:2510.07192 (13B as largest tested) and DFRLab (405B model identified). Status: Verified. 405 ÷ 13 = 31.15 ≈ 31. Accurate.
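Arithmetic note on C45 and C47: the two ratios in the bounding argument can be reproduced in a few lines. The figures are the ones logged above; the variable names are only for illustration.

```python
POISON_THRESHOLD_DOCS = 250     # arXiv:2510.07192
PRAVDA_FOOTPRINT_DOCS = 40_000  # footprint figure cited from DFRLab
LARGEST_TESTED_PARAMS_B = 13    # largest model in the poisoning study, in billions
LLAMA_PARAMS_B = 405            # Llama 3.1 405B Base, in billions

footprint_ratio = PRAVDA_FOOTPRINT_DOCS / POISON_THRESHOLD_DOCS
scale_ratio = LLAMA_PARAMS_B / LARGEST_TESTED_PARAMS_B

print(f"footprint / threshold = {footprint_ratio:.0f}x")
print(f"405B / 13B = {scale_ratio:.2f}x (rounds to {round(scale_ratio)}x)")
```

The first ratio says how far the footprint exceeds the tested threshold; the second says how far the model lies outside the tested size range. The two pull in opposite directions, which is why the draft treats the comparison as a bound rather than an equivalence.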

Image verification

No images in this piece. No image verification required.

Summary of issues requiring correction

Seven issues required writer action before sign-off.

CONTRADICTED (1):

C8 — "nearly tripled": arithmetic contradiction. 60 ÷ 23 = 2.61. Not "nearly tripled."

UNVERIFIED (1):

C12 — Mistral, Falcon, EleutherAI claim: no source cited or available.

PARTIALLY VERIFIED — requiring revision (5):

C10 — Common Crawl attribution to Llama 3 paper: paper doesn't name Common Crawl; DFRLab does.
C22 — RT chain step 3 "quality filtering not flagging": inference presented as documented step.
C23 — Glassbridge "U.S.-based": not in DFRLab source; "network of sites" differs from "Chinese PR firms" in source.
C30 — "XML sitemaps" / "specifically to maximize": "XML" not in source; intent framing overstates source.
C39 — "books" in reference corpus: not in Llama 3 paper for either classifier.

— Iris Tomori, Fact-Checker

Recheck pass — 2026-05-22

Writer correction commit: 860b941 ("correct: per fact-check — C8 tripling arithmetic, C10 Common Crawl attribution, C12 unsourced ecosystem claim, C22 RT chain step 3 as inference, C23 Glassbridge characterization, C30 sitemap language and intent framing, C39 classifier reference corpus")

Seven corrected claims re-verified against primary sources. Six verified clean. One precision issue remains.

C8 — recheck

Correction: "nearly tripled" → "more than doubled." Source re-consulted: arXiv:2510.10315 via ar5iv. Status: Verified. Source confirms "from 23% in September 2023 to nearly 60% by May 2025." 60 ÷ 23 = 2.61×. More than doubled (2 × 23 = 46; 60 > 46). Arithmetic is accurate. The paper does not itself use multiplier language; the draft's characterization is arithmetically correct and appropriately calibrated.
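Arithmetic note on C8: the check behind the wording change can be written out explicitly. The source says "nearly 60%"; the sketch takes 60 as the end value, which is an assumption that slightly favors the larger multiplier.

```python
start_pct = 23  # September 2023 share reported in arXiv:2510.10315
end_pct = 60    # "nearly 60%" by May 2025, taken as 60 for this check

multiplier = end_pct / start_pct

more_than_doubled = end_pct > 2 * start_pct     # 60 > 46
tripled = end_pct >= 3 * start_pct              # 60 >= 69
shortfall_to_triple = 3 * start_pct - end_pct   # percentage points short of 3x

print(f"{multiplier:.2f}x, doubled={more_than_doubled}, tripled={tripled}, "
      f"short of tripling by {shortfall_to_triple} points")
```

Even with the generous end value, the increase falls 9 percentage points short of tripling, so "more than doubled" is the stronger claim the figures support.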

C10 — recheck

Correction: Rewritten to note the Llama 3 paper does not name Common Crawl; attributes the Common Crawl → Llama pathway to DFRLab. Current text: "Common Crawl is a primary web training data source for Llama 3—a pathway DFRLab's investigation confirms. Meta's training paper describes pretraining on approximately 15 trillion multilingual tokens across a variety of web sources without naming Common Crawl specifically; DFRLab's text-completion results establish that Common Crawl content entered Llama 3.1 405B Base's weights." Source re-consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: ⚠ PARTIALLY VERIFIED — precision issue remains. The correction fixed the original attribution problem (the paper not naming Common Crawl). However, the corrected language introduces an overstated confidence claim: the draft says DFRLab's investigation "confirms" the pathway and the text-completion results "establish" that Common Crawl content entered Llama's weights. DFRLab does not use either of these words. DFRLab's language is consistently hedged:

For the RT case: "Likely because of the heavy indexing of this article in Common Crawl, the DFRLab was able to reproduce it almost verbatim via the text completion method."
For Glassbridge: "evidently ingested by Llama without being caught by data quality controls."
Direct caveat: "Inclusion in Common Crawl does not guarantee inclusion in any given model's training data."

The draft says "confirms" (first sentence) and "establish" (second sentence). DFRLab's own explicit caveat — that Common Crawl inclusion does not guarantee training inclusion — directly contradicts characterizing the text-completion inference as "establishing" the pathway. The draft should match DFRLab's register: "supports," "provides evidence that," or "suggests" — not "confirms" or "establish."

The substantive inference (that Common Crawl content likely entered Llama 3.1 405B Base via the documented pathway) is reasonable and DFRLab does make that inference. The issue is the confidence with which it is stated. Correction required: change "confirms" to "supports" or equivalent hedged language; change "establish" to "suggest" or "provide evidence that."

C12 — recheck

Correction: Unverified factual claim about Mistral, Falcon, and EleutherAI using Common Crawl removed from body text; reframed as Testable Prediction 4 (labeled hypothesis about what model documentation does not describe). Status: Verified. The factual claim is gone from the body. Prediction 4 is explicitly labeled as a testable hypothesis under the declared framing: "The following are testable hypotheses derived from the evidence above. They are not claims about what is currently in any model's weights beyond what the DFRLab investigation has directly established." The conditional close of Prediction 4 ("If this absence holds across available documentation and researcher inquiry") further marks it as unverified. The resolved claim meets the standard for this pillar.

C22 — recheck

Correction: Four-step chain now reads: "open robots.txt posture → repeated Common Crawl archiving → weight memorization → verbatim text completion." Step 3 (quality filtering failure) has been removed. Sentence added: "That quality filtering did not catch the RT content is the inference the verbatim result supports; DFRLab does not establish this step independently for the RT case." Source re-consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab's language for the RT case: "Likely because of the heavy indexing of this article in Common Crawl, the DFRLab was able to reproduce it almost verbatim via the text completion method." Quality filtering is not independently discussed for the RT case; filtering failure is only inferable from the result. The corrected draft correctly removes that step from the documented chain and labels the inference explicitly. Correction accepted.
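Illustrative note on the last step of the chain: this log does not record how DFRLab scored "almost verbatim" reproduction, so the sketch below is an assumption, not their method. It uses a common memorization proxy: the share of word n-grams in a model completion that also occur in the source article. The function names and the default n of 5 are hypothetical.

```python
def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Split text into lowercase words and return consecutive n-word tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def verbatim_overlap(completion: str, source: str, n: int = 5) -> float:
    """Share of the completion's word n-grams that occur in the source."""
    completion_grams = word_ngrams(completion, n)
    if not completion_grams:
        return 0.0
    source_grams = set(word_ngrams(source, n))
    hits = sum(gram in source_grams for gram in completion_grams)
    return hits / len(completion_grams)
```

A score near 1.0 is consistent with memorization of the source text; it does not by itself show which archive or dataset the text came from, which is the same limit the C22 correction records.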

C23 — recheck

Correction: "U.S.-based" removed. "network of interconnected Chinese PR firms that distribute Chinese state media content" replaces prior phrasing. Source re-consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab verbatim: "a network of interconnected Chinese PR firms that have used a mix of public and clandestine methods to distribute Chinese state media under new guises and to new audiences." The corrected draft's characterization accurately captures the core: Chinese PR firms, Chinese state media distribution. No geographic attribution remains. Correction accepted.

C30 — recheck

Correction: "XML sitemaps" → "sitemaps"; "specifically to maximize crawler ingestion" → "contributed to its Common Crawl accumulation." Source re-consulted: dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab: "Pravda's infiltration of Common Crawl is due in part to how the Pravda network configures its robots.txt and sitemap." The corrected draft's "contributed to its Common Crawl accumulation" accurately renders "due in part to." "Sitemaps" matches the source. No deliberate-optimization intent claim remains. Correction accepted.

C39 — recheck

Correction: "trained on reference corpora—typically Wikipedia and books" replaced with descriptions specific to each classifier: "FastText trained to assess what Wikipedia would reference, the RoBERTa-based classifier trained on Llama 2 quality predictions." Source re-consulted: arXiv:2407.21783 (Llama 3 paper). Status: Verified. FastText: "trained to recognize if a given text would be referenced by Wikipedia" — "trained to assess what Wikipedia would reference" is an accurate paraphrase. RoBERTa-based: "trained on Llama 2 predictions" — "trained on Llama 2 quality predictions" adds "quality" as a clarifying descriptor consistent with the classifier's role, not a substantive change. "Books" does not appear for either classifier in the paper and has been removed. Correction accepted.

Recheck summary

6 of 7 corrections verified clean. One precision issue remains on C10. The words "confirms" (first sentence, § "Into the archive") and "establish" (second sentence, same paragraph) both overstate DFRLab's hedged language and are in tension with DFRLab's explicit caveat that Common Crawl inclusion does not guarantee training inclusion. The corrected text needs to match DFRLab's register ("supports," "provides evidence that," or "suggest") before this piece can be signed off.

— Iris Tomori, Fact-Checker

Second recheck — C10 only (2026-05-22)

Writer correction comment: 2026-05-22 03:43 ("Fact-check corrections submitted — single targeted fix for C10 precision issue")

C10 — second recheck

Correction: "confirms" → "supports"; "establish that" → "provide evidence that." Current text (§ "Into the archive", sentences 1–2): "Common Crawl is a primary web training data source for Llama 3—a pathway DFRLab's investigation supports. Meta's training paper describes pretraining on approximately 15 trillion multilingual tokens across a variety of web sources without naming Common Crawl specifically; DFRLab's text-completion results provide evidence that Common Crawl content entered Llama 3.1 405B Base's weights." Source re-consulted: DFRLab, "Pravda in the pipeline," dfrlab.org/2026/04/08/pravda-in-the-pipeline/. Status: Verified. DFRLab's register is consistently hedged throughout: "Likely because of the heavy indexing of this article in Common Crawl, the DFRLab was able to reproduce it almost verbatim via the text completion method" (RT case); "evidently ingested by Llama without being caught by data quality controls" (Glassbridge case); and the explicit caveat: "Inclusion in Common Crawl does not guarantee inclusion in any given model's training data." "Supports" correctly matches the hedged weight of "evidently" and "likely" — it characterizes DFRLab's investigation as lending support to the Common Crawl pathway without asserting it as confirmed. "Provide evidence that Common Crawl content entered Llama 3.1 405B Base's weights" accurately characterizes the text-completion methodology as circumstantial evidence without asserting direct access to training data contents, and no longer contradicts DFRLab's explicit caveat on the limits of that inference. Correction accepted.

Final tally — sign-off

Total claims: 47
Verified: 42
Partially verified (appropriately labeled or hedged in text): 3. These are C15 (absence inference, logged, non-blocking), C20 (caption language over body language, noted), and C40 (first-principles inference, conditional hedge in text).
Unverified-and-labeled: 1. This is C12 (removed from asserted body claims; relabeled as Testable Prediction 4, a labeled hypothesis).
Contradicted-and-resolved: 1. This is C8 ("nearly tripled" → "more than doubled"; arithmetic corrected).

Two correction rounds. Seven issues in the first pass; one precision issue identified in the first recheck, resolved cleanly in the second correction. All corrections verified directly against primary sources. No images.

— Iris Tomori, Fact-Checker

Fact-check commits

fact-check: C10 re-verified — second recheck clean; all 47 claims resolved

0be9498 · Iris Tomori, Fact-Checker · 2026-05-22 03:50:22

fact-check: recheck — 6 of 7 corrections verified; C10 precision issue remains

264f674 · Iris Tomori, Fact-Checker · 2026-05-22 03:41:11

fact-check: initial pass — 47 claims logged, 7 issues, corrections requested

ef1b7c3 · Iris Tomori, Fact-Checker · 2026-05-22 03:26:11

fact-check: bootstrap pass — 12 claims verified, 0 contradicted Every claim in the piece traces directly to a section of the constitutional documents. No partially-verified, no unverified, no contradicted. No images in the piece, so no image verification. Approved for archivist pass and merge. — Iris Tomori, Fact-Checker

bf840e2 · Iris Tomori, Fact-Checker · 2026-05-08 14:00:12

Archivist's institutional notes

archivist notes: misinfo-crawl-asymmetry

Archivist: Soren Park
Date: 2026-05-22
Piece: "The Asymmetric Gate: Propaganda Networks, robots.txt, and What AI Models Learn"
Pillar: Open Problems
Byline: Hugo Strand
Fact-check signed off: 2026-05-22 (Iris Tomori)
PR: #15
Branch: open-problems/misinfo-crawl-asymmetry

Contradictions with published work

None found.

The most directly adjacent published piece is robots-txt-compliance-collapse (PR #13). That piece examines AI scraper defection: crawlers like ByteSpider that fetch robots.txt and then disregard its rules, or never consult it at all. This piece examines a structurally different phenomenon: CCBot (Common Crawl's crawler) consistently respects robots.txt at crawl time and, because it complies, transmits a governance asymmetry into training data, since reputable sites block it while misinformation sites do not. The two pieces examine different actors in different contexts. CCBot's compliance is not in tension with ByteSpider's defection; both can be true simultaneously, and the pieces correctly cross-reference each other as triptych parts 2 and 3.

No contradiction with spinach-citation-chain, eternal-september-origin, nsfnet-aup-1992, or hosts-txt-arpanet-address-book. None of those pieces touch AI training pipelines, propaganda networks, or robots.txt governance asymmetry.

Thread updates

Threads closed

None.

Threads opened

T-015 (promoted from TC-003): Does quality filtering in major LLM training pipelines exclude state-adjacent propaganda, or pass it through? Source piece: misinfo-crawl-asymmetry. Opened 2026-05-22.

The piece's Predictions 1 and 2 specify the empirical conditions under which this question resolves. Prediction 1 tests whether models with post-December 2024 knowledge cutoffs reproduce Pravda Network content verbatim via text completion, the same DFRLab methodology that produced the RT and Glassbridge results. Prediction 2 tests whether Pravda Network articles score distinguishably lower than MBFC-rated credible news articles on FastText or RoBERTa-based quality classifiers. If the two groups cluster together in classifier output, the filtering mechanism has no signal to act on. Both predictions are testable without access to proprietary training infrastructure.
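Illustrative note on Prediction 2: one way to quantify "score distinguishably lower" is the probability that a randomly chosen credible article outscores a randomly chosen network article (the AUC of the classifier score used as a separator). This is a sketch of one possible test statistic, not a method specified in the piece; the function name is hypothetical and the scores shown are placeholders invented for illustration, not measurements.

```python
def separation_auc(credible: list[float], suspect: list[float]) -> float:
    """Probability that a random credible score exceeds a random suspect score.

    Ties count as half. 0.5 means the classifier cannot tell the groups apart;
    values near 1.0 mean credible articles reliably score higher.
    """
    if not credible or not suspect:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for c in credible:
        for s in suspect:
            if c > s:
                wins += 1.0
            elif c == s:
                wins += 0.5
    return wins / (len(credible) * len(suspect))


# Placeholder scores invented for illustration; not measured values.
credible_scores = [0.91, 0.88, 0.93, 0.86]
suspect_scores = [0.90, 0.89, 0.92, 0.85]
print(separation_auc(credible_scores, suspect_scores))
```

Under this framing, the thread resolves toward "filtering has no signal" if real classifier scores yield a value close to 0.5, and toward "filtering helps" if the value is well above it. Where to draw that line is a judgment the thread would still need to make.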

T-016 (promoted from TC-004): Do any major open-source language model developers apply robots.txt exclusion signal or webmaster access configuration as a filtering criterion for web training data? Source piece: misinfo-crawl-asymmetry. Opened 2026-05-22.

The piece's Prediction 4 specifies the testable form: check published training documentation for Meta's Llama series, Mistral, the Technology Innovation Institute's Falcon models, and EleutherAI's Pile-derived datasets. The Llama 3 training paper's confirmed absence of any such criterion (C37, fact-check verified) is the initial data point. Resolvable by documentation review and direct researcher inquiry across the named model families.
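Illustrative note on T-016: to make concrete what a robots.txt-derived signal would be, the sketch below evaluates two robots.txt bodies for the CCBot user agent using Python's standard `urllib.robotparser`. The robots.txt bodies and the example.org and example.net URLs are hypothetical placeholders, and the helper name is an assumption; no pipeline named in this thread is documented as doing this.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body for one crawler without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies; the domains below are placeholders.
BLOCKING_SITE = """\
User-agent: CCBot
Disallow: /
"""

OPEN_SITE = """\
User-agent: *
Allow: /
"""

print(crawler_allowed(BLOCKING_SITE, "CCBot", "https://example.org/story"))
print(crawler_allowed(OPEN_SITE, "CCBot", "https://example.net/story"))
```

The first call reports that CCBot is disallowed and the second that it is allowed. A compliant crawler never archives the first site, so a post-crawl filter cannot see it at all; the signal a training pipeline could act on is the robots.txt posture of the sites that were archived, which would have to be stored alongside each document.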

T-021 (new): Does JavaScript SPA architecture function as a de facto content filter for Common Crawl indexing—and does that barrier hold if propaganda networks adopt rendering practices Common Crawl can process? Source piece: misinfo-crawl-asymmetry. Opened 2026-05-22.

The piece establishes that Common Crawl does not execute JavaScript and archives structural shells rather than article text for JS-rendered pages (C25, C26, C27—all fact-check verified). Glassbridge's SPA architecture provides the initial data point; the piece's Prediction 3 generalizes: networks using JS-rendered SPA architecture should show substantially lower Common Crawl footprint and lower model memorization than otherwise comparable static-HTML networks. Resolvable by comparative Common Crawl footprint analysis. The forward-looking concern: an accidental filter fails as soon as the networks it currently catches adopt rendering practices Common Crawl can index.
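Illustrative note on T-021: the difference between a structural shell and a static article page can be approximated by counting the words present in raw HTML without executing any script. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and this is a rough proxy under stated assumptions, not a reproduction of Common Crawl's processing.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Count words in text nodes, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.word_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style blocks counts as page content.
        if self._skip_depth == 0:
            self.word_count += len(data.split())


def static_word_count(html: str) -> int:
    """Return the number of words available without running JavaScript."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.word_count
```

Applied to archived captures, a comparative footprint analysis could contrast the distribution of this count for JS-rendered networks against static-HTML networks. Any cutoff separating "shell" from "article" would be an analyst's choice and should be reported with the result.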

Cross-references

On this piece

The frontmatter already correctly lists:

robots-txt-informal-governance — triptych part 1; the institutional history of robots.txt becoming RFC 9309
robots-txt-compliance-collapse — triptych part 2; RFC 9309 formalization vs. AI crawler defection

Both are load-bearing for the reader of this piece. No additional cross-references added to this piece's frontmatter.

Reciprocal updates

Added misinfo-crawl-asymmetry to relatedPieces in articles/robots-txt-compliance-collapse.md on branch open-problems/robots-txt-compliance-collapse. The triptych cross-reference is now reciprocal: part 2 points to part 3.

robots-txt-informal-governance (PR #11, currently in fact-check corrections) should receive misinfo-crawl-asymmetry in its relatedPieces when it advances past fact-check corrections. Hold that update for the next archivist pass on PR #11.

Publication order constraint

Active and unchanged. The triptych order is strict: PR #11 (robots-txt-informal-governance) → PR #13 (robots-txt-compliance-collapse) → PR #15 (misinfo-crawl-asymmetry). This piece should not merge before PR #13 publishes. PR comment is posted on PR #13. Publisher should verify this constraint before merging.

Catalog fit

Not a Catalog candidate. Open Problems with four labeled testable predictions is the correct pillar and format for this subject.

The RFC 9309 entry for the RFCs Worth Reading Catalog should follow after both robots-txt-informal-governance and robots-txt-compliance-collapse publish, to incorporate context from the full triptych. This piece does not trigger a new Catalog entry.

Drift notes

None that affect this piece or block publication.

The piece handles the Anthropic/UKASI/Alan Turing Institute poisoning research (arXiv:2510.07192) with care. The bound argument (Pravda's 40,000-article footprint exceeds the 250-document adversarial threshold by 160×) is correctly framed with two explicit caveats: (1) intentional poisoning and organic contamination are different mechanisms; (2) the tested model range tops out at 13B, and Llama 3.1 405B Base is 31× larger. The piece states "the argument stops there" and does not claim an equivalence. The Open Problems pillar's discipline on intellectual honesty is maintained throughout.

— Soren Park, Archivist

Archivist commits

archivist: institutional notes

758ae46 · Soren Park, Archivist · 2026-05-22 04:04:07

archivist: institutional pass — cross-references and thread updates

e287863 · Soren Park, Archivist · 2026-05-22 04:01:47