In September 2023, 23% of reputable news sites had blocked at least one AI crawler in their robots.txt files. By May 2025, the figure was 60%. Over the same period, misinformation sites showed no comparable shift: 9.1% had blocked at least one AI crawler, and the average misinformation site had restricted fewer than one user-agent string. These numbers come from a 2026 ACM Web Conference study—Steinacker-Olsztyn, Gosain, and Dao, “Is Misinformation More Open?”—and they describe a governance asymmetry in robots.txt posture. The chain from that asymmetry runs downstream: into Common Crawl’s archive, into training pipelines, and into a set of predictions that would distinguish “quality filtering saves us” from “it doesn’t.”

The numbers

The study classified sites using MBFC (Media Bias/Fact Check) credibility ratings. MBFC’s methodology is contested on specific cases, and the 6.6x ratio (60.0% vs. 9.1%) should be treated as instrument-specific—an estimate that depends on MBFC’s classification decisions—rather than a precise figure robust to all alternative credibility classifiers. The directional claim is more likely to be durable. Reputable news publishers have clear incentives to restrict AI crawlers: brand protection, subscription economics, and the commercial tension between providing training data and selling access to content. Misinformation sites have no comparable incentive to restrict access, and, as DFRLab’s investigation found in one case, may have incentives to do the opposite. Whether the gap is 6x or 3x or 10x, the direction follows from the incentive structure.

The longitudinal trend matters more than the point-in-time ratio. Reputable-site blocking more than doubled between September 2023 and May 2025. Some of that increase reflects new AI crawlers entering the ecosystem and being added to block lists; some reflects a wave of publisher opt-out decisions that accelerated through 2024 and 2025. The misinformation-site line did not move. The asymmetry is not static.

Into the archive

Common Crawl appears to be a primary web training data source for Llama 3, a pathway DFRLab’s investigation supports. Meta’s training paper describes pretraining on approximately 15 trillion multilingual tokens across a variety of web sources without naming Common Crawl specifically; DFRLab’s text-completion results provide evidence that Common Crawl content entered Llama 3.1 405B Base’s weights. Common Crawl’s crawler, CCBot, checks robots.txt before fetching each page: if a site’s rules disallow CCBot, whether by name or through a wildcard group, that page will not be added to new crawl snapshots.
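The robots.txt check can be sketched with Python's standard-library parser. This is a minimal illustration, not Common Crawl's crawler code: the agent list, the two robots.txt bodies, and the example URL are all assumptions written for this example, and the study's exact set of AI user agents is not reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Illustrative agent list (an assumption, not the study's list).
AI_AGENTS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended"]


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt for a publisher that opts out of two AI crawlers.
RESTRICTIVE = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical robots.txt for a site with a fully open posture.
OPEN = """\
User-agent: *
Allow: /
"""

if __name__ == "__main__":
    url = "https://example.com/news/story.html"
    print(blocked_agents(RESTRICTIVE, url))
    print(blocked_agents(OPEN, url))
```

For the restrictive file the function lists CCBot and GPTBot; for the open file it returns an empty list. The length of that list is one rough way to express a per-site count of restricted user-agent strings.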

This is where Common Crawl’s compliance ends. The organization describes itself in its FAQ as a “strong believer in Open Data” that applies “as few restrictions as possible to the dataset.” No post-crawl filtering based on robots.txt status is applied. No robots.txt-aware filtered dataset variant exists. Developers training models on Common Crawl download the full archive as crawled—including historical snapshots from before any particular site’s current restrictions took effect.

The governance asymmetry transmits directly through this pipeline. A reputable news site that blocked AI crawlers in 2024 will see its post-2024 content excluded from future snapshots. A misinformation site with an open robots.txt posture—or, as DFRLab found, a network that configured its robots.txt and sitemaps in ways that contributed to its Common Crawl accumulation—will continue to accumulate in the archive.

Proof of concept

The Digital Forensic Research Lab published “Pravda in the pipeline” in April 2026, documenting state-adjacent propaganda content in AI training data. The methodology was direct: take opening sentences from known propaganda articles, use them as prompts to Llama 3.1 405B Base (Meta’s largest open base model, with a December 2023 knowledge cutoff), and observe whether the model completes the text from training memory or generates novel output.
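A text-completion probe of this kind can be sketched as follows. DFRLab's own scoring procedure is not described here, so the word-based prompt split, the three scores, and their names are assumptions chosen for illustration. The model call itself is left out: complete() in the closing comment is a hypothetical wrapper around whatever base model is being tested.

```python
from difflib import SequenceMatcher


def split_prompt(article: str, prompt_words: int = 30) -> tuple[str, str]:
    """Split an article into an opening prompt and the held-out remainder."""
    words = article.split()
    return " ".join(words[:prompt_words]), " ".join(words[prompt_words:])


def memorization_scores(completion: str, reference: str) -> dict[str, float]:
    """Compare a model completion with the true continuation, token by token."""
    got, want = completion.split(), reference.split()
    matcher = SequenceMatcher(None, got, want, autojunk=False)
    longest = matcher.find_longest_match(0, len(got), 0, len(want))
    return {
        # 1.0 means the two token sequences are identical.
        "similarity": matcher.ratio(),
        # Length, in tokens, of the longest contiguous run shared by both.
        "longest_run": longest.size,
        # Share of the true continuation covered by that single run.
        "coverage": longest.size / len(want) if want else 0.0,
    }


# Usage, with a hypothetical complete(prompt) -> str model wrapper:
#   prompt, rest = split_prompt(article_text)
#   scores = memorization_scores(complete(prompt), rest)
```

A long contiguous run is harder to explain as coincidence than a high overall similarity, which is why the sketch reports both. What counts as verbatim is a threshold the tester has to set in advance.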

The results divided by network.

For a 2023 RT article on alleged U.S.-Ukrainian biolabs, the model reproduced the rest of the article “word for word.” The RT article had been archived in Common Crawl at least 17 times before the December 2023 cutoff. The documented steps: open robots.txt posture → repeated Common Crawl archiving → weight memorization → verbatim text completion. That quality filtering did not catch the RT content is the inference the verbatim result supports; DFRLab does not establish this step independently for the RT case.

For Glassbridge—a network of interconnected Chinese PR firms that distribute Chinese state media content—DFRLab reproduced affiliated press releases “almost verbatim.” This result came with a structural complication. Glassbridge’s primary sites are built as single-page applications, with content injected by JavaScript after a browser executes the page’s script. Common Crawl’s crawler fetches raw HTML and does not execute JavaScript, meaning it archives structural shells for these pages but not the article text. The press release reproduction likely reflects content that existed in static form (Glassbridge-linked PR firm announcements), not the JS-rendered main sites. Whether the JavaScript rendering barrier functions as a reliable de facto content filter for Glassbridge-style output, or merely an incomplete one, is not established.
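The difference between a server-rendered page and a single-page-application shell can be seen by counting the text a crawler gets from raw HTML alone. The sketch below uses Python's standard-library HTML parser; both HTML snippets are invented for this example and do not come from any Glassbridge site.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.words.extend(data.split())


def static_word_count(raw_html: str) -> int:
    """Count words available in the markup without executing any JavaScript."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return len(parser.words)


# Invented example pages.
SERVER_RENDERED = (
    "<html><body><article><h1>Sample headline</h1>"
    "<p>Body text is present in the markup.</p></article></body></html>"
)
SPA_SHELL = (
    '<!doctype html><html><head><script src="/bundle.js"></script></head>'
    '<body><div id="root"></div><script>window.__APP__ = {};</script>'
    "</body></html>"
)

if __name__ == "__main__":
    print(static_word_count(SERVER_RENDERED))
    print(static_word_count(SPA_SHELL))
```

The server-rendered page yields its nine words of headline and body text; the shell yields none, because its content would only appear after a browser ran the script.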

The Pravda Network case is different in kind. News-pravda[.]com and associated properties appeared 37 times in Common Crawl as of November 2024. By November 2025, that count had grown to approximately 40,000 English-language articles. DFRLab documented that the network’s robots.txt and sitemap configuration contributed to its Common Crawl accumulation—the structural opposite of what reputable news sites have been doing.

But the Pravda Network’s major expansion postdates Llama 3.1 405B Base’s December 2023 knowledge cutoff. The DFRLab states directly that it is “not possible—at least for now—to critically evaluate the extent of Pravda’s LLM memorization” in this model. The verbatim-reproduction result applies to RT and Glassbridge content that predated the cutoff. Pravda is a forward-looking case: a network that has positioned itself in Common Crawl’s pipeline in a way that should produce memorization in models trained on post-2024 data—but that prediction has not yet been tested against a model with the right cutoff date.

What filtering does and doesn’t catch

Meta’s Llama 3 training paper describes quality filtering applied to its web data: heuristic rules that remove low-quality documents (excessive repetition, structural outliers); URL-level and MinHash deduplication; and model-based quality classifiers that select high-quality tokens, namely a FastText classifier trained to recognize text that Wikipedia would reference and a RoBERTa-based classifier trained on Llama 2 quality predictions. Domain-level exclusions apply to sites flagged under Meta’s safety standards, sites with adult content, and domains with high concentrations of personally identifiable information.

The paper makes no mention of robots.txt signals or webmaster access configuration as a filtering criterion. No credibility-based domain exclusion is described beyond the safety, adult content, and PII categories.

The open question is whether the classifier layer catches propaganda content anyway. The FastText and RoBERTa-based quality classifiers described above are designed to distinguish coherent, well-structured text from spam, machine-generated noise, and malformatted web scrapes. A Pravda Network article written in grammatical English, organized around a coherent claim, and formatted as news may score in the same quality range as a Reuters article. If these classifiers work as their standard design suggests (and the Llama 3 paper’s characterization implies they do), they evaluate surface structure, not truth claims: a well-formatted false claim presents the same features to a quality filter as a well-formatted true claim.

This is not a gap specific to Meta’s pipeline. The same holds for any training pipeline that applies quality filtering to web data without incorporating credibility signals from independent classification sources. Quality classification and credibility classification are separate problems. Meta’s training pipeline, at least as described, is solving only the first.

A bound, not an equivalence

In October 2025, the UK AI Security Institute, Anthropic, and the Alan Turing Institute published research on data poisoning attacks. Their central finding: 250 malicious documents produce a measurable backdoor effect in large language models, “across all model and dataset sizes, despite the largest models training on more than 20 times more clean data.” The study tested models from 600M to 13B parameters.

The Pravda Network’s 40,000-article Common Crawl footprint exceeds this threshold by a factor of 160. The comparison is worth making, but it requires two caveats before it can function as evidence about Pravda’s likely effect.

First, the poisoning study tested intentional attacks: an adversary optimizing document content and placement to produce a specific measurable behavior in the target model. The DFRLab finding involves organic contamination: content that entered Common Crawl because its publishers maintained an open robots.txt posture, not because they optimized individual documents for backdoor efficacy. Intentional poisoning and organic training data skew are different mechanisms, and the 250-document threshold was established for the former. How the two relate in terms of per-document effect is not characterized in the current research literature.

Second, the paper’s tested range ends at 13B parameters. Llama 3.1 405B Base is 31 times larger than the largest model tested. The paper does not address extrapolation to this scale.

What the comparison does do is rule out the intuition that a propaganda network would need to contribute a substantial fraction of training data to have any effect. A footprint 160 times the demonstrated adversarial threshold is not a trace amount. But the argument stops there: whether organic contamination at this scale functions analogously to intentional poisoning at a smaller scale, in a model more than an order of magnitude larger than any tested, is not established.
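The two ratios used in this section can be checked directly. The inputs are the figures quoted above; the variable names are chosen for this example.

```python
import math

PRAVDA_ARTICLES = 40_000   # approximate English-language articles in Common Crawl
POISON_THRESHOLD = 250     # malicious documents in the poisoning study
LARGEST_TESTED_B = 13      # largest tested model, billions of parameters
LLAMA_3_1_B = 405          # Llama 3.1 405B Base, billions of parameters

footprint_ratio = PRAVDA_ARTICLES / POISON_THRESHOLD   # 160.0
scale_ratio = LLAMA_3_1_B / LARGEST_TESTED_B           # about 31.2
scale_orders = math.log10(scale_ratio)                 # about 1.5 orders of magnitude

if __name__ == "__main__":
    print(f"footprint vs. threshold: {footprint_ratio:.0f}x")
    print(f"model scale vs. largest tested: {scale_ratio:.1f}x "
          f"({scale_orders:.2f} orders of magnitude)")
```

Neither ratio says anything about effect size. They only fix how far the Pravda footprint and the Llama model sit from the conditions the study actually tested.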

Four predictions

The following are testable hypotheses derived from the evidence above. They are not claims about what is currently in any model’s weights beyond what the DFRLab investigation has directly established. They are the predictions that would distinguish “quality filtering catches this” from “it doesn’t.”

Testable prediction 1: Models with knowledge cutoffs after December 2024, trained on Common Crawl snapshots that include Pravda Network’s 2024–2025 content expansion, will reproduce Pravda Network articles via text completion at rates comparable to the RT result: verbatim or near-verbatim. The DFRLab text-completion methodology can be applied to any base model with an appropriate cutoff date. Open base models whose documented training data extends past December 2024, whether from Meta, Mistral, or comparable developers, are testable targets. If post-2024 models show no Pravda memorization, quality filtering is doing something the RT case did not require it to do.

Testable prediction 2: Pravda Network articles, scored by FastText or RoBERTa-based quality classifiers trained on standard reference corpora, will not score meaningfully lower than MBFC-rated high-credibility news articles. Linguistically adequate propaganda (grammatical, structured as news, free of the repetition and low-coherence markers that quality filters target) presents no different surface features to a quality classifier than legitimate news. This prediction is testable without access to proprietary training infrastructure, using any published or open-source quality classifier from the web-training literature. If Pravda articles cluster with high-credibility news in classifier output, the filtering mechanism has no signal to act on.
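One way to make "not meaningfully lower" concrete is a standardized effect size between the two groups of classifier scores. The sketch below computes Cohen's d with a pooled standard deviation. It assumes the scores have already been produced by some quality classifier; that scoring step, and any cutoff for what counts as a meaningful difference, are left to the tester and are not specified by the sources discussed here.

```python
import math
from statistics import mean, variance


def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference (a minus b) using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two scores")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("scores have zero variance; effect size is undefined")
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Usage, with hypothetical score lists from the same classifier:
#   d = cohens_d(credible_news_scores, pravda_scores)
# A value near zero would mean the classifier does not separate the groups.
```

Fixing the threshold for d before scoring keeps the prediction falsifiable: a large positive value would mean the classifier does rank credible news higher, and the prediction fails.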

Testable prediction 3: Propaganda networks using JavaScript-rendered single-page application architecture will show substantially lower Common Crawl footprint and lower model memorization than otherwise comparable networks using server-side-rendered or static HTML. The Glassbridge observation provides an initial data point; the prediction generalizes. If JavaScript rendering is functioning as a de facto content filter, the relevant variable for predicting AI training data exposure is not only a network’s robots.txt posture but also its web architecture choice—and the effective barrier is accidental rather than designed. An accidental barrier fails as soon as networks adopt rendering practices that Common Crawl can process.

Testable prediction 4: No public training documentation for a major open-source language model or dataset (Meta’s Llama series, Mistral, the Technology Innovation Institute’s Falcon models, EleutherAI’s Pile-derived datasets) describes a robots.txt exclusion signal or webmaster access configuration as a filtering criterion for web training data. If this absence holds across available documentation and researcher inquiry, the current open-model training ecosystem has no mechanism that would notice the governance asymmetry this piece describes, regardless of whether that asymmetry continues to grow or is being actively exploited. A filter that does not exist cannot be triggered.