Dead Links and Dead Animals: What Taphonomy Can Teach Web Preservation Science

By 2014, more than 70 percent of URLs cited in sampled academic journals no longer produced what they had been cited to show. Half the URLs in published Supreme Court opinions were dead. [3] A 2013 study of 18,231 Web of Science abstracts covering 1996 to 2010 put the annual decay rate at 3.7 percent, with an R² of 0.96 — a relationship between age and death stable enough to have been predictable rather than merely measurable. [2]

What the URL survival literature hasn’t built is a correction method. The decay rate is measured. The scale of loss is documented. What’s missing is a framework for reasoning about what a surviving, archived web can tell you about the web that existed — the same problem paleontology faced sixty years ago.

Taphonomy solved it.

Taphonomy’s correction

Taphonomy is the science of how organisms enter the fossil record. Most don’t. The question that preoccupied paleontologists through the second half of the twentieth century wasn’t whether the fossil record was biased — it was — but whether the bias was random or structured. A random bias would make the record irreparably unreadable. A structured bias could be incorporated into how the record was interpreted.

The answer, developed through several decades of work and consolidated in studies like Darroch, Fraser, and Casey’s 2021 analysis of North American mammal taxa, is that the bias is structured. Body size is the primary predictor of taphonomic preservation potential, following a log-linear relationship:

log Fs′/Fe = −1.720 + 0.683 log W

where W is body mass in kilograms and Fs′/Fe is the ratio of sampled to expected carcasses. Applied to 374 North American mammal species, this equation produces an approximately lognormal distribution of preservation potentials, with the vast majority of species exhibiting low chances of fossilization: a long tail of poorly preserved small-bodied organisms and a short tail of well-preserved large-bodied ones. [1]
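As a worked illustration of the equation, the following Python sketch evaluates the sampled-to-expected ratio for a few body masses. It assumes the logarithms are base 10, which the text does not state, and the masses are arbitrary round numbers rather than species from the study.

```python
import math

# Coefficients as quoted in the text: log(Fs'/Fe) = -1.720 + 0.683 log W
INTERCEPT = -1.720
SLOPE = 0.683


def preservation_ratio(mass_kg: float) -> float:
    """Return Fs'/Fe for a body mass in kilograms.

    Assumption: logarithms are base 10 (not stated in the text).
    """
    if mass_kg <= 0:
        raise ValueError("body mass must be positive")
    log_ratio = INTERCEPT + SLOPE * math.log10(mass_kg)
    return 10 ** log_ratio


if __name__ == "__main__":
    # Arbitrary illustrative masses, not data from the study.
    for mass in (0.01, 1.0, 100.0, 1000.0):
        print(f"{mass:>8.2f} kg -> Fs'/Fe = {preservation_ratio(mass):.4f}")
```

Under these assumptions a 1 kg animal comes out near 0.019 and a 100 kg animal near 0.44. Each tenfold increase in mass multiplies the ratio by 10^0.683, or roughly 4.8.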

The fossil record does not randomly sample past life. It samples a specific kind of life, in a way that is calculable before any fossil is found.

Darroch et al. also show that despite this bias, historical biogeographic patterns can be reconstructed from biased assemblages, provided the bias structure is understood. Their simulations show Kendall’s Tau correlations between modeled and actual species distributions dropping from 0.7–0.9 in unfiltered conditions to 0.0–0.4 when taphonomic filters are applied. Bird castings provide a supplementary data source: gastric pellets disproportionately preserve the small-bodied prey that taphonomic filtering removes from the main record. The pattern largely recovers — to 0.4–0.8 — when bird castings are incorporated. [1]
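For readers unfamiliar with the statistic, here is a minimal sketch of what a Kendall's Tau comparison between a true pattern and a filtered one measures. It implements the simple tau-a variant, which may not be the variant Darroch et al. used. The regional species counts are invented for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# INVENTED species counts in five regions (not data from the study).
true_richness = [10, 20, 30, 40, 50]
mildly_filtered = [4, 9, 12, 21, 18]   # one pair of regions swaps rank
heavily_filtered = [2, 1, 3, 1, 3]     # ordering mostly lost, with ties

print(kendall_tau_a(true_richness, mildly_filtered))   # 0.8
print(kendall_tau_a(true_richness, heavily_filtered))  # 0.2
```

A value near 1 means the filtered record still ranks regions in the same order as the true pattern. A value near 0 means the ranking no longer carries that information.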

The principle: when you know which part of the record is systematically absent, you can reason about what it would take to compensate, and you can interpret the uncompensated record knowing what it is hiding. A biased record that is understood is recoverable. A biased record mistaken for a random one is not.

URL depth

Hennessey and Ge’s 2013 study examined 18,231 Web of Science abstracts spanning 1996–2010 and checked whether each URL they contained still resolved. Median URL lifespan: 9.3 years. Annual decay rate: 3.7 percent, R² = 0.96. Of published URLs, 69 percent remained accessible on the live web; 62 percent were archived by the Internet Archive; 21 percent by WebCite. [2]

The aggregate figures establish that URL decay is both rapid and consistent. The discipline breakdown establishes that it is structured. Computer Science URLs had a median lifespan of 8.3 years and 59 percent survival; Zoology URLs had a median lifespan of 11.2 years and 89 percent survival. That 30-point gap, within a single study applying the same methodology over the same period, is not noise. Something about how different fields publish and host their sources produces systematically different survival rates. [2]

For Internet Archive coverage specifically, URL directory depth was the dominant predictor, accounting for 45 percent of explained deviance. The Internet Archive appears to prioritize breadth over depth — whether because popular URLs happen to sit at lower depths, or because the crawl algorithm itself favors them — meaning shallow URLs are more likely to be captured regardless of content. A resource at example.com/paper.html has better archival odds than one at example.com/proceedings/2003/section4/subsection-b/paper.html, whether or not the second is more significant. [2]
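To make the depth variable concrete, the sketch below counts directory levels for the two example URLs. It assumes "depth" means the number of path directories above the final resource; the study's exact definition may differ. The function name is introduced here for illustration only.

```python
from urllib.parse import urlparse


def url_directory_depth(url: str) -> int:
    """Count directory segments between the host and the final resource.

    Assumption: "depth" means the number of path directories above the
    file; the study's exact definition may differ.
    """
    # urlparse only recognises the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    if path.endswith("/"):
        # A trailing slash means every segment is a directory.
        return len(segments)
    # Otherwise the last segment is the resource itself.
    return max(len(segments) - 1, 0)


shallow = "example.com/paper.html"
deep = "example.com/proceedings/2003/section4/subsection-b/paper.html"
print(url_directory_depth(shallow))  # 0
print(url_directory_depth(deep))     # 4
```

Under this definition the two URLs differ by four levels, although both point to a file named paper.html.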

This is a preservation potential model in everything but name. A structural property of the resource, independent of its content or importance, predicts whether it enters the archive.

Taphonomy shows the same variability even within a single preservation medium. Dominican amber preserves 93 percent of internal soft tissue in amber-entombed insects; French Charentes amber preserves zero percent. The mechanism is resin chemistry, not the significance of what was trapped. [4] On the web, hosting infrastructure plays an equivalent role: not all hosting environments preserve equally, even when the content they carry is identical in form and weight.

The same structure

Body mass and URL directory depth, the two predictor variables, are both properties of the object being preserved, not measures of the object’s importance.

Taphonomy drew a specific conclusion from this: the structured nature of the bias makes correction possible. Paleontologists now routinely distinguish taphonomically biased assemblages from relatively unbiased ones. They interpret biased assemblages with the bias in mind rather than treating the sample as representative. The field developed this practice because the alternative — reading the fossil record as a random sample of past life — produced conclusions that didn’t hold up when checked against better-documented assemblages.

The URL survival literature identified the dominant predictor but hasn’t built the equivalent correction. Hennessey and Ge note the depth finding; they don’t derive a URL preservation potential function from it or apply that function to characterize the systematic gaps in existing archived collections. The literature knows the record is skewed. It has not yet asked what a corrected record would look like, or what a taphonomically screened archived snapshot — one with known bias properties — would allow historians to conclude that a raw snapshot would not.
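One possible shape for such a correction is inverse-probability weighting, a standard survey-sampling technique. It is swapped in here purely as an illustration and comes from neither Hennessey and Ge nor the taphonomy literature. The capture probabilities, category names, and snapshot below are all hypothetical placeholders.

```python
from collections import defaultdict

# HYPOTHETICAL capture probabilities by URL depth. Placeholders only:
# no published study supplies these values.
CAPTURE_PROBABILITY_BY_DEPTH = {0: 0.8, 1: 0.6, 2: 0.4, 3: 0.2}


def corrected_category_shares(archived_records):
    """Estimate original category shares from archived records.

    Each record is a (category, depth) pair for a URL found in the archive.
    Every record is weighted by 1 / P(captured | depth), so URLs from
    poorly captured depths stand in for the ones that were lost.
    """
    weighted_counts = defaultdict(float)
    for category, depth in archived_records:
        probability = CAPTURE_PROBABILITY_BY_DEPTH[depth]
        weighted_counts[category] += 1.0 / probability
    total = sum(weighted_counts.values())
    return {category: count / total for category, count in weighted_counts.items()}


# INVENTED snapshot: 8 shallow "institutional" URLs, 2 deep "dataset" URLs.
snapshot = [("institutional", 0)] * 8 + [("dataset", 3)] * 2
raw_share = 2 / len(snapshot)
corrected = corrected_category_shares(snapshot)
print(f"raw dataset share: {raw_share:.2f}")
print(f"corrected dataset share: {corrected['dataset']:.2f}")
```

With these made-up numbers the deep "dataset" category rises from 20 percent of the raw snapshot to 50 percent after weighting. The approach has a hard limit: a class with no survivors at all gets no weight, so it cannot be recovered this way.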

Testable predictions

If URL survival follows a structure analogous to taphonomic preservation potential, three predictions follow.

First: URL survival curves corrected for hosting infrastructure (depth, domain stability, institutional embedding) would show a different historical distribution than raw survival data suggests. A corrected version would show which content categories were systematically over- and under-represented among survivors — the same way taphonomically corrected assemblages reveal which organism types are over- and under-represented in the fossil record. The surviving archive is not a random cross-section of what existed; it is a specific kind of skewed sample that can be characterized.

Second: the discipline gap in URL survival (Computer Science at 59 percent, Zoology at 89 percent) maps onto hosting infrastructure type rather than onto any inherent property of the disciplines’ content. On this hypothesis, Computer Science resources are disproportionately hosted on short-lived infrastructure: personal sites, conference proceedings on since-retired academic servers, startup-hosted datasets. Zoology resources are disproportionately hosted on institutional servers with longer operational continuity. If correct, the 30-point survival gap is a hosting effect, quantifiable by regression analysis, and the disciplines’ content is incidental.

Third: applying taphonomic screening to archived web collections, that is, identifying which URL classes are systematically over-represented and which are absent, would allow web historians to flag biased samples. An archived snapshot of the web in 2004 is not a random sample of the web in 2004. On this account it is a skewed sample, biased toward shallow-URL, institutionally hosted, large-domain resources. That bias is characterizable and, in principle, correctable, the same way paleontologists characterize assemblage bias before drawing ecosystem conclusions.
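The second prediction can be checked in its simplest form by stratification: compare disciplines within each hosting type and see whether the gap persists. The sketch below uses invented records, built so that hosting alone drives survival, and hypothetical hosting labels. A real analysis would fit a regression with further covariates.

```python
from collections import defaultdict


def survival_rates(records, key):
    """Return the share of live URLs for each group defined by key(record)."""
    live = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = key(record)
        total[group] += 1
        live[group] += record["alive"]
    return {group: live[group] / total[group] for group in total}


def make(discipline, hosting, alive, count):
    """Build `count` identical synthetic records."""
    return [{"discipline": discipline, "hosting": hosting, "alive": alive}] * count


# INVENTED records, constructed so that hosting alone drives survival.
records = (
    make("cs", "personal", 1, 20) + make("cs", "personal", 0, 60)
    + make("cs", "institutional", 1, 18) + make("cs", "institutional", 0, 2)
    + make("zoology", "personal", 1, 5) + make("zoology", "personal", 0, 15)
    + make("zoology", "institutional", 1, 72) + make("zoology", "institutional", 0, 8)
)

by_discipline = survival_rates(records, lambda r: r["discipline"])
by_cell = survival_rates(records, lambda r: (r["discipline"], r["hosting"]))
print(by_discipline)
print(by_cell)
```

In this synthetic data the disciplines differ sharply overall (38 percent against 77 percent) yet have identical survival within each hosting type. That is the signature the hosting hypothesis predicts. If a gap remained within strata, the hypothesis would be weakened.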

The two literatures don’t cite each other. A web preservation paper referencing taphonomic methods would be notable enough to appear in a literature search; none does. A paleontology paper citing URL survival research would be stranger still. The fields are adjacent in structure but distant in subject matter, which is the usual condition for the kind of gap Don Swanson identified in the 1980s: a connection that sits unread because no one reads both sides of it.

What transfers from taphonomy to web preservation is not the formula. Log Fs′/Fe = −1.720 + 0.683 log W is a carcass equation; it has no URL equivalent. What transfers is the epistemological move: establish the preservation bias empirically, derive a function that predicts which objects enter the historical record, apply that function to correct for systematic absence before drawing historical conclusions. Taphonomy’s contribution to paleontology wasn’t the specific coefficients. It was the insistence that a skewed record, understood, is more useful than a random sample imagined.

The archived web is a biased sample of the web. It is biased in ways that are documented, structured, and predictable enough to have a dominant predictor variable accounting for 45 percent of explained deviance. What it doesn’t yet have is the discipline that treats understanding the bias as the prerequisite for using the archive.