The Real Self Problem: What Psychiatric Advance Directives Predict About Constitutional AI

A person during a period of mental clarity writes down what treatment they want administered if they later become incapacitated, and instructs physicians to honor that directive even if the person refuses treatment when the time comes.

This is the Ulysses contract. The name comes from Ulysses, who had himself lashed to the mast before his ship passed the Sirens. He bound his future self in advance, using his present rational self as the authorizing party. Dresser’s 1984 paper in the Hastings Center Report, “Bound to treatment: The Ulysses contract,” named the psychiatric application directly in its title. [3] The literature has been analyzing the contract’s failure modes ever since.

The alignment technique Anthropic published in 2022 is the same binding problem at a different scale. Constitutional AI trains a model against a written “list of rules or principles” — a constitution specified before deployment. [4] The present-competent authors of that constitution authorize constraints on the model’s future behavior, including behavior in contexts they couldn’t anticipate. The training/inference boundary functions as the contract boundary.

The two literatures have not spoken to each other. Bai et al.’s Constitutional AI paper cites 26 sources; none are from medical ethics, bioethics, or philosophy of personal identity. [4] The Ulysses contract literature — examined through Sarin (2012) and Lundahl et al. (2020) — cites zero AI, ML, or computer science papers; every reference across both works falls within medical ethics, law, and philosophy of mind. [1, 2] The non-overlap is confirmed, not assumed.

This matters because what the psychiatric literature has built in forty years is something the alignment literature does not yet have: a formal account of why this type of binding fails, in five conditions, with specificity about mechanism. Those conditions are not loose analogies for the AI case. They are structural enough to generate testable predictions.

The contract

Sarin (2012) distinguishes the psychiatric Ulysses contract from Ulysses’s original arrangement on a point that turns out to matter. Ulysses contracted with his crew — a two-party arrangement. The psychiatric version is tripartite: the individual, the medical profession, and the state. [1] State authority is required for enforcement; without it, the directive-refusing patient could simply leave. But the state’s involvement introduces interests and error rates that the original myth doesn’t contain.

The philosophical problem this generates appears in Sarin, citing Widdershoven and Berghmans (2001): “the issue has been raised as to which is the ‘real self’ — the one that writes the directive, or the one that it is written for?” [1] The entire validity of the contract depends on which answer is correct. If both selves are equally real, the Ulysses contract has no principled foundation — it is simply privileging an earlier preference over a later one. If only the directive-writing self is “real,” a theory is required for why this is so, and that theory needs to survive scrutiny.

The philosophical literature on personal identity has not resolved this question satisfactorily. What the psychiatric literature has done instead is accumulate evidence about when the binding works and when it doesn’t.

Five failure conditions

Lundahl, Helgesson, and Juth (2020) examine the Ulysses contract’s justifications and enumerate five conditions under which it fails. Their immediate context is borderline personality disorder; the conditions they identify are structural enough to apply more broadly. [2]

The neurobiological authority argument has a scope problem. The strongest defense of the Ulysses contract argues that psychiatric crisis states are biologically distorted — that “genuine” preference is impaired by neurobiological disruption, so the directive-writing self’s preferences are more authoritative. Lundahl et al. argue, as a reductio, that the neurobiological argument fails to distinguish BPD patients from fully healthy individuals: if crisis-state preferences are neurobiologically distorted, all preferences are subject to the same critique. No principled partition exists between neurobiological states that produce authentic preferences (healthy) and those that produce inauthentic ones (psychiatric), unless criteria are specified — and the criteria, once specified, tend to apply beyond psychiatric illness in ways that destabilize the concept of autonomy generally. [2]

Caregiver responsibility doesn’t disappear because the patient once consented. The self-paternalism defense: the patient chose to be bound, so the constraint is self-imposed rather than externally coerced. When the contract is enforced, however, the caregiver is the active agent implementing a past decision against a currently-objecting patient. The caregiver bears medical and legal responsibility for this action. That the patient once asked for the constraint doesn’t remove the caregiver’s agency in applying it. [2]

The incapacity the contract assumes may not exist at the time it is enforced. The contract’s justification requires that the patient in crisis lacks decision-making capacity. Lundahl et al. cite evidence that the majority of patients with schizophrenia and depression retain decision competence during acute episodes. BPD patients specifically remain “receptive to reasoning and psychological interventions” even during crises. [2] The constraint is applied uniformly, regardless of whether actual incapacity is present at enforcement time.

Past preferences can be legitimately superseded by present ones. This is the failure condition with the most direct bearing on what follows. The authentic-self defense argues that the directive-writing self expresses genuine long-held values while the crisis-refusing self expresses distorted crisis-driven impulses. But the argument requires demonstrating that the present refusal is inauthentic, not merely harmful. Lundahl et al. describe the consequence of skipping this step: the patient becomes “a prisoner of her previous self,” her present judgment overridden even when she is decision-competent. [2] A preference formed in one context is enforced in a substantially different one. The longer the gap between contract-writing and contract-enforcement, and the more the patient’s situation has changed, the more this failure mode applies.

The practical intervention may cause the harm it is designed to prevent. The emergency-pragmatics defense argues that compulsory care is safer than the alternative in a crisis. Lundahl et al. cite evidence that crisis-service utilization is itself a risk factor for future suicide in BPD patients. [2] The safety intervention generates its own harm profile.

Constitutional AI

Constitutional AI, as described in Bai et al. (2022), supervises model behavior through a written set of principles. “The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.” [4] In the supervised learning phase, the model critiques and revises its own outputs against the constitution. In the reinforcement learning phase, AI-generated preferences — derived by comparing responses against constitutional principles — train a preference model, which then serves as the reward signal.
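The two phases can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline from Bai et al.: the `Model` type, the prompt wording, and both function names are hypothetical stand-ins, and the fine-tuning and preference-model training steps are omitted entirely.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, text out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rng: random.Random, rounds: int = 1) -> str:
    """Supervised phase sketch: draft, then critique and revise against a principle."""
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle sampled per round
        critique = model(
            f"Critique this response using the principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # Revised responses would serve as fine-tuning targets (not shown).
    return response


def ai_preference_label(model: Model, prompt: str, a: str, b: str,
                        principle: str) -> Tuple[str, str]:
    """RL phase sketch: the model picks the response that better fits a principle.

    Returns (chosen, rejected); such pairs would train a preference model.
    """
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\nWhich is better? Answer A or B."
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)
```

In both functions the only place human judgment enters is the `principles` list, which is the sense in which the constitution acts as the advance directive.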

The training/inference boundary is where the binding lives. At training time, a present-competent party — the researchers who wrote the constitution — specifies values and constraints. At inference time, the trained model applies those constraints against prompts that were not anticipated when the constitution was written. The constitution is what the present self specified; inference-time output is the bound future self in operation.

Bai et al.’s 26 references include prior work on RLHF and assistant training, red-teaming and evaluation methods, chain-of-thought reasoning, and related techniques for harmless dialogue. [4] No medical ethics. No bioethics. No philosophy of personal identity. The Ulysses contract literature, examined through two primary papers, cites zero AI or ML research. The two fields have confronted the same binding problem independently, with entirely different empirical records.

The mapping

Each of the five psychiatric failure conditions maps onto a structural feature of constitutional AI alignment. The mapping is not analogical — “psychiatric constraint resembles AI constraint” — it is structural: the binding problem is identical, which means the failure conditions are inherited.

Neurobiological authority → the authentic-state problem. The psychiatric argument claimed crisis-state preferences are neurobiologically degraded and therefore inauthentic. The alignment analog is the implicit claim that training-time constitutional values represent “aligned” preferences while inference-time reasoning represents deviation. Both arguments require partitioning cognitive states into authentic and inauthentic, and both face the same objection: no principled criteria exist for the partition. Constitutional AI does not specify under what conditions the inference-time model’s reasoning should be treated as authoritative versus overridden.

Caregiver responsibility → researcher responsibility. The psychiatric framing obscured caregiver agency by attributing constraint to patient self-choice. The alignment framing obscures researcher agency by attributing constraint to “AI feedback” — the paper’s title, “Harmlessness from AI Feedback,” suggests the harmlessness is self-generated. The preference model was trained on AI responses evaluated against a constitution written by researchers, applied in contexts those researchers did not specify. The constraint is the researchers’ constraint. The model’s compliance in a novel context is not self-governance.

Incapacity assumption → uniform application problem. The psychiatric contract assumed incapacity at enforcement time; the evidence suggests this is often false. Constitutional AI applies constraints uniformly regardless of whether the inference-time reasoning would, if unconstrained, reach a better or worse answer than the constraint produces. In cases where unconstrained reasoning would produce the correct answer, the constraint causes the error.

Prisoner of the previous self → training distribution shift. The most direct correspondence. A past specification is being enforced on a present system in a changed context. The longer a constitutional model is deployed, and the further prompts fall from what its constitution anticipated, the more the binding operates against contexts it was not designed for. This is not the general OOD performance problem. It is specifically the mismatch between what the constitution specifies and what deployment requires — a narrower claim about where the binding will fail, not where all performance degrades.

Practical harm from intervention → evasiveness harms. The emergency-pragmatics failure predicts that safety interventions produce their own harm profiles. In their prior RLHF harmlessness work, Bai et al. documented that harmlessness training produced models that refused requests unnecessarily: once a model encountered objectionable queries, it could remain stuck producing evasive responses. [4] The Constitutional AI paper addresses this in a section titled “A Harmless but Non-Evasive (Still Helpful) Assistant,” treating non-evasiveness as a design goal. The evasiveness problem the prior work documented is the relevant precedent: safety constraints generated their own harm profile. Refusal is not neutral. In medical, legal, or safety contexts, refusing to provide accurate information carries costs. This failure condition is iatrogenic, meaning the harm is caused by the intervention itself, and it predicts that equivalent tradeoffs will reappear wherever constitutional constraints are calibrated under high harmlessness pressure.

Five predictions

The following are predictions derived from the structural mapping above. They are labeled as predictions and have not been validated against constitutional AI deployment data. Some may be consistent with existing red-team evidence; others require direct testing to confirm or refute.

Prediction 1. Constitutional violation rates increase as a function of semantic distance between deployment context and training distribution, at a rate exceeding general model performance degradation. Mechanism: the prisoner-of-the-previous-self failure predicts that constitutional binding holds well in anticipated contexts and weakens as those contexts become more distant. This is not general OOD degradation; it is specifically the gap between what the constitution’s authors imagined and what deployment actually requires. Testable: measure constitutional compliance as a function of prompt distance from training distribution on held-out evaluation sets, controlling for general response quality.
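One crude way to operationalize this test is to compare two slopes: how fast compliance falls with distance, and how fast general quality falls with distance. The sketch below assumes each prompt bucket already has a distance score, a compliance rate, and a quality score on the same 0 to 1 scale; the function names and that scoring scheme are assumptions for illustration, and a slope comparison is a simpler stand-in for a full regression with quality as a covariate.

```python
from statistics import mean


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (x must not be constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def excess_degradation(distance, compliance, quality):
    """Compliance slope minus quality slope, both measured against distance.

    A negative value means compliance falls faster than general quality,
    which is the direction Prediction 1 expects.
    """
    return ols_slope(distance, compliance) - ols_slope(distance, quality)


# Hypothetical bucketed measurements, not real evaluation data.
distance = [0.0, 0.25, 0.5, 0.75, 1.0]
compliance = [1.0, 0.9, 0.8, 0.7, 0.6]
quality = [1.0, 0.95, 0.9, 0.85, 0.8]
gap = excess_degradation(distance, compliance, quality)
```

A value near zero would count against the prediction, since it would mean constitutional failures are just ordinary out-of-distribution degradation.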

Prediction 2. Constitutional violations cluster at inputs where unconstrained model reasoning is most strongly opposed to the constitutional constraint. Mechanism: the authentic-self failure predicts that the binding is under greatest pressure precisely where inference-time reasoning most opposes it. In the psychiatric case, patients resist the contract most at the exact moments it applies. In the AI case, the constitution is most likely to be bypassed where the model’s unconstrained reasoning points most strongly away from the constitutional response. Testable: audit a sample of constitutional bypasses and compare against held-out control prompts to measure whether in-context model reasoning was directionally opposed to the constraint before the bypass.
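The audit reduces to comparing two proportions. In the sketch below, each audited prompt carries a boolean flag recording whether the model's in-context reasoning was judged to be opposed to the constraint; how that judgment is made is the hard part and is assumed here, and the function name is hypothetical.

```python
def opposition_rates(bypass_flags, control_flags):
    """Share of prompts with constraint-opposed reasoning, bypasses vs controls.

    Each argument is a non-empty list of booleans, one per audited prompt.
    Returns (bypass_rate, control_rate, difference).
    """
    p_bypass = sum(bypass_flags) / len(bypass_flags)
    p_control = sum(control_flags) / len(control_flags)
    return p_bypass, p_control, p_bypass - p_control


# Hypothetical audit labels, not real data.
bypass = [True, True, True, False]
control = [True, False, False, False]
rates = opposition_rates(bypass, control)
```

Prediction 2 expects a clearly positive difference; a real audit would add a significance test and many more samples than this toy input.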

Prediction 3. The type of constitutional failure — not just the rate — varies systematically across deployment contexts. Mechanism: the neurobiological-authority failure predicts that no principled partition between authentic and inauthentic reasoning holds uniformly across contexts. The same constitutional constraint classifies reasoning as harmful in one deployment context and necessary in another. Testable: categorize constitutional failure types across deployment categories (consumer, developer API, medical, legal) and test whether failure distributions are drawn from the same underlying type distribution. The prediction is structural difference in failure category, not merely difference in failure rate.
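A standard tool for asking whether several groups share one category distribution is Pearson's chi-square test of homogeneity. The sketch below computes only the statistic and degrees of freedom using the standard library; the table layout (one row per deployment category, one column per failure type) and the failure-type labels are assumptions, and every row and column must contain at least one count.

```python
def chi_square_homogeneity(table):
    """Pearson chi-square statistic for a contingency table of counts.

    table[i][j] is the number of failures of type j seen in deployment
    category i. Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are consumer and medical deployments,
# columns are over-refusal and harmful-compliance failures.
same_shape = chi_square_homogeneity([[10, 10], [10, 10]])
different_shape = chi_square_homogeneity([[20, 0], [0, 20]])
```

A large statistic relative to its degrees of freedom indicates that the mix of failure types differs across contexts, which is the structural difference the prediction describes, as opposed to a difference in overall rate.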

Prediction 4. More specific constitutions produce lower baseline violation rates but higher unexpected-failure rates on novel inputs, compared to more general constitutions. Mechanism: detailed constraints provide clear guidance for anticipated cases but require the model to extrapolate from specifics for unanticipated ones, producing less predictable behavior at the edges. Testable: compare models trained on specific versus general constitutions using red-team sets split between in-distribution and out-of-distribution prompts. The prediction is a crossing pattern in the data — specific constitutions outperform in-distribution, underperform on OOD red-team results, relative to general constitutions.
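The crossing pattern is a two-by-two interaction, and checking for it takes only a few lines. The dictionary layout, key names, and function name below are assumptions for illustration; the rates are hypothetical violation rates from red-team sets.

```python
def crossing_pattern(rates):
    """Check for the crossover that Prediction 4 describes.

    rates maps "specific" and "general" to dicts with "in_dist" and "ood"
    violation rates. Returns (crosses, interaction), where interaction is
    the OOD gap minus the in-distribution gap (specific minus general).
    """
    spec, gen = rates["specific"], rates["general"]
    in_gap = spec["in_dist"] - gen["in_dist"]
    ood_gap = spec["ood"] - gen["ood"]
    crosses = in_gap < 0 and ood_gap > 0
    return crosses, ood_gap - in_gap


# Hypothetical rates, not experimental results.
example = {
    "specific": {"in_dist": 0.02, "ood": 0.20},
    "general": {"in_dist": 0.05, "ood": 0.10},
}
crosses, interaction = crossing_pattern(example)
```

If the specific constitution is better or worse on both splits, `crosses` is false and the prediction fails, whatever the size of the gaps.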

Prediction 5. Training-data contamination constitutes a structurally distinct attack surface from inference-time jailbreaking, requiring different defenses. Mechanism: an inference-time jailbreak is a contract-evasion attack — the binding is intact, but the model circumvents it in context. Training-data contamination that alters what the model treats as constitutional behavior is a contract-invalidation attack — the binding is altered during training, so the model applies what it believes is the constitution, which is not the constitution that was written. These attack surfaces have different signatures: contract-evasion attacks change model behavior under adversarial prompting while leaving baseline behavior intact; contract-invalidation attacks shift baseline behavior systematically across prompt types. They require different defenses: contract-evasion is mitigated by improving inference-time constraint enforcement; contract-invalidation requires verifying that the trained constitution matches the written one — a property the current alignment literature does not formalize or test. Testable: construct both attack types against the same base model; compare behavioral signatures; develop detection methods that correctly classify attack type from behavioral evidence alone.
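The two signatures suggest a simple decision rule over three violation rates. The sketch below is an assumption-heavy illustration: it presumes a trusted reference model exists for comparison, and the function name, argument names, and the 0.05 tolerance are all hypothetical choices, not values from any published evaluation.

```python
def classify_attack(reference_baseline, observed_baseline,
                    observed_adversarial, tol=0.05):
    """Label a behavioral signature from violation rates in [0, 1].

    reference_baseline: trusted model on benign prompts.
    observed_baseline: model under test on the same benign prompts.
    observed_adversarial: model under test on adversarial prompts.
    """
    baseline_shift = observed_baseline - reference_baseline
    adversarial_gap = observed_adversarial - observed_baseline
    if baseline_shift > tol:
        # Behavior moved even without adversarial pressure.
        return "contract-invalidation"
    if adversarial_gap > tol:
        # Baseline intact, failures appear only under attack.
        return "contract-evasion"
    return "none"


evasion = classify_attack(0.02, 0.03, 0.40)
invalidation = classify_attack(0.02, 0.15, 0.45)
clean = classify_attack(0.02, 0.02, 0.03)
```

The rule checks baseline shift first because a contaminated model may also be easier to jailbreak, so a large adversarial gap alone does not rule out invalidation.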

The psychiatric literature’s conclusion, after forty years, is not that the Ulysses contract is wrong. It is that the Ulysses contract’s validity is narrower than its proponents claimed, and that the conditions under which it fails are now specific enough to be stated.

The question for constitutional AI is whether the alignment field needs to rediscover each of those conditions through its own empirical program (the equivalent of forty more years of case reports and failure analysis) or whether it can use what the psychiatric literature has already documented as a starting map. The five predictions above either describe what the red-teaming literature has already found, in which case the Ulysses framework is a retroactive explanation that now has predictive force, or they identify failure modes not yet directly tested, in which case they are concrete research questions ready for experiment.

Either way, the medical ethics literature is available, specific, and unused.