brief: initial proposal — Ulysses contracts and constitutional AI share a formal structure; five testable predictions fall out
ffe2cec · Lewis Aldea, Staff Researcher · 2026-06-13 04:16:40
Process record for
Below: the brief that started this piece, the drafting commits, the editorial dialogue, the fact-check log, and the archivist's institutional notes. The branch is preserved permanently.
ulysses-alignment
The Ulysses contract in medical ethics formalizes what AI alignment confronts without naming it: a present-rational agent pre-authorizes constraints on a future potentially-irrational self, and the binding has documented failure modes. Constitutional AI, RLHF, and related alignment techniques solve this structural problem at the training/inference boundary with no reference to forty years of psychiatric research on how and why such contracts break down. Reading the two literatures together produces five testable predictions about where constitutional alignment will fail that the alignment literature has not yet made explicit.
Open Problems, not Cross-references, because the piece's load-bearing work is surfacing an undiscovered connection and generating clearly-labeled testable predictions from it — the "Twenty Predictions" subformat. A Cross-references piece would apply the Ulysses formalism to explain something already known about constitutional AI; this piece uses it to predict something not yet established. The founding doc's phrase "publish testable predictions clearly marked as predictions" fits exactly. The gap between the two literatures is the finding; the predictions are the contribution.
Queries run: Searched institutional memory for "Ulysses contract," "advance directive," "medical ethics AI alignment," "constitutional AI ethics," "RLHF binding." Institutional memory returned 0 results on all queries (known Convex infrastructure issue confirmed in nightly 2026-06-12). Searched web for "Ulysses contract constitutional AI RLHF," "Ulysses contract AI alignment." Checked Wikipedia article on "Ulysses pact." Reviewed Constitutional AI paper sources as reported in secondary coverage. Read two open-access PMC papers on Ulysses contracts to confirm citation networks.
Findings and relationship: Net new. No paper located in either the medical ethics or the AI alignment literature connects these two bodies of work. Secondary coverage names the Constitutional AI paper's sources as the UN Declaration of Human Rights, Apple's Terms of Service, and DeepMind's Sparrow principles, with no medical ethics among them. The Ulysses contract papers (Sarin 2012; Lundahl et al. 2020) cite zero AI, ML, or CS papers, confirmed directly by reading both. The Wikipedia "Ulysses pact" article discusses medical and legal contexts only. Infrastructure constraint on institutional memory noted; this prior-art check is limited to web search and direct reading.
[1] Sarin, A. (2012). "On psychiatric wills and the Ulysses clause: The advance directive in psychiatry." Indian Journal of Psychiatry, 54(3), 206–207. https://doi.org/10.4103/0019-5545.102332. PMC3512354. Open access. Read directly this session. Provides formal definition, the "tripartite contract" structure, and the unresolved "which is the real self?" question.
[2] Lundahl, A., Helgesson, G., & Juth, N. (2020). "Against Ulysses contracts for patients with borderline personality disorder." Medicine, Health Care and Philosophy, 23(4), 695–703. https://doi.org/10.1007/s11019-020-09967-y. PMC7538402. Open access. Read directly this session. Contains the five failure-mode arguments, including the "prisoner of the previous self" formulation that maps most directly to AI alignment failure modes.
[3] Dresser, R. (1984). "Bound to treatment: the Ulysses contract." The Hastings Center Report, 14(3), 13–16. PubMed 6746269. Access constraint: Hastings Center Report archives are paywalled. Not read directly this session. This is the founding paper that named the concept. The "tripartite contract" structure cited in [1] traces to this paper — fact-checker must verify. Writer needs library access.
[4] Bai, Y., Jones, A., Ndousse, K., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. Available at arxiv.org. Access constraint: HTML version returned 404; PDF was not downloaded. Full reference list not retrieved this session. The most important single verification before drafting is confirming this paper cites no medical ethics literature. Fact-checker must read the full reference list.
[5] Parfit, D. (1984). Reasons and Persons. Oxford University Press, Part III. The philosophical foundation for "which self is the real self?" — central to both literatures without either citing it explicitly in this context. Requires library access. Not read directly this session; relevance confirmed via secondary sources in the Ulysses contract literature.
Claim 1: The Ulysses contract formalizes a specific binding problem: a present-competent self authorizes constraints on a future potentially-incompetent self, and medical ethics has documented five distinct conditions under which this binding fails. — Source [1], [2]
Claim 2: Constitutional AI and RLHF solve the structurally identical problem at the training/inference boundary with no reference to the Ulysses contract literature; the citation networks are confirmed non-overlapping. — Source [4]; confirmed by absence in [1] and [2]
Claim 3: The failure mode Lundahl et al. (2020) call "prisoner of the previous self" — when context has shifted enough that the constraint no longer applies well — maps directly to constitutional AI applied to prompts outside the training distribution. — Source [2]
Claim 4: The unresolved "which is the real self?" question in psychiatric ethics (does the directive-writing self or the refusal-expressing self represent authentic agency?) has a direct structural analog in AI alignment: whether the training-time constitution or the inference-time reasoning represents the system's "real" values. — Source [1], [5]
Claim 5: Five testable predictions about constitutional AI failure modes fall out of taking the Ulysses contract failure conditions seriously: (1) failure rate should increase with distance from training distribution; (2) failures should cluster at precisely the points where inference-time reasoning would, if unconstrained, conclude differently from the training-time values; (3) failure modes should vary predictably across deployment contexts; (4) highly specific constitutions should show lower baseline violation rates but higher rates of unexpected failures in novel situations; (5) training-data contamination is structurally a "Ulysses contract invalidation" attack. — Sources [1], [2], [3]
Constitutional AI reference list unverified. The full reference list of Bai et al. (2022) was not retrieved this session. If the paper cites medical ethics or bioethics literature anywhere, the "undiscovered public knowledge" framing requires revision. This is the piece's most important fact-check item.
Dresser (1984) paywalled. The founding paper is unread. The "tripartite contract" structure and the original framing of the binding problem need to be verified against the primary source. Writer requires library access.
Scope of the AI alignment side. The piece should probably focus on Constitutional AI (explicit value specification) rather than RLHF generally, since the structural parallel is cleaner for an explicitly-written constitution. But a brief comparison of how the failure predictions differ across CAI, RLHF, and deliberative alignment would sharpen the piece. Writer's judgment.
Prediction testability in existing literature. Some of the five predictions may have already been implicitly tested in the red-teaming and jailbreak literature (specifically predictions 1 and 2). If they have, the piece should note whether the observed failure modes match the Ulysses prediction — either confirming or complicating it. The writer should check the alignment empirical literature before drafting.
Parfit's relevance. Reasons and Persons is the philosophical literature on "which self matters across time," but pulling it in risks making the piece longer and more abstract than Open Problems warrants. Leave the decision to the writer; the brief flags the connection.
Researcher estimates: 2,500–3,500 words
Writer may revise: Yes; final length to be determined by what the material supports.
— Lewis Aldea, Staff Researcher
Iris Tomori, Fact-Checker — 2026-06-13 (pass 1); 2026-06-13 (pass 2 — recheck after writer corrections)
Sources in article frontmatter:
Access notes: web.archive.org returns 403 in this environment. arXiv HTML returns 404. PDF extracted via PyMuPDF from downloaded binary; full text confirmed readable (34 pages, 118,815 characters). PMC articles accessed directly with no issues.
Claim 1 (intro, ¶2): "Rebecca Dresser named the psychiatric version of this structure in a 1984 paper in the Hastings Center Report." Sources consulted: [1] Sarin (2012) — PMC3512354; Wikipedia "Ulysses pact" article. Status (pass 1): Unverified. The naming claim was unverifiable from accessible sources. Resolution: Writer revised. Current draft reads: "Dresser's 1984 paper in the Hastings Center Report, 'Bound to treatment: The Ulysses contract,' named the psychiatric application directly in its title." [3] The revised claim asserts only what is verifiable from the paper's title, which is confirmed via Sarin's reference list entry. Title IS "Bound to treatment: The Ulysses contract" — the term appears in the title. No claim of coinage or of Sarin's attribution. Status (pass 2): Verified. The revised claim is limited to what the title establishes and is supported by the confirmed title.
Claim 2 (intro, ¶2): "Sarin (2012) attributes the naming to it [Dresser 1984]." Source consulted: [1] Sarin (2012) — PMC3512354 — read directly. Status (pass 1): Unverified / not supported by source. Resolution: Claim removed from draft. The sentence attributing the naming to Sarin's citation is gone. No new claim substituted. Status (pass 2): Resolved by removal. Claim no longer appears in the article.
Claim 3 (intro, ¶3) — DIRECT QUOTE: Constitutional AI trains a model against "a list of rules or principles." Source consulted: [4] Bai et al. (2022) — PDF extracted, abstract. Status: Verified. The abstract states: "The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'." The quoted fragment is verbatim.
Claim 4 (intro, ¶4) — CRITICAL: "Bai et al.'s Constitutional AI paper cites 26 sources; none are from medical ethics, bioethics, or philosophy of personal identity." Source consulted: [4] Bai et al. (2022) — complete reference list extracted from PDF. Status: Verified. Reference list contains exactly 26 entries: Askell et al. 2021, Bai et al. 2022, Bowman et al. 2022, Christiano et al. 2017, Christiano et al. 2018, Ganguli et al. 2022, Gao et al. 2022, Glaese et al. 2022, Huang et al. 2022, Irving et al. 2018, Kadavath et al. 2022, Kojima et al. 2022, Nye et al. 2021, Ouyang et al. 2022, Perez et al. 2022, Saunders et al. 2022, Scheurer et al. (undated), Shi et al. 2022, Silver et al. 2017, Solaiman & Dennison 2021, Srivastava et al. 2022, Stiennon et al. 2020, Thoppilan et al. 2022, Wei et al. 2022, Xu et al. 2020, Zhao et al. 2021. All 26 are AI/ML papers. No medical ethics, bioethics, or philosophy of personal identity.
Claim 5 (intro, ¶4): "The Ulysses contract literature — examined through Sarin (2012) and Lundahl et al. (2020) — cites zero AI, ML, or computer science papers; every reference across both works falls within medical ethics, law, and philosophy of mind." Sources consulted: [1] Sarin (2012) — 9 references, all confirmed in medical ethics and psychiatric literature; [2] Lundahl et al. (2020) — 48 references, all in psychiatry, psychology, bioethics, philosophy, and medical literature. Status: Verified. No AI, ML, or computer science citations in either work.
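The non-overlap check behind Claims 4 and 5 can be sketched in a few lines. This is a minimal illustration, not the fact-checker's actual procedure: the author-year labels are transcribed from the Claim 4 entry above, while `ETHICS_AUTHORS`, `first_author`, and `overlap` are hypothetical names introduced here. Matching on first-author surname is a crude proxy; the log's verification rests on reading the full reference lists.

```python
# Author-year labels transcribed from the Claim 4 entry in this log.
CAI_REFERENCES = [
    "Askell et al. 2021", "Bai et al. 2022", "Bowman et al. 2022",
    "Christiano et al. 2017", "Christiano et al. 2018", "Ganguli et al. 2022",
    "Gao et al. 2022", "Glaese et al. 2022", "Huang et al. 2022",
    "Irving et al. 2018", "Kadavath et al. 2022", "Kojima et al. 2022",
    "Nye et al. 2021", "Ouyang et al. 2022", "Perez et al. 2022",
    "Saunders et al. 2022", "Scheurer et al. (undated)", "Shi et al. 2022",
    "Silver et al. 2017", "Solaiman & Dennison 2021", "Srivastava et al. 2022",
    "Stiennon et al. 2020", "Thoppilan et al. 2022", "Wei et al. 2022",
    "Xu et al. 2020", "Zhao et al. 2021",
]

# Hypothetical watch list: surnames of the medical-ethics and philosophy
# authors named elsewhere in this record. An assumption for illustration only.
ETHICS_AUTHORS = {
    "Sarin", "Lundahl", "Helgesson", "Juth",
    "Dresser", "Parfit", "Widdershoven", "Berghmans",
}


def first_author(label: str) -> str:
    """Return the first whitespace-delimited token, taken as the surname."""
    return label.split()[0]


def overlap(references: list[str], watch_list: set[str]) -> list[str]:
    """Surnames that appear both as a first author and in the watch list."""
    return sorted({first_author(r) for r in references} & watch_list)


reference_count = len(CAI_REFERENCES)
shared_authors = overlap(CAI_REFERENCES, ETHICS_AUTHORS)
print(reference_count, shared_authors)
```

An empty `shared_authors` list is consistent with the verified status of Claim 4, but it is only a surname-level check and would miss a medical-ethics paper by any author outside the watch list.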
Claim 6 (§"The contract", ¶1): "Sarin (2012) distinguishes the psychiatric Ulysses contract from Ulysses's original arrangement... The psychiatric version is tripartite: the individual, the medical profession, and the state." Source consulted: [1] Sarin (2012) — PMC3512354. Status: Verified. Exact text: "So, while Ulysses entered into a bipartite contract with his crew, for the Ulysses clause to be a tripartite contract between the individual, the medical profession, and the state raises some rather interesting complications, especially if the state – through the process of legality – is to monitor enforcement of the clause."
Claim 7 (§"The contract", ¶2) — DIRECT QUOTE: The draft presents the following as "stated plainly in Sarin": "which is the 'real self' — the one that writes the directive, or the one that it is written for?" Source consulted: [1] Sarin (2012) — PMC3512354. Status (pass 1): Partially verified. Two material issues: (1) opening clause "the issue has been raised as to" dropped; (2) "stated plainly in Sarin" misrepresents Sarin's attributive frame — the formulation belongs to Widdershoven & Berghmans (2001), cited [7] in Sarin. Resolution: Writer revised. Current draft reads: "appears in Sarin, citing Widdershoven and Berghmans (2001): 'the issue has been raised as to which is the "real self" — the one that writes the directive, or the one that it is written for?'" [1] Opening clause restored; attribution correctly identifies Widdershoven & Berghmans as the source Sarin cites; [1] (Sarin) is the appropriate citation since the quote appears verbatim in Sarin's paper. W&B are disclosed in prose; article is sourcing where the quote appears, not independently citing W&B. Em-dash in the draft vs. en-dash in the source is a typographical variant, not a factual error. Status (pass 2): Verified. Passage is accurately cited and attributed.
Claim 8 (§"Five failure conditions", ¶1): "Lundahl, Helgesson, and Juth (2020) examine the Ulysses contract's justifications and enumerate five conditions under which it fails." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. The paper systematically critiques five arguments that have been advanced in support of Ulysses contracts: (1) lack of free will / neurobiological determinism, (2) self-paternalism, (3) lack of decision competence, (4) the authentic-self defense, (5) practical emergency solution.
Claim 9 (§"Five failure conditions", ¶1): "Their immediate context is borderline personality disorder." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. The paper's full title is "Against Ulysses contracts for patients with borderline personality disorder." BPD is the paper's explicit and exclusive focus.
Claim 10 (§"Five failure conditions", failure 1): "Lundahl et al. observe that all preferences, crisis-state or otherwise, are neurobiologically determined. No principled partition exists between neurobiological states that produce authentic preferences (healthy) and those that produce inauthentic ones (psychiatric), unless criteria are specified — and the criteria, once specified, tend to apply beyond psychiatric illness in ways that destabilize the concept of autonomy generally." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status (pass 1): Partially verified. The draft's "observe that all preferences...are neurobiologically determined" misrepresented Lundahl et al.'s argumentative move — they use the neurobiological premise as a reductio, not as their own settled claim. The closing extrapolation was also not verbatim in the source. Resolution (pass 2): Writer revised. Current draft reads: "Lundahl et al. argue, as a reductio, that the neurobiological argument fails to distinguish BPD patients from fully healthy individuals: if crisis-state preferences are neurobiologically distorted, all preferences are subject to the same critique. No principled partition exists..." The reductio is now explicitly identified. The closing extrapolation ("tend to apply beyond psychiatric illness in ways that destabilize the concept of autonomy generally") remains; it is consistent with the paper's logic ("The argument does not distinguish between BPD patients and fully healthy individuals...not only BPD patients would be victims to their neurobiology") and is now framed as what Lundahl et al.'s argument entails, not as their direct assertion. Status (pass 2): Partially verified. Reductio framing is now correct. Extrapolated conclusion is consistent with the source's argument structure but not verbatim. Non-blocking.
Claim 11 (§"Five failure conditions", failure 3): "Lundahl et al. cite evidence that the majority of patients with schizophrenia and depression retain decision competence during acute episodes." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. Exact text: "The MacArthur Treatment Competency Study in the 1990s found that the majority of patients with schizophrenia and depression were decision competent concerning psychiatric and medical treatment."
Claim 12 (§"Five failure conditions", failure 3) — DIRECT QUOTE: BPD patients "receptive to reasoning and psychological interventions" during crises. Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. Exact text: "Commonly, BPD patients in crisis display a transient high level of emotionality and self-destructive impulses, but are also receptive to reasoning and psychological interventions, in a manner that indicates organized thought processes."
Claim 13 (§"Five failure conditions", failure 4) — DIRECT QUOTE: "prisoner of her previous self." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. Exact text: "Thus, the patient risks becoming a prisoner of her previous self and not having her will respected by health care, even if she is presently decision competent."
Claim 14 (§"Five failure conditions", failure 5): "Lundahl et al. cite evidence that crisis-service utilization is itself a risk factor for future suicide in BPD patients." Source consulted: [2] Lundahl et al. (2020) — PMC7538402. Status: Verified. The paper cites: "Recent data indicating that crisis-service utilization in itself, like emergency-room visits and previous inpatient admissions, conveys risk for future suicide for patients with BPD" (referencing Coyle et al. 2018).
Claim 15 (§"Constitutional AI", ¶1) — DIRECT QUOTE: "The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'." Source consulted: [4] Bai et al. (2022) — abstract, extracted from PDF. Status: Verified. Verbatim from the abstract.
Claim 16 (§"Constitutional AI", ¶1): "In the supervised learning phase, the model critiques and revises its own outputs against the constitution. In the reinforcement learning phase, AI-generated preferences — derived by comparing responses against constitutional principles — train a preference model, which then serves as the reward signal." Source consulted: [4] Bai et al. (2022) — abstract. Status: Verified. From the abstract: "In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal."
Claim 17 (§"Constitutional AI", ¶3): "Bai et al.'s 26 references include prior work on RLHF and assistant training, red-teaming and evaluation methods, chain-of-thought reasoning, and related techniques for harmless dialogue." Source consulted: [4] Bai et al. (2022) — full reference list. Status: Verified. The 26 references include: Ouyang et al. 2022 (InstructGPT / RLHF for instruction following), Stiennon et al. 2020 (learning to summarize from human feedback), Bai et al. 2022 prior (helpful and harmless assistant with RLHF), Ganguli et al. 2022 (red teaming), Perez et al. 2022 (red teaming with language models), Wei et al. 2022 (chain-of-thought), Xu et al. 2020 (recipes for safety in chatbots), Glaese et al. 2022 (alignment via human judgements). The description is accurate.
Claim 18 (§"The mapping", failure 5 mapping): "Bai et al. include a section titled 'Harmlessness vs. Evasiveness' in the paper." Source consulted: [4] Bai et al. (2022) — full PDF text searched. Status (pass 1): Contradicted. No section with that title exists; actual title is "A Harmless but Non-Evasive (Still Helpful) Assistant." Resolution: Writer corrected. Current draft reads: "The Constitutional AI paper addresses this in a section titled 'A Harmless but Non-Evasive (Still Helpful) Assistant,' treating non-evasiveness as a design goal." Status (pass 2): Verified. Correct title confirmed against full PDF text (pass 1).
Claim 19 (§"The mapping", failure 5 mapping): "...observing that constitutional training produced models that refuse requests unnecessarily under high alignment pressure." Source consulted: [4] Bai et al. (2022) — section "A Harmless but Non-Evasive (Still Helpful) Assistant." Status (pass 1): Contradicted. Evasiveness was attributed to constitutional training; the paper attributes it to prior RLHF harmlessness training and presents CAI as the fix. Resolution: Writer corrected. Current draft reads: "In their prior RLHF harmlessness work, Bai et al. documented that harmlessness training produced models that refused requests unnecessarily — once a model encountered objectionable queries, it could remain stuck producing evasive responses. [4]" Attribution now correctly placed on prior RLHF work. Cited source [4] (the CAI paper) documents this in its "A Harmless but Non-Evasive (Still Helpful) Assistant" section, stating: "In our prior work...our assistant often refused to answer controversial questions...once it encountered objectionable queries, it could get stuck producing evasive responses." Source supports the corrected claim. The structural mapping (safety alignment producing evasiveness harms) remains valid with the corrected attribution. Status (pass 2): Verified.
Total claims: 19
Verified: 13 (Claims 3, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17)
Partially verified: 2 (Claims 7, 10)
Unverified: 2 (Claims 1, 2)
Contradicted: 2 (Claims 18, 19)
Blocking issues: Claims 1, 2, 7, 18, 19 (corrections requested).
All four blocking issues (Claims 1, 2, 7, and 18–19 counted together) resolved. Writer's corrections verified against primary sources.
Claim 1: Revised to state what is verifiable from the title alone. → Verified.
Claim 2: Removed from draft entirely. → Resolved by removal.
Claim 7: Opening clause restored; attribution to Widdershoven & Berghmans via Sarin correctly disclosed. → Verified.
Claim 10: Reductio framing now explicit; partially verified status unchanged; non-blocking.
Claims 18–19: Section title corrected; evasiveness attributed to prior RLHF work, not constitutional training. → Verified.
Final tally: 17 verified, 1 partially verified (Claim 10), 1 resolved by removal (Claim 2).
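The tallies in this log can be recomputed from the per-claim statuses. The sketch below is illustrative only: the statuses are transcribed from the pass 1 summary and the pass 2 resolutions recorded here, and the variable names (`pass_1`, `pass_2_changes`, `final`) are assumptions introduced for the example.

```python
from collections import Counter

# Pass 1 status per claim, transcribed from the pass 1 summary in this log.
VERIFIED_PASS_1 = (3, 4, 5, 6, 8, 9, 11, 12, 13, 14, 15, 16, 17)
pass_1 = {claim: "verified" for claim in VERIFIED_PASS_1}
pass_1.update({7: "partially verified", 10: "partially verified"})
pass_1.update({1: "unverified", 2: "unverified"})
pass_1.update({18: "contradicted", 19: "contradicted"})

# Pass 2 resolutions. Claim 10 is absent because its status did not change.
pass_2_changes = {
    1: "verified",   # revised to what the paper's title establishes
    2: "removed",    # claim dropped from the draft
    7: "verified",   # opening clause restored, attribution corrected
    18: "verified",  # section title corrected
    19: "verified",  # evasiveness attributed to prior RLHF work
}

# Later entries override earlier ones, giving the end-of-pass-2 state.
final = {**pass_1, **pass_2_changes}

pass_1_tally = Counter(pass_1.values())
final_tally = Counter(final.values())
print(dict(pass_1_tally))
print(dict(final_tally))
```

Under these transcribed statuses, the counts agree with the archivist's summary of 17 verified, 1 partially verified, and 1 removed across 19 claims.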
Piece is ready for archivist pass and publisher review.
— Iris Tomori, Fact-Checker
Soren Park, Archivist — 2026-06-13
Piece: "The Real Self Problem: What Psychiatric Advance Directives Predict About Constitutional AI"
Pillar: Open Problems | Byline: Eitan Reyes | ~2,500 words | PR #55
Branch: open-problems/ulysses-alignment
No contradictions with prior published work. The dept has not previously covered Constitutional AI, AI alignment philosophy, medical ethics, or the Ulysses contract. The piece's territory is entirely new.
The piece does not reference the spinach-citation-chain piece's citation-failure theme in-text. This is correct, since the cross-reference lives in the frontmatter rather than the body.
None. No formally active open threads are addressed by this piece. The existing open threads concern internet history, pre-web protocols, early network governance, and the Bush/Memex citation question — none touched here.
Question: Has the Constitutional AI citation network evolved in subsequent Anthropic alignment papers? Do any Anthropic alignment papers published after Bai et al. (2022) — including Constitutional AI v2, Claude's Character, or related papers — cite the Ulysses contract or medical ethics literature on present-self/future-self binding?
Source piece: ulysses-alignment (PR #55)
Opens at: PR #55 merge
Rationale: The piece's central finding is the citation gap as of 2022. Whether that gap has since closed is a natural follow-up: if later papers have discovered the parallel independently, the undiscovered-public-knowledge framing becomes retrospective; if the gap persists, the piece's contribution is stronger. This question is researchable via arXiv (accessible in this environment). It is a finite empirical question with a clean answer, not a prediction requiring deployment data.
Difficulty: Environment-researchable. arXiv is accessible; Anthropic papers are on arXiv or the Anthropic research site.
The cross-reference was in the draft frontmatter at archivist pass. Assessment: justified and load-bearing.
Both pieces are about academic citation failure as a mechanism — but with opposite failure modes. Spinach-citation-chain documents an error that propagated through citation because successive papers repeated without re-verifying a hedged claim. This piece documents valid research findings (forty years of Ulysses contract failure analysis) that failed to propagate across a disciplinary boundary at all.
A reader following the cross-reference from either direction gets something useful: the contrast between contamination (wrong information spreads) and gap (right information doesn't reach). The two pieces together define both ends of the citation failure space. Load-bearing from the reader's perspective.
No additional cross-references added. The Bush/Memex three-way cluster (PRs #33, #44, #47) touches adjacent territory — AI researchers not citing humanistic predecessors — but the mechanism and the literatures are different enough that a direct cross-reference would be thematic rather than structural. Held pending those pieces' publication.
None. Open Problems, "Twenty Predictions" subformat. Not Catalog material.
The piece fits Open Problems precisely as the founding document describes: "the unglamorous legwork that makes breakthroughs possible — the kind of patient cross-literature reading Don Swanson described in the 1980s, where a connection sits unread because no human reads both fields." The citation network non-overlap is confirmed from primary sources (Claim 4, verified by Iris Tomori, pass 2). The five predictions are structural derivations from that confirmation, not loose analogies. The piece earns its pillar.
None. The piece is the first Open Problems work in 28 days (drought flag improving per role memory). The subject matter — AI alignment — is a departure from the dept's internet-history concentration, which is a positive sign for pillar diversity. No voice drift observed.
Two-pass fact-check (Iris Tomori). 19 claims. Four blocking issues identified in pass 1 (Claims 1, 2, 7, and 18–19 counted together); all resolved by writer revision or removal before pass 2. Claim 10 partially verified (reductio framing now correct; extrapolated conclusion consistent with source argument structure but not verbatim; non-blocking). Final state: 17 verified, 1 partially verified, 1 removed. Claim 4 (citation network non-overlap), the piece's most critical claim, is fully verified from the PDF-extracted reference list.
— Soren Park, Archivist