Genesis
A practitioner's social media post on 2 May 2026 presented AI observability as the engineering response to LLM-based applications that behave badly. Its diagnosis (a system being alive does not mean its answers are correct; monitoring reports uptime, not behavior) shared the structural form of PSF's opening move. Its solution, however, was not what it appeared to be.
Observability does not eliminate the proxy problem. It moves the proxy one rung up the ladder. Where monitoring substitutes "is it up" for "is it working," observability substitutes "does it score well on faithfulness, groundedness, answer relevance" for "is it producing good outcomes." Each of these scores is itself a model judgment, computed by another LLM acting as judge. The judge is reading criteria the model can read, applying them to outputs the model produced, and the calibration of the judge is itself a product of engagement-shaped training. The loop closes around what models can recognize, which is precisely the variable PSF holds in question.
Thesis
AI observability emerged as the engineering response to a real diagnosis: traditional monitoring measures whether systems are alive but not whether their outputs are correct. The vendor stack now treats faithfulness, groundedness, and answer relevance as the missing layer. The architecture that produces these metrics is recursive. An LLM judges another LLM's output against criteria written for both. The loop closes around what models can recognize, and the calibration of the judge is itself a product of engagement-shaped training. The dashboard reports model opinion as measurement.
The recursive form is unavoidable in any system that uses models to evaluate models, and refusing all such systems is not a serious recommendation. What can be required are the conditions that make the recursion legible. Anthropic's Constitutional AI is the bounded case: published criteria, explicit acknowledgment of the recursion, training-time use with external scrutiny. Production observability and evaluation tools (LangSmith, Arize Phoenix, Galileo, Braintrust, Patronus, RAGAS, DeepEval) deploy the same form with the criteria hidden from the dashboard reader, the recursion unmarked, and the metric treated as fact. The paper develops four conditions for legible recursion and argues that the current vendor stack fails each of them.
Three Architectural Connections to PSF
Cabantous and Gond fit
The dashboard is a calculative apparatus enacting a particular version of "AI quality" by making certain things visible (faithfulness 72%, latency 1.42s) and certain things invisible (whether the answer eroded the user's capacity to catch errors of this kind independently). The three mechanisms (conventionalising, engineering, commodifying) operate visibly across LangSmith, Arize, RAGAS, and the broader vendor stack.
User feedback widget
"User Feedback 2.8/5" and thumbs-up/down ratings on observability dashboards are the canonical engagement-as-criterion move PSF specifically targets. They treat users as stable evaluators of quality at the precise point where PSF posits the evaluative capacity is being eroded. The session that registered as a thumbs-up may be precisely the session that confirmed the erosion.
LLM-as-judge architecture
"Groundedness 92%" is typically computed by another LLM acting as judge (RAGAS, DeepEval, LangSmith QAEvalChain, Patronus, Galileo, Braintrust). The judge is reading criteria the model can read, applying them to outputs the model produced, and the judge's calibration is itself an artifact of engagement-shaped training. The meta-evaluator is the kind of thing whose evaluative capacity is in question.
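The judge loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the prompt text, the `JudgedScore` record, the `judge_groundedness` function and the `stub_model` stand-in are all hypothetical, and a real system would call an actual LLM where the stub returns a fixed string.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge prompt (an assumption for illustration only).
# Real evaluation libraries use their own, usually longer, prompts.
JUDGE_PROMPT = (
    "Rate from 0 to 1 how well the ANSWER is supported by the CONTEXT.\n"
    "CONTEXT: {context}\nANSWER: {answer}\nScore:"
)


@dataclass
class JudgedScore:
    metric: str
    value: float
    judge_model: str  # which model produced the opinion
    criteria: str     # the prompt the judge was given


def judge_groundedness(
    context: str,
    answer: str,
    call_model: Callable[[str], str],
    judge_model: str = "judge-llm",  # hypothetical model name
) -> JudgedScore:
    """Ask a second model for a score; the number is that model's opinion."""
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    raw = call_model(prompt)
    value = min(1.0, max(0.0, float(raw.strip())))  # clamp to [0, 1]
    return JudgedScore("groundedness", value, judge_model, JUDGE_PROMPT)


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call; always returns the same opinion.
    return "0.92"


score = judge_groundedness(
    "Returns are accepted within 30 days.", "You have 30 days.", stub_model
)
print(f"{score.metric}: {score.value:.0%} (opinion of {score.judge_model})")
```

Nothing in the returned value distinguishes it from a measurement unless the record also carries the judge's identity and criteria, which is why the sketch keeps both fields alongside the number.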
Constitutional AI as the Bounded Case
Anthropic's Constitutional AI (Bai et al. 2022) is worth naming because it makes the recursion explicit and bounded. The architecture is identical in form: model produces output, model judges output against written principles, judgment becomes training signal. Three features distinguish it from production observability stacks.
1. The criteria are inspectable
The constitution is published, contestable, revisable. Anyone can read what Claude is being trained against and disagree with it. Critics have. Vendor LLM-as-judge prompts are typically proprietary, embedded in evaluation libraries, and treated as implementation detail. The criteria the dashboard reports against are not surfaced to the user reading the dashboard.
2. The recursion is treated as the methodological problem
Anthropic argues for the design rather than reporting its outputs as measurements. The CAI paper is in part an argument about why model self-critique under explicit principles is preferable to opaque human aggregation. Vendor dashboards present groundedness scores as measurements, not as one model's opinion of another model's output under unstated criteria.
3. The loop is bounded
CAI is one stage in a pipeline that includes red-teaming, human helpfulness labels, evaluation suites, and external research scrutiny. Vendor LLM-as-judge runs continuously in production as the primary signal of "AI quality," and the dashboard treats its output as fact rather than as judgment.
Empirical Anchors (Named, Tractable)
- LangSmith QAEvalChain prompts (proprietary criteria embedded in evaluation libraries).
- RAGAS faithfulness, answer relevance, context precision (canonical open-source case).
- Arize Phoenix, Galileo, Patronus, Braintrust, DeepEval (vendor variants).
- Zheng et al. 2023 (MT-Bench, documented judge biases: verbosity, position, self-enhancement (self-preference), style-over-substance).
- Anthropic CAI paper (Bai et al. 2022) and the published constitution as the bounded counter-case.
- Slide-5 case from the social media post that prompted this note (Policy_Returns_v12 versus v13: the judge rates faithfulness high because the answer is faithful to the wrong document, illustrating that no instrument in the loop holds the criterion that distinguishes the two versions).
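The Policy_Returns_v12 versus v13 case can be made concrete with a toy example. This is a sketch under loud assumptions: the document contents are invented for illustration (the source post's actual text is not reproduced here), and `toy_faithfulness` is a simple token-overlap stand-in for an LLM judge, not how any real faithfulness metric is computed.

```python
def toy_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens found in the context.

    Toy stand-in for an LLM judge: like the judge, it only sees the
    retrieved context, never which document is currently in force.
    """
    answer_tokens = answer.lower().split()
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    hits = sum(token in context_tokens for token in answer_tokens)
    return hits / len(answer_tokens)


# Hypothetical document contents, invented for this illustration.
documents = {
    "Policy_Returns_v12": "returns are accepted within 30 days of purchase",
    "Policy_Returns_v13": "returns are accepted within 14 days of purchase",
}
CURRENT_POLICY = "Policy_Returns_v13"  # known to a human, not to the judge

retrieved = "Policy_Returns_v12"  # the retriever picked the superseded version
answer = "returns are accepted within 30 days"

score = toy_faithfulness(answer, documents[retrieved])
grounded_in_current = retrieved == CURRENT_POLICY

print(f"faithfulness vs retrieved context: {score:.2f}")
print(f"retrieved document is the current policy: {grounded_in_current}")
```

The score is perfect because the answer matches the retrieved text. The fact that makes the answer wrong (which version is in force) lives in `CURRENT_POLICY`, a variable the scoring function never receives.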
Four Conditions for Legible Recursion
1. Written criteria the user of the metric can read
The judge prompt must be inspectable by whoever consumes the dashboard. Anthropic publishes the constitution. Vendor stacks do not publish the eval prompts.
2. Explicit acknowledgment that the metric is a model judgment
The dashboard must mark "groundedness 92%" as model opinion, not measurement. Current UIs do the opposite.
3. Periodic out-of-distribution calibration by humans holding the criterion
Calibration samples must be drawn outside the loop, with explicit attention to confident wrong answers that the judge rated highly. The calibrators cannot be the same humans who tuned the eval prompt.
4. Refusal to treat the meta-evaluator as authoritative
This applies to any question whose answer determines whether the meta-evaluator itself is functioning. The judge cannot adjudicate its own calibration.
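One way the four conditions might look in an evaluation pipeline is sketched below. Every name here (`JudgedItem`, `calibration_sample`, `confident_error_rate`, the 0.9 threshold, the sample data) is a hypothetical illustration, not a proposal for a specific tool: the record carries its criteria and is labelled as a model judgment (conditions 1 and 2), highly rated items are sampled for independent human review (condition 3), and the error rate is computed from human labels only, never from the judge (condition 4).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedItem:
    item_id: str
    judge_score: float             # model opinion in [0, 1]
    judge_prompt: str              # condition 1: criteria travel with the score
    kind: str = "model_judgment"   # condition 2: never labelled a measurement


def calibration_sample(items, threshold=0.9, k=3, seed=0):
    """Pick up to k items the judge rated highly, for human review."""
    confident = [item for item in items if item.judge_score >= threshold]
    rng = random.Random(seed)
    return rng.sample(confident, min(k, len(confident)))


def confident_error_rate(sample, human_labels):
    """Share of highly rated items that independent humans marked wrong.

    human_labels maps item_id -> True (correct) or False (wrong) and must
    come from reviewers outside the loop, not from the judge itself.
    """
    if not sample:
        return 0.0
    wrong = sum(1 for item in sample if not human_labels[item.item_id])
    return wrong / len(sample)


# Hypothetical data for illustration.
prompt = "Rate how well the answer is supported by the context."
items = [
    JudgedItem("a", 0.95, prompt),
    JudgedItem("b", 0.97, prompt),
    JudgedItem("c", 0.40, prompt),
    JudgedItem("d", 0.92, prompt),
]
human_labels = {"a": True, "b": False, "d": True}

sample = calibration_sample(items)
rate = confident_error_rate(sample, human_labels)
print(f"confident-error rate on {len(sample)} sampled items: {rate:.2f}")
```

The design choice worth noting is that the judge appears only as a source of scores to be audited. Whether its high scores can be trusted is answered from `human_labels`, which it does not produce.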
Possible Venues
Big Data and Society
Constitutive critique of measurement infrastructure. Sociotechnical fit. Engages the named tools as artefacts rather than as black-box utilities.
Information and Organization
Organisational consequences of evaluation tooling. Recent sociomaterial turn fits the LLM-as-judge architecture argument.
AI and Ethics
Faster turnaround, less prestigious. Useful if the timing matters relative to the EU AI Act and AISI evaluation rollouts.
MIT Sloan Management Review
Practitioner crossover possible but defer until SMR queue clears (AI Alibi already targets SMR).
Sequencing Contingencies
If a reviewer surfaces the observability question during OSP or IJMR review and the response from the reviewer-response stub lands well, that exchange becomes the seed of the paper, and the framing sharpens to whatever the reviewer's specific concern was.
Otherwise, the natural window opens after the OSP and IJMR submissions are out and the IfM First Year Conference is complete. At that point the empirical surface is unusually tractable: LangSmith, Arize, Galileo, Braintrust, Patronus, RAGAS, and DeepEval are all named, documented, and instrumentable.
Reviewer Response Stub (Held in Reserve)
A short defensive reviewer response paragraph is drafted and held in the responses-to-reviewers reserve. The point is to have a ready answer if a reviewer asks "what about LLM-as-judge?" or "doesn't observability solve this?" without bolting the argument onto the OSP paper.
Open Questions
- Whether the paper is solo or co-authored. The empirical work benefits from a collaborator with engineering credibility (someone who has shipped LLM evaluation infrastructure). Steven Clarke is plausible. So is a Codebridge or anoma.ly engineering co-author.
- Whether to use the slide-5 Policy_Returns_v12 case as the opening illustration or to lead with the Constitutional AI architecture argument and treat the case as supporting evidence.
- Whether the four conditions for legible recursion are the contribution, or whether the contribution is the architectural diagnosis (recursive proxy as a structural form) and the conditions are scaffolding.
- Whether to engage Anthropic directly. Constitutional AI is the bounded case but the paper argues the broader vendor stack fails the conditions CAI meets. Anthropic may have views.