PSF Literature Synthesis
Vikram Bapat, University of Cambridge · Updated 05 April 2026

Literature Synthesis: The Proxy Seduction Framework

0. Preamble: From EDT to PSF

This synthesis documents an evolving theoretical inquiry rather than a settled framework. The central empirical puzzle: organizations systematically misjudge AI engagement outcomes in ways that resist correction. The gap between expected and actual outcomes is not random noise. It is patterned, persistent, and present even among experienced practitioners with every incentive for accurate self-assessment.

The initial theoretical move, captured in Evaluative Discontinuity Theory (EDT), described this as criteria reconstitution: AI engagement transforms the evaluating organization such that the criteria by which it judges outcomes shift in ways that produce systematic gaps. EDT explained why prospective evaluation is structurally unreliable (Paul's transformative experience) and why reconstituted criteria become institutionally stabilized (Thornton et al., Weick, Argyris).

As evidence accumulated, it pointed toward a more specific account. The criteria, strictly speaking, do not reconstitute. An organization accountable for code robustness remains accountable for code robustness. What shifts is what practitioners at each level attend to when assessing whether that criterion is being met. The attentional account operates across three levels simultaneously. At the individual and cognitive level, Shaw and Nave's cognitive surrender documents how practitioners' evaluation becomes coupled to AI output quality without reliable self-awareness. At the organizational level, market-logic proxy metrics (speed, volume, throughput) become what gets measured, discussed, and rewarded, while professional-logic criteria remain nominally present but operationally recessive. At the field and institutional level, Barnesian performativity (MacKenzie, Callon) constitutes proxy metrics as the legitimate evaluative vocabulary before any organization tests their validity.

The current working account, which this synthesis calls the Proxy Seduction Framework (PSF), describes this as proxy seduction: the process by which proxy metrics become the operative evaluative vocabulary through sincere organizational belief rather than strategic gaming. That sincerity is the mechanism that distinguishes proxy seduction from Goodhart's Law, and it is what the prior literature lacks vocabulary for.

PSF is the present state of theoretical development, not a finalized theory. It should be read as a working account, responsive to evidence already accumulated and open to revision as the empirical phase proceeds. The theoretical architecture (four resources, three evaluative capacity dimensions, three levels) represents the current best understanding of what the data requires. Evidence from the planned empirical work, including approximately 50 interviews with frontline practitioners and boundary activity performers, may sharpen, qualify, or reframe elements of that architecture.

The four resources and the work each currently does:

Resource | Source | Current Function
Transformative experience | Paul (2014) | Explains why practitioners cannot anticipate the attentional substitution before engagement; the criteria that would detect it only exist after the transformation
Form/function ontology | Faulkner and Runde (2019) | Explains why the same AI tool produces different proxy-criterion gaps in different contexts; fertile form enables positioning drift through practice
Institutional logics | Thornton, Ocasio, and Lounsbury (2012) | Explains the displacement mechanism; market logic's metrics are legible, professional logic's criteria are not, and the asymmetry drives proxy substitution
Barnesian performativity | MacKenzie (2006); Callon (2007); Cabantous and Gond (2011) | Explains how field-level discourse (MacKenzie) and its material infrastructure of benchmarks, frameworks, and vendor demonstrations (Callon) constitute proxy metrics as legitimate evaluative categories before any organization tests their validity, with performative effects operating on the evaluation apparatus itself rather than only on organizational practices. Cabantous and Gond supply the three-mechanism taxonomy (conventional, generic, framing) specifying the channels through which Barnesian performativity propagates at the field level

Three evaluative capacity dimensions under investigation: detection (whether practitioners can sense divergence between proxy and criterion), judgment stock (whether practitioners have consequence-built tacit knowledge sufficient to discriminate proxy from criterion), and braking (whether evaluative institutional infrastructure can interrupt displacement once underway).

1. The Proxy Seduction Mechanism

Section Overview

Three philosophers, each operating at the level of the individual agent making evaluative choices, specify what happens before, during, and after proxy seduction takes hold. The logic is sequential: each moment depends on the one before it. Paul explains why the transformation cannot be anticipated. Nguyen explains how simplified metrics colonize reasoning once the practitioner is inside the transformation. Chang explains what is lost when the colonization succeeds. Together they provide the individual-level phenomenology that PSF's organizational-level claims rest on.

PSF's relationship to these three philosophers requires honest accounting. The proxy elevation mechanism is an application of Nguyen's value capture to AI-mediated knowledge work. The anticipation failure is Paul's transformative experience applied to organizational technology engagement. The hard-choice elimination is Chang's framework applied to practitioner decision landscapes. PSF's genuine contributions beyond these philosophers are the evaluative capacity erosion architecture (detection, judgment stock, braking) and the multi-level propagation mechanism (practitioner to organization to field). Neither Nguyen, Paul, nor Chang theorizes these dynamics.

The sequence also has a feedback structure. The loss Chang describes feeds forward into the next engagement cycle: a practitioner whose capacity to recognize hard choices has eroded is even less equipped to anticipate how the next round of AI engagement will transform their evaluative apparatus. The sequence deepens with each cycle.

1.1 Paul, L.A. (2014) Transformative Experience. Oxford University Press.

What Paul argues

Paul challenges the presupposition underlying standard expected utility theory: that rational agents can assess outcomes by projecting utilities and choosing accordingly. The challenge applies specifically to transformative experiences, those that are both epistemically and personally transformative.

An experience is epistemically transformative when you can only know what it is like by having it. Testimony from those who have undergone the experience cannot substitute for first-person acquaintance. An experience is personally transformative when it reshapes preferences, values, and identity in ways the pre-transformation self could not predict or evaluate from inside its current value structure. Parenthood is Paul's central case: epistemically transformative (you cannot know what parenting is like until you are a parent) and personally transformative (it changes what you care about in ways you could not have anticipated).

Paul's core argument: some experiences are doubly transformative. The you who would evaluate the outcome of the decision does not exist until after you have made it and undergone the transformation. The preferences you would need to evaluate the decision correctly are not available pre-transformation. More testimony, more careful analysis, more pilot programs, and better planning cannot solve this because the problem is not informational. It is structural.

The vampire case makes the structure vivid: you are offered the chance to become a vampire. You cannot know what it is to want blood, prefer darkness, and lose human attachments without becoming a vampire. The criteria by which you would judge the decision are the very criteria the transformation would replace. Choosing rationally in the standard sense (projecting expected outcomes and comparing utilities) is unavailable.

What PSF does with it

PSF applies Paul's structure to organizational AI engagement. AI engagement may be transformative in Paul's exact sense: the practitioners and organizational processes that will evaluate outcomes are not the same practitioners and processes that made the engagement decision. The engagement itself reconstitutes what practitioners attend to, what they find salient, and what counts as "good work."

PSF does not use Paul's resolution (treating transformative decisions as choices to value discovery itself). PSF uses Paul's diagnosis: the evaluative criteria needed to detect proxy substitution only emerge, or fail to emerge, through the engagement itself. A software developer who has worked with AI code generation for six months is not the same developer who would have caught the proxy-criterion divergence before engagement began. Anthropic's own engineers self-reported shifting to "70%+ code reviewer/reviser" roles, a transformation they recognized but did not fully anticipate before the engagement began. The detection capacity is retrospectively unavailable.

Critical PSF specification: Paul's framework is individual. PSF extends it to organizational evaluation processes. The practitioner's transformative experience is the micro-level event. What PSF theorizes is the organizational accumulation of those events and the institutional infrastructure (or lack thereof) through which individual transformed judgment aggregates into organizational evaluative capacity or its erosion.

PSF questions that remain open

How robust is the individual-to-organizational extension? Paul theorizes individual transformative experience. PSF applies it to organizations. The extension is defensible (organizational theory routinely aggregates individual-level phenomena) but needs explicit justification. What is lost in translation from person to organization?

Is AI engagement genuinely doubly transformative, or only epistemically transformative? If AI engagement changes what practitioners know but not what they value, the Paul warrant is weaker. The empirical phase should probe for evidence of personal (not just epistemic) transformation: has what you care about in your work changed, or just what you know about your work?

Does the strength of transformative experience vary by domain? Paul's framework does not predict variation. PSF's empirical phase should test whether some domains (software development, customer support, creative writing) produce stronger transformative effects than others, and whether that variation tracks with AI's "fertile form" properties.

Literature move: Borrowed. Paul's epistemological structure is PSF's foundation for why pre-engagement evaluation is structurally unavailable. PSF does not claim to extend Paul theoretically. It borrows her diagnosis and applies it to an organizational phenomenon.

Role in the sequence: Paul occupies the "before" position. She explains why the practitioner enters engagement unable to anticipate how it will reshape what counts as good work. Pre-engagement criteria feel stable. They are not. This is the precondition for everything that follows.

1.2 Nguyen, C.T. (2020, 2021, 2024, 2026) Value Capture

Key works:

Nguyen, C.T. (2020). Games: Agency As Art. New York: Oxford University Press. Ch. 9: "Gamification and Value Capture," pp. 189-215. DOI: 10.1093/oso/9780190052089.003.0009

Nguyen, C.T. (2021). "How Twitter Gamifies Communication." In J. Lackey (ed.), Applied Epistemology. New York: Oxford University Press, pp. 410-436.

Nguyen, C.T. (2024). "Value Capture." Journal of Ethics and Social Philosophy, 27(3). DOI: 10.26556/jesp.v27i3.3048

Nguyen, C.T. (2026). The Score: How to Stop Playing Somebody Else's Game. New York: Penguin Press.

Also: Artificiality Institute podcast, "Metrification" (March 2024). Sean Carroll, Mindscape podcast, Episode 169 (October 2021).

What Nguyen argues

Value capture occurs when an agent's values are rich and subtle (or developing in that direction), the agent enters a social environment that presents simplified (typically quantified) versions of those values, and those simplified articulations come to dominate the agent's practical reasoning. Examples: FitBit step counts, Twitter Likes and Retweets, citation rates, GPA, ranked lists of "best schools."

Why it happens: Simplified value articulations have a "competitive advantage" in practical reasoning. They are clear, portable, and comparable. They function well in both private deliberation ("Am I doing well?") and public justification ("Here is evidence I am doing well"). Richer values are harder to articulate, harder to compare across contexts, and harder to defend in institutional settings that demand legibility.

The autonomy claim: In value capture, the agent outsources a central component of autonomy: the ongoing deliberation over the exact articulation of values. The agent stops adjusting values in light of rich, particular, context-sensitive experience and instead "buys values off the rack." The metrics to which deliberation is outsourced are typically engineered for the interests of some external force (an institution's interest in cross-contextual comprehensibility, quick aggregability, or scale).

The seduction mechanism (from Games, Ch. 9): Games offer "seductive experiences of value clarity." In games, values are clear, achievements are quantifiable and rankable, and the agent knows exactly what they are doing and why. This is unproblematic in a gaming context because it is temporary. Gamification exports this clarity into real-world activities, forcing a singular clarified value system onto domains where values are inherently plural and contested.

The reward: Value capture offers a "delightful reward." Once an agent permits value capture, values become clear, coherent, and actionable. The agent experiences motivational focus and a sense of progress. The capture feels like improvement, not loss.

The group-level extension: Value capture afflicts groups as well as individuals. A philosophy department can be captured by its university's focus on student evaluation scores. Even when a group agrees that they care more about some inchoate value (like fostering curiosity), day-to-day decisions end up driven by whatever clear metrics happen to be on hand.

What PSF does with it

The honest relationship: PSF's proxy elevation mechanism is an application of Nguyen's value capture to AI-mediated knowledge work. The parallels are structural, not approximate:

Nguyen's "seductive experiences of value clarity" = PSF's "irresistible legibility"

Nguyen's "competitive advantage of crisp articulations" = PSF's account of why proxy criteria dominate

Nguyen's "the simplified versions take over" = PSF's proxy elevation

Nguyen's "delightful reward" of post-capture clarity = PSF's observation that proxy seduction feels like progress

Nguyen's "buying values off the rack" = PSF's account of practitioners adopting standardized metrics (speed, volume, certainty) because they are available and defensible, not because they capture what matters

The substitution mechanism is the same phenomenon described in different vocabulary.

What PSF genuinely adds (two contributions Nguyen does not provide):

First, evaluative capacity erosion. Nguyen theorizes value distortion (the wrong values dominate) and notes that the distortion feels good. He stops there. PSF specifies that engagement simultaneously degrades the capacity to detect the distortion, and breaks that degradation into three assessable dimensions (detection, judgment stock, braking). This is not a relabeling of Nguyen. Nguyen has no equivalent claim. For Nguyen, the agent's values are captured. For PSF, the agent's capacity to notice the capture is simultaneously degraded, and that degradation follows a specific architecture.

Second, multi-level propagation mechanism. Nguyen has a brief group-level extension. PSF integrates institutional logics (Thornton, Ocasio, and Lounsbury) and Barnesian performativity (MacKenzie) to trace how proxy elevation propagates from practitioner to organizational to field level, with each level reinforcing the others through feedback loops. That is structural machinery Nguyen does not build.

A third possible contribution, less certain: PSF claims that AI's "fertile form" (Faulkner and Runde) constitutes the proxy metrics through the engagement process itself, rather than the institution pre-specifying them. If this constitutive claim holds, it distinguishes AI-mediated value capture from Nguyen's general case (where the metrics pre-exist the engagement). Whether this distinction is deep or merely contextual is an open question.

Where Nguyen strengthens PSF:

Philosophical grounding for the "competitive advantage" of legible metrics. Nguyen provides the most developed philosophical account of why simplified metrics win in practical reasoning. Having Nguyen's independent philosophical warrant means the claim that proxy elevation is structurally predictable rests on established philosophical argument, not just PSF's assertion.

The autonomy dimension. Nguyen's argument that value capture outsources deliberation connects to PSF's judgment stock erosion. When practitioners stop deliberating over the exact articulation of their evaluative criteria (because the proxy metrics are so clear), the deliberative capacity itself atrophies. Nguyen provides the philosophical warrant for why this atrophy is structurally expected.

The Goodhart's Law differentiation. Nguyen's framework is deeper than Goodhart's Law. Goodhart describes gaming (agents strategically optimizing against a measure). Nguyen describes sincere capture (agents genuinely coming to value the proxy). PSF inherits this distinction. Having Nguyen's independent philosophical account strengthens PSF's insistence that proxy seduction operates through sincere belief, not strategic gaming.

PSF questions that remain open

How deep is the constitutive distinction? PSF claims AI engagement constitutes proxy metrics through the engagement itself, whereas Nguyen's value capture operates on pre-existing institutional metrics. This is potentially PSF's strongest claim to distinctiveness from Nguyen. But it needs honest scrutiny: is the distinction genuinely structural (a different causal pathway), or is it merely contextual (the same mechanism in a different setting)?

Reversibility. Nguyen describes value capture as potentially reversible (game values can be put away, an agent can recognize capture and resist). PSF describes a progressive, self-reinforcing process where the capacity to detect the capture erodes over time. Is AI-mediated value capture structurally less reversible than other forms, and if so, why?

How does "value collapse" map onto PSF's detection dimension? Nguyen's value collapse concept (overly explicit articulations narrow what the agent even considers) may be doing similar work to PSF's claim that evaluative capacity erosion narrows the practitioner's perceptual field. Worth investigating whether Nguyen provides additional precision that PSF should absorb.

The "engineered for external interests" claim. Nguyen argues that captured metrics typically serve some institution's interest in aggregability and control. PSF argues that proxy seduction operates without strategic design. These are not necessarily incompatible (the institution can benefit from the substitution without having engineered it), but the relationship needs clarifying, particularly for the empirical phase where interview data may reveal institutional actors who do deliberately promote proxy metrics.

Literature move: Applied. PSF's proxy elevation mechanism is Nguyen's value capture applied to AI engagement contexts. PSF's genuine contributions beyond Nguyen are the evaluative capacity erosion architecture and the multi-level propagation mechanism. Nguyen functions as the strongest independent philosophical corroboration of the substitution mechanism and as the clearest differentiation from Goodhart's Law (sincere capture, not strategic gaming).

Role in the sequence: Nguyen occupies the "during" position. Once the practitioner is inside the transformation Paul describes, Nguyen's mechanism takes hold. A customer support team using AI sees resolution time drop and ticket volume rise. Both metrics are immediately legible. Whether the resolutions actually address the customer's underlying problem, whether agents are developing the diagnostic judgment to handle cases AI cannot resolve: none of these are as easy to track, so they quietly recede from attention. The capture is sincere. It feels like progress.

1.3 Chang, Ruth (2002, 2017, 2025) Hard Choices and Parity

Key works:

Chang, R. (2002). "The Possibility of Parity." Ethics, 112(4), pp. 659-688.

Chang, R. (2017). "Hard Choices." Journal of the American Philosophical Association.

Chang, R. (2025). "Two Mistakes in AI Design." Oxford Colloquium, February 2025.

What Chang argues

Standard decision theory recognizes three relations between options: better than, worse than, and equally good. Chang argues for a fourth: "on a par." Two options are on a par when neither is better than the other, they are not equally good, but the comparison is not indeterminate either. The options are qualitatively different in ways that resist ranking on a single scale but remain genuinely comparable.

The commitment claim: When alternatives are on a par, external reasons run out. No amount of additional information, analysis, or measurement can resolve the comparison. Resolution requires commitment: an act of will in which the agent stands behind one option and thereby constitutes reasons for choosing it. This act of commitment is not arbitrary (it is responsive to the values at stake) but it is not determined by those values either. The agent must put their will behind the choice.

Why this matters for agency: Chang argues that hard choices are not obstacles to rational decision-making. They are the occasions through which agents forge their evaluative identity. By committing to one option over another when the options are on a par, the agent creates something: a reason that flows from their own agency, not from the external features of the options. Over time, repeated commitment under conditions of parity is how practitioners develop professional judgment, how they become the kind of practitioner they are.

"Two Mistakes in AI Design" (2025): In this Oxford colloquium, Chang argues that AI systems embed two mistakes. First, the values-proxy assumption: AI systems represent human values through non-evaluative proxies (preferences, choices, ratings) and treat those proxies as if they were the values themselves. This assumption is axiologically guaranteed to produce long-term value misalignment, because values and their proxies come apart over time and across contexts. Second, AI systems cannot recognize parity. Because parity requires commitment (an act of will by an agent with evaluative standing), and AI systems lack that standing, AI cannot resolve the comparisons that matter most for evaluative judgment. The more decisions AI resolves through proxy metrics, the fewer occasions remain for the human commitment that builds evaluative capacity.

The "impressive short-term results" argument: Chang argues that AI's short-term proxy results will be impressive precisely because proxies are designed to be measurable and optimizable. The impressiveness is the problem. It guarantees long-term value misalignment because it eliminates the pressure to check whether the proxies track the values they are supposed to represent.

What PSF does with it

PSF uses Chang to specify what evaluative capacity erosion actually destroys at the level of practice. PSF's three dimensions (detection, judgment stock, braking) are abstract. Chang makes them concrete.

Detection fails because proxy metrics have already resolved the comparison. When a developer's merge speed and PR throughput are immediately legible, the question of whether shipping quickly or refactoring for clarity better serves the project's long-term health never surfaces as a question. The metric has already answered it. Detection requires recognizing that a genuine comparison exists. When the metric pre-resolves the comparison, there is nothing to detect.

Judgment stock depletes because the occasions that would build it no longer arise. Judgment develops through repeated confrontation with hard choices under conditions of parity. When proxy metrics eliminate those confrontations (by making one option obviously "better"), the practitioner never exercises the commitment that builds evaluative identity. A writer choosing between AI-generated fluency and the slower, rougher process through which a distinctive voice develops faces a genuine hard choice, but only if they still encounter it as a choice rather than an obvious efficiency gain.

Braking fails because there is nothing registering as a problem to brake against. The practitioner's experience is that decisions are easier, metrics are clearer, and output is higher. All signals are positive. Braking requires a signal that something is wrong. When the wrong thing (proxy substitution) looks like the right thing (productivity improvement), braking has no input.

The pincer with Paul: Chang and Paul create a theoretical pincer. Paul explains why practitioners cannot anticipate the transformation before engagement. Chang explains what the transformation eliminates after engagement. The practitioner enters unable to foresee (Paul) and exits unable to recognize what was lost (Chang). The two frameworks address opposite ends of the same temporal arc.

The design-level complement to Faulkner and Runde: Chang's values-proxy assumption provides philosophical grounding for why AI's "fertile form" systematically produces proxy-criterion divergence. Faulkner and Runde explain that form underdetermines function. Chang explains why the functions AI constitutes will systematically embed proxy values rather than the values those proxies are supposed to represent. The problem is not bad design. It is structural: values and their proxies are different kinds of things, and representing one through the other produces drift that accumulates over time.

PSF questions that remain open

Is Chang's parity concept empirically detectable in interviews? Practitioners may not use the language of parity or describe their experience in Chang's terms. The interview probes need to surface parity indirectly: "Can you describe a recent situation where you had to make a judgment call that no metric could resolve for you?" A practitioner who cannot recall such situations may be reporting their absence, which is the PSF prediction.

Does the disappearance of hard choices vary by seniority? Senior practitioners have more accumulated judgment stock (built through years of pre-AI hard choices). Junior practitioners may never encounter the hard choices that would have built their judgment. PSF predicts different signatures: seniors report that decisions feel easier (Chang territory), juniors report that they were never hard in the first place (a different, possibly more concerning, finding).

How does Chang's values-proxy assumption relate to PSF's "fertile form" claim? Both argue that AI systematically produces proxy-criterion divergence. Chang's argument is philosophical (the assumption is axiologically guaranteed to fail). PSF's argument is organizational (AI's fertile form constitutes proxies through engagement). These may be the same argument at different levels of analysis, or they may be genuinely distinct claims. Worth clarifying.

Chang's "impressive short-term results" and PSF's "self-concealing degradation." These map directly onto each other. Chang provides the philosophical warrant for why short-term proxy success guarantees long-term value misalignment. PSF provides the organizational mechanism through which that guarantee operates. The empirical phase should look for evidence of both: impressive metrics (Chang) combined with invisible erosion (PSF).

Literature move: Borrowed and applied. Chang's parity framework and values-proxy critique are applied to specify what PSF's evaluative capacity erosion actually destroys. PSF does not extend Chang theoretically. It uses her framework to make the abstract architecture of detection, judgment stock, and braking concrete at the level of practitioner experience.

Role in the sequence: Chang occupies the "what is lost" position. Once Nguyen's value capture has colonized reasoning with clear metrics, the conditions under which parity arises disappear. Hard choices vanish not because they are answered but because they are no longer encountered. With them goes the occasion for developing the judgment that would have been built by confronting them. The loss feeds forward into the next cycle (back to Paul): a practitioner whose hard-choice capacity has eroded is even less equipped to anticipate the next round of transformation.

1.4 The Feedback Loop

The Paul-Nguyen-Chang sequence is not linear. It cycles. Chang's output (eroded capacity for hard choices) feeds back into Paul's input (reduced capacity to anticipate the next transformation). Each cycle through engagement deepens the proxy seduction:

Cycle 1: The practitioner enters unable to anticipate (Paul). Proxy metrics colonize reasoning (Nguyen). Hard choices disappear from experience (Chang).

Cycle 2: The practitioner whose hard-choice capacity has eroded is even less equipped to anticipate the next round. The proxy metrics are now the baseline, not a substitution. The absence of hard choices is now normal, not a loss.

Cycle n: The criteria that would have detected the original substitution are no longer in anyone's active repertoire. The proxy has become the criterion, not through strategic choice but through the progressive erosion of the evaluative capacity that would have distinguished them.

This feedback structure is what makes proxy seduction progressive rather than static. It is also what makes it structurally different from Goodhart's Law, where the gaming agent retains the capacity to distinguish the measure from the target and simply chooses to optimize the measure anyway. In proxy seduction, the capacity to make that distinction erodes through the mechanism itself.

Division of labor: Sections 1 and 2

Section 1 provides the individual-level phenomenology. Section 2 (Faulkner and Runde, Thornton et al., MacKenzie) provides the organizational and field-level machinery through which individual-level value capture aggregates into organizational proxy seduction. The division of labor:

Level | Section | What it explains
Individual practitioner: before engagement | 1 (Paul) | Why the transformation cannot be anticipated
Individual practitioner: during engagement | 1 (Nguyen) | How simplified metrics colonize reasoning
Individual practitioner: what is lost | 1 (Chang) | Why hard choices disappear and judgment atrophies
Technology-organizational interface | 2 (Faulkner and Runde) | How form underdetermines function and positioning drifts
Organizational attention and logic | 2 (Thornton et al.) | How logics channel attention and drive asymmetric evaluation
Field-level discourse | 2 (MacKenzie) | How discourse constitutes evaluative criteria before organizations engage

PSF's causal chain runs through both sections: Paul's anticipation failure (1) meets Faulkner and Runde's fertile form (2), which produces Nguyen's value capture (1) channeled by Thornton et al.'s logic asymmetry (2), deepened by Chang's parity elimination (1), and amplified by MacKenzie's performative discourse (2). The individual and organizational levels interleave. Neither section is sufficient on its own.

2. Four Theoretical Resources

PSF borrows from and extends four theoretical resources, each doing bounded work that no other resource does. Remove any one and the explanatory architecture has a gap. This multi-resource design is justified by the phenomenon-based theorizing methodology (Fisher, Mayer, and Morris, 2021): the phenomenon does not sit within any single literature, so the theoretical architecture must draw from multiple literatures as needed.

Paul also appears in Section 1 as the first element of the mechanism sequence and is not treated again here (see Section 1.1). The entries below focus on the bounded work each resource does in the PSF architecture.

2.1 Faulkner, P. and Runde, J. (2019) 'Theorizing the digital object', MIS Quarterly, 43(4), pp. 1279-1302.

What Faulkner and Runde argue

Faulkner and Runde identify a gap in how Information Systems research conceptualizes digital technology. Most IS work jumps from artifacts to human and organizational implications without sufficiently theorizing what digital objects are. Their theory begins from a rigorous ontology of objects and works up from there.

The key distinctions: material objects (physical things with intrinsic properties) and nonmaterial objects (syntactic objects and bitstrings). Digital objects combine material bearers (hardware, servers, physical infrastructure) with nonmaterial content (bitstrings and the syntactic objects they encode). The identity of a digital object, what it is in a social sense, flows not from its intrinsic physical properties but from its social positioning within communities of users and practices.

The form/function distinction: Form is what the technology is: its structure, architecture, capabilities, and properties considered independently of use. Function is the role the technology plays in human activity, what it does in practice. Form underdetermines function. Knowing what GPT-4 can do on a benchmark tells you something about its form. It tells you nothing reliable about its function in any particular organizational context, because function emerges through the constitutive acts by which humans incorporate the technology into their practices.

Identity through social positioning: An MRI scanner acquires the social identity "MRI scanner" by being positioned within a system (a hospital, a radiology department, a diagnostic protocol) such that it occupies a social position with associated system functions. If the same device were positioned differently, it would have a different social identity and different system functions. Most human artifacts are designed with their intended position in mind, so there is usually a reasonable fit between intrinsic capacities and intended system functions. But repositioning is possible, and repositioning changes identity and function.

The fertility implication: Digital objects, AI systems in particular, have what might be called fertile form: their intrinsic capabilities support a wide and indeterminate range of possible system functions. The same underlying model can be positioned as a coding assistant, a writing tool, a decision-support system, a customer service agent, or an evaluation mechanism, and each positioning constitutes a different function. This fertility means that the range of possible functions is wider for AI than for most prior technologies, and that positioning drift through practice is more consequential.

What PSF does with it

Faulkner and Runde explain why the same AI tool produces different proxy-criterion gaps in different organizations. Function is not fixed in the technology. It emerges through positioning practices. When an organization positions AI as a productivity tool (in the market logic sense: throughput, speed, cost reduction), it constitutes functions that make market-logic metrics legible and professional-logic criteria invisible. That constitutive act is not irreversible in principle but is resistant to revision in practice because accumulated competence and institutional routines build around the constituted function.

PSF specifically uses the repositioning possibility to explain how proxy substitution deepens. Initial positioning as a coding assistant constitutes speed-of-output as a legible metric. Practice then drifts: developers prompt for code they review rather than write, which constitutes a different function (production-to-evaluation shift, per Simkute et al., 2025). The organization has not explicitly chosen to reposition the tool. Positioning has drifted through practice. The new function has different proxy-criterion relationships, but the evaluation infrastructure has not updated because the repositioning was not deliberate and therefore not visible to evaluation.

Leonardi (2011) specifies the temporal dimension of the underdetermination. Technologies simultaneously afford and constrain, and which affordances are realized depends on the routines practitioners bring to the engagement. As routines shift through repeated use, different affordances become salient, and the technology's organizational function changes without any deliberate decision to reposition it.

Leonardi and Leavell (2026) provide the cleanest empirical illustration. Two urban planning organizations used the same AI simulation tool but positioned it differently, constituting different functions. One maintained provisionality, treating AI simulation outputs as provisional inputs to planning decisions that still required professional judgment and stakeholder deliberation. The other produced what Leonardi and Leavell term "artificial certainty," presenting simulations as authoritative predictions. The same technical form, positioned differently, produced different patterns of divergence between what the organizations measured and what they were accountable for. The positioning can drift through accumulated use without any deliberate organizational decision, as practitioners habituate to new workflows and metrics consolidate around observable outputs.

Critical PSF specification: Faulkner and Runde do not discuss AI's particular form-fertility. PSF extends their framework by noting that AI's wide constitutive range makes positioning drift more consequential and less visible than with prior digital technologies. A word processor's form is sufficiently constrained that positioning drift is limited. AI's form is sufficiently open that practitioners can engage with fundamentally different functions using the same tool, on the same day, without recognizing the shift.

Connection to Section 1 (Nguyen): Faulkner and Runde's fertile form is the technology-level precondition for Nguyen's value capture. The form constitutes the proxy metrics (speed, volume, certainty) that then colonize practical reasoning through the competitive advantage Nguyen describes. Without fertile form, the proxies would need to be pre-specified by the institution (as in Nguyen's general case). With fertile form, the proxies are constituted through the engagement itself, which is potentially PSF's strongest claim to distinctiveness from Nguyen's general value capture framework.

Connection to Section 1 (Chang): Faulkner and Runde's form/function underdetermination is the structural reason why AI engagement eliminates hard choices. When the technology's form is open enough to be positioned in multiple ways, the positioning that makes metrics most legible wins (Nguyen's competitive advantage). That positioning resolves comparisons that would otherwise require commitment (Chang's parity). The technology's fertility is what makes the elimination of hard choices systematic rather than incidental.

PSF also borrows the form/function commitment methodologically: the synthesis uses "boundary activity" (function) rather than "bridge actor" (form), following directly from Faulkner and Runde's analytical priority.

PSF questions that remain open

How observable is positioning drift in practice? Faulkner and Runde theorize repositioning as a possibility. PSF claims it happens through practice without deliberate choice. The empirical phase should probe for evidence of drift: has what the tool does in your daily work changed since you started using it, and was that change something you chose or something that happened?

Is "fertile form" an original PSF construct or a straightforward application of Faulkner and Runde? The form/function distinction is theirs. The observation that AI has particularly wide constitutive range is PSF's extension. Whether this extension is a genuine theoretical contribution or simply a contextual observation about one class of digital objects needs honest assessment.

Does form-fertility vary across AI systems? A narrowly trained classification model has less fertile form than a general-purpose LLM. PSF's mechanism should operate more strongly with more fertile tools. The empirical phase could test this if sites vary in the generality of the AI tools they use.

Literature move: Borrowed and extended. PSF borrows the form/function distinction and the social-positioning account of digital object identity. PSF extends by specifying AI's fertile form as a distinct property that makes constitutive drift more consequential than Faulkner and Runde's general framework requires.

Role in the architecture: Faulkner and Runde operate at the technology-organizational interface. They explain why the same tool produces different organizational effects (positioning varies) and why those effects shift without deliberate choice (positioning drifts through practice). In the causal chain, fertile form is the bridge between Paul's anticipation failure (Section 1) and Nguyen's value capture (Section 1): the technology's openness is what constitutes the specific proxy metrics that then colonize reasoning.

2.2 Thornton, P.H., Ocasio, W. and Lounsbury, M. (2012) The Institutional Logics Perspective. Oxford University Press.

What Thornton, Ocasio, and Lounsbury argue

Thornton, Ocasio, and Lounsbury synthesize two decades of institutional logics research into a comprehensive metatheory. An institutional logic is the set of material practices and symbolic systems, including assumptions, values, and beliefs, by which individuals and organizations provide meaning to their daily activity, organize time and space, and reproduce their lives and experiences.

The framework identifies several ideal-typical institutional orders (market, corporation, professions, state, religion, family, community) and specifies the organizing principles each provides. Each logic offers distinct answers to: What is the basis of identity? What is the source of legitimacy? What is the basis of attention? What are the rules for resource allocation?

The critical mechanism: Logics are not chosen. They are inherited. Organizations are embedded in fields already structured by prevailing logics. Through socialization, professional training, industry practices, and institutional pressures, organizations absorb logics before they face any particular decision. By the time an organization evaluates AI, the logics available to it have already constrained what options it can perceive, what criteria it will apply, and what outcomes it will value.

Logics and attention: Building on Ocasio's attention-based view of the firm, Thornton et al. emphasize that logics operate by directing what organizational actors attend to. Market logic focuses attention on efficiency, throughput, and cost metrics. Professional logic focuses attention on craft quality, expertise development, and client outcomes. These attentional structures are not merely preferences. They are infrastructural: they organize what information gets gathered, what metrics get tracked, and what vocabulary is available for articulating evaluation.

Multiple logics and conflict: Most organizations operate under multiple logics simultaneously. A software development organization operates under market logic (profitability, throughput) and professional logic (code quality, engineering judgment, craft). The logics do not fully resolve. They coexist in tension. Which logic dominates resource allocation and evaluation in any given period depends on political, cultural, and institutional processes, not rational optimization.

What PSF does with it

Thornton et al. is PSF's primary theoretical conversation partner and the literature PSF most directly extends. Market logic makes AI-constituted proxy metrics legible: speed of output, volume of code generated, cost per task. Professional logic makes accountable criteria legible: code robustness, maintainability, the quality of judgment exercised in architectural decisions. AI engagement elevates market-logic metrics because the AI's speed and volume make those metrics more salient and more measurable. Professional-logic criteria require tacit judgment built through consequence exposure. They are harder to articulate and harder to measure.

The displacement is not a choice to abandon professional standards. It is an attentional drift: market logic metrics become more prominent, more frequently discussed, more directly tied to resource allocation decisions. Professional logic criteria remain technically available but recede from active evaluative use. PSF calls this salience decay: the criteria that would detect proxy-criterion divergence do not disappear from memory. They disappear from operative relevance.

The market logic/professional logic asymmetry: This is where PSF's "asymmetric ambidexterity" construct originates. Organizations do not fail at both market logic and professional logic evaluation. They succeed visibly at market logic evaluation while professional logic evaluation degrades invisibly. The asymmetry is driven by the differential legibility of the two logics' metrics under AI engagement: AI constitutes market-logic metrics as observable outcomes while rendering professional-logic criteria less accessible.

Performativity connection: Thornton et al. at the field level, combined with MacKenzie's performativity framework and Cabantous and Gond's three-mechanism taxonomy, produces PSF's account of how proxy metrics become institutionally legitimate before any individual organization tests their validity. Field-level discourse (consultant reports, frontier lab announcements, industry benchmarks) circulates market-logic AI metrics, establishing them as the evaluative vocabulary. Organizations inheriting that vocabulary enter engagement with criteria already shaped by the performative process.

Connection to Section 1 (Nguyen): Thornton et al. explain why the specific proxy metrics Nguyen's value capture predicts will dominate are market-logic metrics rather than some other simplified values. Nguyen's framework is general (any simplified metric can capture reasoning). Thornton et al. specify the institutional channel through which particular metrics gain competitive advantage: market logic makes speed, volume, and cost legible, and AI engagement makes market-logic metrics even more legible than they were before.

Connection to Section 1 (Chang): The logic asymmetry explains why hard choices disappear from organizational experience (Chang's territory). When market logic dominates attention, the parity situations Chang describes (where competing professional and market values are genuinely on a par) get resolved by default. The market-logic metric is available, clear, and defensible. The professional-logic criterion is harder to articulate. The hard choice never surfaces because one logic has already won the attention competition.

PSF questions that remain open

Is "salience decay" empirically distinguishable from deliberate deprioritization? An organization might consciously decide to prioritize speed over craft (a strategy choice, not evaluative erosion). The empirical phase needs to distinguish between organizations that have made an explicit trade-off and organizations where professional logic criteria have receded without anyone noticing. The PSF-distinctive finding is the second pattern.

Does the asymmetry hold across domains? PSF's examples are primarily software development (market logic: throughput, professional logic: craft quality). Does the same asymmetry operate in customer support (market logic: resolution speed, professional logic: diagnostic judgment)? In creative work (market logic: output volume, professional logic: originality, voice)? The empirical phase should test across domains.

Can institutional logics explain why some organizations resist proxy seduction? Thornton et al. predict that organizations with dominant professional logics and compatible AI engagement (Besharov and Smith's "aligned" configuration) should show less proxy seduction. If the empirical phase finds such cases, they would support the institutional logics channel. If professional logic dominance does not protect, PSF may need a different account of variation.

How do logics interact with the Paul-Nguyen-Chang sequence? Market logic may accelerate the sequence (making value capture faster and parity elimination more complete). Professional logic may slow it (preserving some hard choices by maintaining competing criteria). The interaction between institutional logics and the individual-level philosophical sequence is an empirical question the interview data can address.

Literature move: Borrowed and extended. Thornton et al. is the literature PSF most directly extends. The extension: the institutional logics perspective explains why certain logics dominate attentional structures. PSF explains how AI engagement systematically elevates one logic (market) while occluding another (professional) through a constitutive process that the dominant logic cannot detect because its metrics are what AI engagement produces.

Role in the architecture: Thornton et al. operate at the organizational attention level. They explain how organizations come to attend to proxy metrics rather than accountable criteria, not through a strategic choice but through the differential legibility of competing logics under AI engagement. In the causal chain, institutional logics are the organizational channel through which individual-level value capture (Nguyen, Section 1) scales: market logic amplifies proxy elevation, and the logic asymmetry ensures the degradation remains invisible.

2.3 MacKenzie, D. (2006), Callon, M. (2007), and Cabantous, L. and Gond, J-P. (2011): Barnesian Performativity

Sources:

MacKenzie, D. (2006) 'Is economics performative? Option theory and the construction of derivatives markets', Journal of the History of Economic Thought, 28(1), pp. 29-55.

Callon, M. (2007) 'What does it mean to say that economics is performative?', in MacKenzie, D., Muniesa, F. and Siu, L. (eds.) Do Economists Make Markets? On the Performativity of Economics. Princeton: Princeton University Press, pp. 311-357.

Cabantous, L. and Gond, J-P. (2011) 'Rational decision-making as performative praxis: Shedding light on the performativity of theory in the social sciences', Organization Science, 22(3), pp. 573-586.

What MacKenzie argues

MacKenzie develops a three-level taxonomy of performativity to explain how economic theories do not merely describe markets but actively reshape them. The taxonomy runs from weakest to strongest claim:

Generic performativity: a theory or model is used in practice and has some effect on the world. The Black-Scholes formula is adopted by traders and influences their behavior. This is the weakest claim: the model is "in the wild" and doing something.

Effective performativity: the model does not just get used. It actively changes economic processes in ways that would not have occurred without it. The claim is stronger: the model is consequential for outcomes.

Barnesian performativity: use of the theory makes the world more closely resemble what the theory describes. Named after sociologist Barry Barnes, who described society as "a distribution of self-referring knowledge substantially confirmed by the practice it sustains." Black-Scholes did not just describe option prices. As traders adopted it, actual option prices converged on what the formula predicted. The theory constituted reality in its own image.

The counter-performativity twist: after the 1987 crash, the Black-Scholes model became less true. Widespread adoption had produced correlated behavior that violated the model's assumptions. The model's very success generated the conditions for its failure.
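For readers unfamiliar with the model both claims turn on, the Black-Scholes price of a European call option is stated below as background only; the argument here does not depend on its details:

```latex
C = S_0\,N(d_1) - K e^{-rT} N(d_2),
\qquad
d_1 = \frac{\ln(S_0/K) + \left(r + \tfrac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}},
\qquad
d_2 = d_1 - \sigma\sqrt{T},
```

where S_0 is the spot price, K the strike, r the risk-free rate, sigma the volatility, T the time to expiry, and N the standard normal CDF. The Barnesian claim is that traders' adoption of the formula moved observed option prices toward C; the counter-performative claim is that adoption-driven correlated behavior later violated the model's own assumptions, such as the constant volatility behind sigma.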

What Callon adds

Callon extends MacKenzie's performativity framework by emphasizing the material and infrastructural conditions that enable performativity. Theories do not perform reality on their own. They perform through socio-technical agencements: the specific configurations of instruments, institutions, practices, and material arrangements that translate theory into action. The implication: performativity is not inevitable. It depends on the strength and stability of the agencements through which theories are enacted.

What Cabantous and Gond add

Cabantous and Gond show that rational decision-making is not merely described by economic theory but performed into existence through specific material and discursive practices. They identify three mechanisms through which performativity operates. Conventional performativity operates through the taken-for-grantedness of metric infrastructure: the tools, templates, and procedures that encode a theoretical framework become so routinized that their status as constructed artifacts disappears from view. Generic performativity operates through active production of the conditions the framework describes: consulting frameworks that measure AI success through throughput metrics produce client organizations that optimize against throughput, confirming the framework's validity in a self-fulfilling loop. Framing performativity operates through the field-level discourse that establishes what counts as evidence before any organization tests the claim operationally.

What PSF does with it

MacKenzie provides the mechanism for field-level proxy constitution. AI discourse does not merely describe what AI does to organizations. At the Barnesian level, it constitutes the evaluative criteria organizations bring to engagement. When frontier labs, consultants, and industry commentators declare that AI increases throughput, reduces latency, and democratizes expertise, organizations absorbing that discourse evaluate their own AI engagement using precisely those criteria. The discourse produces outcomes that confirm it, which generates further discourse of the same kind.

PSF's distinctive contribution to the performativity literature: Standard Barnesian performativity describes how discourse shapes the phenomenon it describes. PSF specifies, through Cabantous and Gond's three-mechanism taxonomy, how the performative effects operate on the evaluation apparatus itself. AI discourse does not just constitute what organizations do with AI. It constitutes the criteria by which organizations judge whether what they are doing is working: the evaluative criteria are themselves products of the conventional, generic, and framing channels operating simultaneously. This is a claim MacKenzie's framework does not make and cannot make without PSF's constitutive mechanism.

Cabantous and Gond's taxonomy gives field-level proxy constitution its concrete carriers. Conventional performativity: vendor benchmarks, analyst reports, and maturity models become the default evaluation vocabulary, their status as constructed artifacts disappearing from view. Generic performativity: consulting frameworks that score AI success on throughput produce client organizations that optimize for throughput, confirming those frameworks in a self-fulfilling loop. Framing performativity: executive and analyst discourse fixes what counts as evidence before any organization runs an operational test. Operating simultaneously, the three mechanisms make field-level proxy seduction resistant to correction from within the field. MacKenzie supplies the constitutive logic (Barnesian performativity); Cabantous and Gond supply the channels through which that logic propagates, mapping onto PSF's account of executive discourse (framing), metric infrastructure (conventional), and consulting frameworks (generic).

The temporal implication: Field-level discourse operates faster than organizational engagement. Consultant reports, benchmark announcements, and frontier lab self-reports circulate proxy metrics as evaluative vocabulary before individual organizations have begun their own engagement. Organizations enter engagement with criteria already shaped by the performative process. Their "pre-engagement criteria" are not genuinely pre-engagement. They are products of the field's constitutive activity.

Counter-performativity analog for PSF: MacKenzie's counter-performativity (widespread adoption destabilizing the model) has an analog in PSF: the proxy-criterion divergence that accumulates through engagement may eventually exceed the field's capacity to suppress it. The Stack Overflow declining trust data, the DORA elite performer findings, and emerging practitioner resistance could be read as early counter-performative signals. The proxy discourse produces engagement patterns that generate empirical anomalies inconsistent with the proxy narrative. Whether those anomalies achieve field-level legibility before institutional lock-in is a question PSF's empirical phase can address.

Callon's agencements for PSF: The material infrastructure through which AI's performative effects operate includes AI benchmarks (which encode market logic, per Haupt and Brynjolfsson, 2025), consultant reports (which frame AI value in efficiency terms), industry conferences (where success narratives gain legitimacy), and vendor demonstrations (which show form performing function in controlled conditions). These are not neutral information channels. They are the material infrastructure through which particular evaluative criteria get constituted as "the way to assess AI."

Connection to Section 1 (Paul): MacKenzie's temporal implication complicates Paul's precondition. Paul argues that pre-engagement criteria are structurally unreliable because the engagement transforms the evaluator. MacKenzie adds that "pre-engagement criteria" were never genuinely pre-engagement in the first place: the field's performative discourse had already shaped them before any particular organization began engaging. The practitioner enters with criteria that feel like their own but are constituted by the field's prior activity.

Connection to Section 1 (Nguyen): MacKenzie explains why the specific proxies that capture practitioner reasoning (Nguyen) are consistent across organizations. Value capture could in principle produce different proxy metrics in different settings. But the field's performative discourse circulates a specific set of metrics (speed, throughput, cost reduction) as the vocabulary of AI success. Organizations and practitioners absorb these before engagement, which is why the same proxy metrics dominate across diverse contexts.

PSF questions that remain open

How do you empirically distinguish performative effects on the evaluation apparatus from ordinary performativity? The claim that discourse shapes evaluation criteria (not just organizational practice), operating through conventional, generic, and framing channels simultaneously, is PSF's strongest contribution to the performativity literature. But it is also the hardest to test. What observable difference would you expect between organizations whose evaluation criteria are shaped by performative discourse and organizations whose criteria are genuinely independent? The empirical phase needs a way to distinguish these.

Is counter-performativity already underway? The practitioner resistance data (Stack Overflow trust decline, DORA anomalies, Koren et al. open-source metrics decoupling) could signal counter-performative dynamics. If so, PSF's framework would predict a period where the proxy narrative and the anomalous evidence coexist uncomfortably, with field-level resolution depending on which achieves institutional legibility first. The empirical phase may be catching this transition in real time.

How does the speed differential (discourse faster than engagement) interact with the Paul-Nguyen-Chang sequence? If discourse has already constituted practitioners' evaluative criteria before engagement begins, the Paul moment (anticipation failure) is deeper than Paul alone predicts. The practitioner cannot anticipate the transformation, and they are also entering with criteria already performatively shaped. The two effects compound.

Can Callon's agencement framework identify specific intervention points? If PSF can specify which material infrastructure elements (benchmarks, conference narratives, vendor demos) most effectively constitute proxy criteria, that specification could inform practical recommendations about where to intervene in the performative circuit.

Literature move: Borrowed and extended. PSF borrows MacKenzie's Barnesian performativity, Callon's agencement framework, and Cabantous and Gond's three-mechanism taxonomy. PSF extends by specifying how performative effects operate on the evaluation apparatus, not just on organizational practice. Cabantous and Gond (2011) showed how economic models perform rational decision-making into existence. PSF extends their taxonomy to a new domain: the AI evaluation ecosystem, where conventional, generic, and framing performativity operate simultaneously to constitute proxy metrics as the legitimate evaluative vocabulary.

Role in the architecture: MacKenzie operates at the field level. He explains how discourse constitutes evaluative criteria before any individual organization engages with AI, and how the resulting consensus amplifies and stabilizes proxy metrics across the field. In the causal chain, MacKenzie's performativity is the outermost layer: it shapes the environment into which organizations enter when they begin engaging, compounds the logic asymmetry Thornton et al. describe, and reinforces the individual-level value capture Nguyen identifies. The feedback loop runs from organizational proxy success (confirmed by Thornton et al.'s market-logic metrics) back into field-level discourse (MacKenzie's performative circuit), which further entrenches the proxy criteria for the next organization to engage.

3. Evaluative Capacity Dimensions

Three evaluative capacity dimensions under investigation: detection (whether practitioners can sense divergence between proxy and criterion), judgment stock (whether practitioners have consequence-built tacit knowledge sufficient to discriminate proxy from criterion), and braking (whether evaluative institutional infrastructure can interrupt displacement once underway).

Each dimension draws on primary and supporting literature. Primary sources do bounded work that no other source does for that dimension. Supporting sources provide texture, empirical grounding, or reviewer-defense capability.

3.1 Detection

Detection is whether practitioners can sense that proxy metrics have diverged from accountable criteria. Detection failure is not ignorance. It is the structural inability to perceive a divergence when the evaluative vocabulary available does not encode the divergence as a recognizable category.

Primary Sources

Shaw, S.D. and Nave, G. (2026) 'Thinking: Fast, slow, and artificial: How AI is reshaping human reasoning and the rise of cognitive surrender', SSRN. DOI: 10.2139/ssrn.6097646.

What Shaw and Nave argue: Shaw and Nave introduce a "Tri-System Theory" that adds System 3 (AI-augmented cognition) to Kahneman's dual-process model. Through a series of controlled experiments, they document "cognitive surrender": the systematic pattern by which human reasoning defers to AI outputs regardless of accuracy. Participants adopted AI outputs on roughly 80% of faulty trials and showed inflated confidence despite errors. Across studies, participant accuracy became a function of AI output quality, a large effect. Access to AI increased confidence by approximately 12 percentage points even when half the outputs were wrong. Shaw and Nave identify a dose-response relationship: as System 3 usage increased, participants' accuracy increasingly tracked AI accuracy rather than reflecting independent judgment. Study 3 tested recalibration conditions. Per-item financial incentives plus immediate correctness feedback caused participants to reject incorrect AI outputs at more than twice the baseline rate. Recalibration worked because three conditions held: feedback was immediate, accuracy signals were unambiguous, and consequences were personal.

What PSF does with it: Shaw and Nave provide the cognitive micro-foundation for judgment stock erosion and detection failure. Cognitive surrender describes how practitioners' criterion-level evaluation becomes coupled to AI-output quality without reliable self-awareness of that dependency, which is the individual-level instantiation of what PSF calls judgment stock erosion. Study 3 provides the basis for rejecting the "practitioners can just notice" objection: recalibration is possible but requires conditions (immediacy, unambiguity, personal consequence) that organizational settings do not provide. This supports PSF's sincere belief claim: practitioners are not choosing to ignore criterion-level evidence; their metacognitive monitoring has been suppressed by the mechanism Shaw and Nave document. The dose-response relationship maps onto engagement depth: lighter users retain more independent judgment while heavy users' accuracy increasingly tracks AI accuracy.

Ocasio, W., Laamanen, T. and Vaara, E. (2018) 'Communication and attention dynamics: An attention-based view of strategic change', Strategic Management Journal, 39(1), pp. 155-167.

What Ocasio et al. argue: Ocasio, Laamanen, and Vaara extend the attention-based view of the firm to incorporate communication as a mechanism of strategic change. Their central insight: organizational attention is not merely influenced by communication but constituted through it. The vocabularies available to managers shape what they can attend to, and therefore what they can perceive as requiring strategic response. Managers "may miss opportunities and threats that they cannot comprehend with existing vocabularies."

What PSF does with it: Ocasio et al. ground the articulation failure dimension of detection. Detection requires not just that practitioners sense proxy-criterion divergence (which tacit knowledge enables) but that they can articulate it in terms that institutional vocabularies can receive. The automation-logic vocabulary offers "productivity," "efficiency," and "time savings." These terms cannot encode "I spent three hours debugging code that looked correct but contained a subtle error I would never have introduced myself." The learning resists institutionally legible form. Practitioners have the experience and attempt to articulate it, but the articulation does not travel up to update institutional expectations because the institutional vocabulary cannot accommodate it. The result is continued engagement under declining trust. This vocabulary gap is what prevents practitioner experience of proxy-criterion divergence from traveling up organizational channels as a signal that would activate the braking function.

Supporting Sources

Weber, K. and Glynn, M.A. (2006) 'Making Sense with Institutions: Context, Thought and Action in Karl Weick's Theory', Organization Studies, 27(11), pp. 1639-1660.

What Weber and Glynn argue: Weber and Glynn address a persistent criticism of Weick's sensemaking theory: that it neglects the role of larger social and historical contexts. They argue that institutions provide the raw materials for sensemaking. When organizational actors face ambiguous situations, they do not construct meaning from scratch. They draw on institutionally provided vocabularies, categories, identities, and scripts. Weber and Glynn identify three mechanisms by which institutions shape sensemaking: priming (makes certain interpretations more accessible), editing (filters out interpretations that violate institutional expectations), and triggering (activates sensemaking when institutional prescriptions conflict or fail).

What PSF does with it: Weber and Glynn supply the attentional mechanism for PSF's displacement process. Priming explains why market-logic metrics surface naturally when practitioners evaluate AI-assisted work: efficiency, throughput, and speed are the categories the institutional environment makes most accessible. Editing explains why professional-logic criteria are filtered out of upward-traveling organizational discourse: they violate the expectations of the automation-logic vocabulary that dominates the field. Triggering explains when the filtering breaks down: when proxy-criterion divergence becomes sufficiently severe to create visible institutional failure. But even triggered sensemaking draws on institutional resources; it does not escape them. PSF uses Weber and Glynn to specify the vocabulary-attention chain: institutional logics supply available vocabularies, vocabularies constrain attention, constrained attention produces institutionally shaped sensemaking, and the chain operates to render proxy-criterion divergence institutionally invisible even when practitioners have tacit experience of it.

Endsley, M.R. (2017) 'From here to autonomy: lessons learned from human-automation research', Human Factors, 59(1), pp. 5-27.

What Endsley argues: Endsley synthesizes decades of research on human-automation interaction. Her situational awareness model distinguishes three levels: perception of relevant elements in the environment (Level 1), comprehension of what those elements mean (Level 2), and projection of future states given current trends (Level 3). Automation typically handles Level 1 (perception) while humans are expected to maintain Level 2 and Level 3 awareness. The problem: Level 2 and 3 situational awareness depends on actively processing Level 1 information. If automation removes the need to perceive, comprehension and projection capacity atrophies.

What PSF does with it: Endsley's situational awareness model maps directly onto PSF's judgment stock dimension: Level 2 and Level 3 situational awareness are the cognitive substrates of proxy-criterion discrimination. A practitioner who can only operate at Level 1 (perceiving AI-generated code as output) but has lost Level 2 (comprehending what the code actually does in context) and Level 3 (projecting whether it will hold up under future conditions) cannot discriminate proxy from criterion. AI engagement that removes active Level 1 construction therefore directly degrades the judgment stock that PSF identifies as the prerequisite for detection.

Endsley, M.R. (2023) 'Situation Awareness in Human-AI Systems', Journal of Cognitive Engineering and Decision Making, 17(2), pp. 87-98.

What Endsley argues: Endsley's updated analysis addresses AI-specific situational awareness challenges. Generative AI creates novel situational awareness problems because AI outputs can be plausible, detailed, and wrong in ways that are not flagged by any system indicator. Traditional automation provides clear status signals (the autopilot is engaged; the alarm has triggered). Generative AI produces fluent prose or code that looks correct while containing subtle errors. There is no status indicator for "this output is subtly wrong." The practitioner must independently evaluate the output, which requires the very situational awareness that AI assistance has not been designed to maintain.

What PSF does with it: Endsley (2023) provides specific support for PSF's detection dimension: the inability to detect proxy-criterion divergence is not just a matter of attentional drift or institutional vocabulary constraints. It is also a situational awareness problem: the outputs that most diverge from criterion-level quality are often the outputs that look most fluent and plausible, which are precisely the outputs that trigger lower-vigilance evaluation. This is why PSF's detection claim is not just organizational but cognitive: even practitioners with sufficient judgment stock may fail to apply it consistently because the proxy metric (output fluency and plausibility) triggers an evaluative mode that is calibrated to the wrong surface.

Messeri, L. and Crockett, M.J. (2024) 'Artificial intelligence and illusions of understanding', Nature.

What Messeri and Crockett argue: AI tools in scientific research generate outputs carrying the formal markers of epistemic insight without the underlying process, producing what the authors term "illusions of understanding." The formal markers (proper citations, professional prose, correctly structured analyses) have been reliable quality signals throughout practitioners' careers.

What PSF does with it: Messeri and Crockett identify a parallel structure to PSF's detection failure in scientific research. Detection difficulty compounds because the formal markers that trigger learned inferences of quality are precisely the markers AI reliably produces. Overriding that learned inference requires active epistemic confidence, the capacity to trust one's own judgment when the system's output projects authority. This is what erodes under sustained engagement.

Orlikowski, W.J. and Gash, D.C. (1994) 'Technological Frames: Making Sense of Information Technology in Organizations', ACM Transactions on Information Systems, 12(2), pp. 174-207.

What Orlikowski and Gash argue: Orlikowski and Gash develop the concept of technological frames to explain how different groups within an organization understand and engage with new technology. Technological frames are the assumptions, expectations, and knowledge people use to understand technology's nature, purpose, and value. Their central finding, drawn from a study of groupware implementation, is that different organizational groups develop different technological frames for the same technology. Technologists understood the groupware through a collaborative work frame. Managers understood it through an efficiency frame. Users understood it through a task-completion frame. Frame incongruence across groups produced implementation problems that no single group could diagnose. Frames also exhibit inertia: initial frames shape what people notice about the technology, which reinforces the frame.

What PSF does with it: Orlikowski and Gash's frame incongruence is the observable symptom that PSF predicts at the meso level: practitioners whose criteria have been constituted through direct AI engagement hold a different frame from managers whose criteria have been constituted by field-level discourse. The inertia property explains why meso-level boundary activity is insufficient to surface proxy-criterion divergence: frame inertia means the manager's automation-logic frame is self-confirming. The interview protocol for PSF's empirical phase should probe not just whether frame incongruence exists but whether the incongruence is itself a signal of proxy seduction.

Goldschmidt, G. (1991) 'The dialectics of sketching', Creativity Research Journal, 4(2), pp. 123-143.

What Goldschmidt argues: Goldschmidt studied how designers actually think while designing. She identifies a cognitive dialectic essential to productive design: the oscillation between "seeing-as" and "seeing-that." Seeing-as is categorical perception: perceiving something as an instance of a category. Seeing-that is noticing properties: perceiving what is actually on the page without immediately categorizing it. Productive design requires continuous oscillation between these modes. Seeing-as without seeing-that becomes rigid. Seeing-that without seeing-as becomes meaningless.

What PSF does with it: Goldschmidt's seeing-as/seeing-that dialectic provides a cognitive mechanism for detection failure: practitioners who have fully constituted the proxy-as-criterion frame (seeing-as: "this output is productive") lose the capacity to notice features that would prompt reinterpretation (seeing-that: "this output looks right but contains a subtle error I would not have introduced"). PSF can invoke this as a cognitive mechanism for detection failure without requiring the full premature arrest architecture.

3.2 Judgment Stock

Judgment stock is the consequence-built tacit knowledge that enables practitioners to discriminate between proxy metrics and accountable criteria. It is built through the feedback loop connecting evaluation to consequences. It is not a stable endowment but a developmental achievement that requires ongoing practice to maintain and extend.

Primary Sources

Beane, M. (2024) The Skill Code: How to Save Human Ability in an Age of Intelligent Machines. New York: Harper Business.

What Beane argues: Beane synthesizes over a decade of ethnographic research across more than 30 occupations and professions to identify what enables human skill development. His framework, the "skill code," identifies three conditions that must be present for novices to develop expertise: Challenge, Complexity, and Connection. Challenge means working near but not beyond the edge of current capability. Skill develops when people struggle with tasks that stretch them. Remove the challenge (by automating difficult tasks or by protecting novices from difficulty), and skill development stalls. Complexity means engaging with the broader system, not just isolated tasks. Connection means relationships of trust and respect between experts and novices. Skill transfer is fundamentally social. Beane's central claim: AI engagement typically degrades all three conditions. AI handles challenging tasks, removing practice opportunities. AI decomposes work into discrete tasks, reducing complexity. AI mediates or replaces expert-novice relationships, severing connection.

What PSF does with it: Beane grounds PSF's judgment stock dimension with the most direct empirical and theoretical account of how that stock is built and how it erodes. The three conditions are the developmental conditions under which consequence-built tacit knowledge (PSF's judgment stock) accumulates. When AI engagement degrades Challenge, Complexity, and Connection, it is not merely a training problem or a skill gap: it is the systematic erosion of the conditions that produce the evaluative capacity needed to detect proxy-criterion divergence. PSF uses Beane to support the temporal claim: proxy seduction does not just displace current criteria, it degrades the pipeline through which the criterion-level judgment that would detect substitution is reproduced across cohorts. The interview protocol should probe Beane's three conditions as organizational indicators of judgment stock health.

Polanyi, M. (1966) The Tacit Dimension. Garden City, NY: Doubleday.

What Polanyi argues: Polanyi developed the concept of tacit knowledge to challenge the prevailing view that all genuine knowledge could be made explicit and formalized. His famous formulation: "we can know more than we can tell." This is not merely a claim about communication difficulty. It is a claim about the structure of knowledge itself. Polanyi distinguishes focal awareness (what we attend to) from subsidiary awareness (what we attend from). When riding a bicycle, we attend to staying balanced and navigating. We attend from our sense of the handlebars, pedals, and our own shifting weight. Critically, if we try to focus on the subsidiary elements, we lose the skill. The expert cyclist who attends to their weight distribution rather than the road ahead will wobble. Tacit knowledge operates in the subsidiary dimension and resists being made focal without destroying the very competence it enables. Tacit knowledge is passed on through apprenticeship and practice, not through instruction manuals.

What PSF does with it: Polanyi grounds PSF's judgment stock dimension. Judgment stock is the accumulated tacit knowledge of practitioners who have lived through the consequences of their evaluations. It is built through the feedback loop connecting evaluation to consequences. When AI engagement shifts practitioners from producing work to reviewing AI-generated work, the subsidiary awareness that develops through production is not developed through review. The practitioner retains focal knowledge (they can articulate the evaluation criteria) while losing the subsidiary feel that makes their evaluation reliable. This is why PSF predicts that proxy seduction deepens as engagement intensifies: the tacit foundation of criterion-level judgment erodes in the subsidiary dimension without registering in any explicit quality metric.

Collins, H. (2010) Tacit and Explicit Knowledge. Chicago: University of Chicago Press.

What Collins argues: Collins extends and systematizes Polanyi's insight by distinguishing types of tacit knowledge. Relational tacit knowledge is knowledge that could in principle be made explicit but has not been. Somatic tacit knowledge is embodied in the body and brain. Collective tacit knowledge is the strongest form. It exists only in communities of practice and can only be acquired through socialization into those communities. The knowledge of how to conduct oneself as a scientist, what counts as an interesting question, a convincing argument, a legitimate method, is collective tacit knowledge. Collins argues we have "no foreseeable way to describe it fully or build machines that possess it." Collective tacit knowledge is not merely difficult to articulate: it is not the kind of thing that can be articulated. It exists in the relations between people, not in any person or document.

What PSF does with it: Collins grounds the organizational dimension of judgment stock in PSF. The proxy-criterion discrimination that PSF identifies as the key evaluative capacity is largely collective tacit knowledge: the standards for distinguishing code robustness from code that merely passes tests exist in professional communities and are acquired through socialization. As AI engagement reduces shared practice, the community interactions through which collective tacit knowledge is sustained and transmitted are weakened. Proxy seduction thus has a second-order effect on judgment stock: it does not merely shift what practitioners attend to, it erodes the community infrastructure through which criterion-level judgment is reproduced.

Supporting Sources

Collins, H. and Kusch, M. (1998) The Shape of Actions: What Humans and Machines Can Do. Cambridge, MA: MIT Press.

What Collins and Kusch argue: Collins and Kusch distinguish two fundamental types of human action. Mimeomorphic actions are those that actors try to carry out "in the same way" across similar situations. The action has a correct form that can be specified, demonstrated, and replicated. Polimorphic actions vary with social context in ways that cannot be fully specified in advance. They require judgment about what counts as "the same situation" and what response is appropriate.

What PSF does with it: The mimeomorphic/polimorphic distinction grounds PSF's account of what is lost when judgment stock erodes. AI-constituted proxy metrics evaluate mimeomorphic surfaces: does the output meet the specified criteria, pass the tests, achieve the throughput targets? The accountable criteria that proxy seduction displaces are largely polimorphic: does the output reflect the kind of situated judgment that distinguishes expert from novice? Proxy metrics can measure the mimeomorphic surface reliably. They cannot detect the erosion of polimorphic capacity because that capacity was never measurable by the instruments that make proxy metrics attractive.

Collins, H. (2018) Artifictional Intelligence: Against Humanity's Surrender to Computers. Cambridge: Polity Press.

What Collins argues: Collins acknowledges that AI is better at mimicking social competence than he initially anticipated, but maintains that without embodiment and membership in a human community, AI cannot possess genuine social tacit knowledge. The machine can produce outputs that look polimorphic without being polimorphic in the generative sense. Collins calls this "faking" socialness: a high-level statistical mimicry that reproduces the shape of social action without participating in the social life that gives that shape its meaning. Collins poses the "behavioral bridge" challenge: if the mimicry is convincing to observers, does the philosophical distinction matter in practice?

What PSF does with it: PSF's contribution is to explain how organizations come to treat mimicry as equivalent to situated judgment through the mechanism of proxy seduction, not through a general evaluative failure. The proxy metric (individual output quality, user satisfaction ratings) registers mimicry as satisfactory because it evaluates the mimeomorphic surface. The accountable criterion (collective diversity, polimorphic contribution to the portfolio) is what erodes. PSF adds to Collins's framework the institutional mechanism through which this failure mode is reproduced and stabilized: once market-logic metrics constitute AI-generated outputs as "productive," professional-logic criteria for detecting the mimicry gap become organizationally illegible.

Dreyfus, H.L. and Dreyfus, S.E. (1986) Mind Over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. New York: Free Press.

What Dreyfus and Dreyfus argue: Dreyfus and Dreyfus argue that human expertise develops through stages that resist mechanization. Beginners follow rules; advanced beginners recognize context-specific features; competent practitioners adopt priorities and plans; proficient practitioners see situations holistically; experts act intuitively from a vast repertoire of experience-based patterns. Crucially, expertise emerges from thousands of hours of situated practice with feedback. It cannot be accelerated through instruction because it is built from the accumulated experience of being wrong and learning from it.

What PSF does with it: Dreyfus and Dreyfus ground the temporal dimension of PSF's judgment stock claim: judgment stock is not a stable endowment but a developmental achievement that requires ongoing practice to maintain and extend. If AI engagement freezes practitioners at intermediate developmental stages by removing the challenge and complexity through which they would advance, then the judgment stock available for proxy-criterion discrimination is permanently capped below the expert level. This supports PSF's cohort claim: organizations that heavily engage AI among early-career practitioners may produce cohorts that never develop the senior-level judgment stock that would make detection of proxy-criterion divergence possible.

Bainbridge, L. (1983) 'Ironies of Automation', Automatica, 19(6), pp. 775-779.

What Bainbridge argues: Bainbridge identified a fundamental paradox in automated systems: automation intended to eliminate human error paradoxically creates new forms of human error. Two core ironies. First, as automation handles routine operations, human operators become monitors rather than active controllers. Their active skills atrophy through disuse. Yet they are expected to intervene competently during rare system failures when their skills have most degraded. Second, designers who automate routine tasks often leave the most difficult tasks to humans, precisely the ones that are most resistant to automation.

What PSF does with it: Bainbridge grounds the judgment stock erosion argument empirically. PSF's claim is that AI engagement degrades the feedback loop through which judgment stock is built and maintained. Bainbridge shows this pattern is well-documented in adjacent domains (industrial control, aviation) before it appears in knowledge work. The irony of automation is a specific form of PSF's mechanism: operators are monitoring the system that has replaced their active skill, but the monitoring does not reproduce the skill because it lacks the consequence-exposure loop that built the skill in the first place.

Beane, M. (2019) 'Shadow Learning: Building Robotic Surgical Skill When Approved Means Fail', Administrative Science Quarterly, 64(1), pp. 87-123.

What Beane argues: Beane's ethnographic study of robotic surgery training reveals how novice surgeons, denied legitimate access to challenging cases by supervision protocols, developed skills "in the shadow" of the system: finding workarounds to get the practice they needed. Protecting novices from challenge also protects them from skill development.

What PSF does with it: Shadow learning is relevant to PSF's empirical design: the interview protocol should probe whether practitioners with high judgment stock in AI-intensive environments maintain their stock through shadow learning activities (side projects, open source work, independent practice) rather than through their primary organizational work. In AI knowledge work, shadow learning is harder to access than in Beane's surgical context because the challenge has been removed at the task level, not just the supervision level.

Friis, O.V. and Riley, J. (2024) 'Automation and the Loss of Competence: Theoretical Perspectives', Journal of Applied Psychology. (In press)

What Friis and Riley argue: Friis and Riley review theoretical foundations for competence loss under automation, identifying three distinct mechanisms: skill degradation through disuse (the use-it-or-lose-it mechanism), skill non-acquisition among new entrants (the never-built-it mechanism), and metacognitive miscalibration (the overconfidence mechanism). These three mechanisms interact: disuse reduces active skill, non-acquisition reduces the cohort baseline, and overconfidence prevents the recognition that skill levels have changed.

What PSF does with it: Friis and Riley's three mechanisms map directly onto PSF's judgment stock erosion across three temporal horizons. Disuse degradation maps onto current practitioners who are losing the feedback loop that maintained their judgment stock. Non-acquisition maps onto entering practitioners who are building judgment stock in AI-intensive environments that never provide the consequence-exposure required for full development. Metacognitive miscalibration maps onto Shaw and Nave's cognitive surrender: practitioners whose judgment stock has degraded do not know it has degraded because the recalibration mechanism (consequence feedback) has been disrupted by AI engagement.

Simkute, A., McAulay, D. and Sellen, A. (2025) 'The Absent Expert: Shifting Roles in AI-Assisted Design', CHI Conference Proceedings.

What Simkute et al. argue: Simkute, McAulay, and Sellen document a systematic shift in designer roles when AI design tools are introduced: from active production to passive evaluation. The shift is rapid and often unreflective. Designers move into evaluation roles without explicit decision or discussion. Production-mode competencies (generative facility, exploratory thinking, material fluency) are exercised less while evaluation-mode competencies (critical assessment, selection criteria, feedback articulation) are exercised more.

What PSF does with it: The production-to-evaluation shift is the practice-level observable through which PSF's judgment stock erosion mechanism operates in knowledge work: practitioners shift from producing to evaluating, and the productive competencies that sustain judgment stock (the tacit feel for craft, the material fluency, the generative exploration) are no longer practiced. The interview protocol should probe this shift as an early observable indicator of judgment stock degradation.

Shen, J.H. and Tamkin, A. (2026) 'How AI impacts skill formation', arXiv.

What Shen and Tamkin argue: In a randomised experiment, AI-assisted developers scored 17% lower on comprehension assessments than unassisted developers. The interaction patterns developers reported as most productive were the ones that prevented learning.

What PSF does with it: Shen and Tamkin quantify the divergence between the proxy (task completion speed) and the criterion (understanding of the code produced). The finding that the most productive-feeling patterns are the most learning-preventing patterns is a direct instantiation of PSF's self-concealing mechanism. Proxy seduction operates through the very practices that feel most effective.

Bastani, H., Bastani, O. and Sungu, A. (2025) 'Generative AI without guardrails: Metacognitive decoupling in AI-assisted learning.'

What Bastani et al. argue: Students using standard ChatGPT scored 17% worse on unassisted exams while reporting confidence in learning that did not occur. The felt-learning proxy and the actual-learning criterion decouple, and the decoupling is invisible to the learner.

What PSF does with it: Bastani et al. confirm the metacognitive dimension of proxy seduction. The confidence-without-competence pattern maps directly onto PSF's detection failure mechanism: practitioners believe they are learning (the proxy) while their unassisted capability erodes (the criterion). This confirms Shen and Tamkin's non-formation finding in a different context.

3.3 Braking

Braking is whether evaluative institutional infrastructure can interrupt proxy displacement once underway. Braking failure means not that organizations lack the formal capacity to intervene, but that the signals that would trigger intervention are themselves products of the proxy evaluation apparatus.

Primary Sources

Argyris, C. (1990) Overcoming Organizational Defenses: Facilitating Organizational Learning. Boston: Allyn and Bacon.

What Argyris argues: Argyris spent decades studying why organizations fail to learn from experience. His answer: defensive routines. Organizations develop systematic practices for avoiding threatening information, protecting existing beliefs, and preventing embarrassment. Defensive routines are "any policy, practice, or action that prevents organizational participants from experiencing embarrassment or threat and, at the same time, prevents them from discovering the causes of the embarrassment or threat." They are doubly dangerous: they prevent learning, and they prevent recognition that learning is being prevented. Argyris identifies "skilled incompetence": the ability to produce precisely the defensive behaviors that prevent learning while believing oneself to be acting rationally and constructively. Defensive routines are self-sealing: attempts to discuss them trigger more defensiveness.

What PSF does with it: Argyris grounds PSF's braking dimension directly. Braking refers to whether organizational evaluative infrastructure can interrupt proxy displacement once underway. Defensive routines are the mechanism through which braking fails: the same organizational infrastructure that would be needed to surface proxy-criterion divergence is captured by the defensive routines protecting the AI investment narrative. PSF's distinctive addition to Argyris is the constitutive mechanism: it is not merely that defensive routines protect bad decisions after the fact, but that AI engagement has already reconstituted what counts as a good decision by constituting proxy metrics as the operative evaluation vocabulary. Argyris's defensive routines then protect the constituted reality, not just a prior choice.

Weick, K.E. (1995) Sensemaking in Organizations. Thousand Oaks, CA: Sage.

What Weick argues: Weick establishes sensemaking as a distinct process by which organizations construct actionable understanding from ambiguous situations. Sensemaking is not decision-making or interpretation. It is the prior process of constructing the situation that will then be interpreted and decided upon. Weick identifies seven properties of sensemaking; the plausibility criterion is the crucial one here: organizations do not have the time or cognitive resources to verify accuracy. They settle for accounts that hang together, fit with what they already believe, and enable action.

What PSF does with it: Weick contributes to PSF's account of why the braking dimension fails to activate. The proxy narrative is plausible: it is consistent with direct observable experience (individual outputs are faster and often cleaner), socially validated within organizations (colleagues report similar improvement), and consistent with field-level discourse (vendor claims, consultant reports, benchmark results). The criterion-level narrative (we are substituting proxy metrics for accountable criteria and our judgment is eroding) is implausible by Weick's criteria: it is not directly observable, not socially validated, and inconsistent with the dominant field narrative. Sensemaking will stabilize the plausible account and suppress the implausible one, which is precisely what PSF predicts: the felt experience of improvement drives the displacement, and the displacement is sustained by sensemaking processes that favor the market-logic account.

March, J.G. (1991) 'Exploration and Exploitation in Organizational Learning', Organization Science, 2(1), pp. 71-87.

What March argues: March identifies a fundamental tension between exploration (developing new knowledge, capabilities, options) and exploitation (refining existing competencies). Both are essential. They compete for scarce resources. Exploitation involves refinement, efficiency, selection, execution. Its returns are relatively certain, proximate, and easy to measure. Exploration involves search, variation, risk-taking, discovery. Its returns are uncertain, distant, and hard to measure. Organizations tend toward exploitation because its returns are more visible and certain. An organization that focuses on exploitation will improve short-run performance but become increasingly obsolete.

What PSF does with it: March is load-bearing in PSF for the asymmetric ambidexterity construct. PSF's claim is that organizations do not fail at both exploitation and exploration. They succeed visibly at exploitation while exploration degrades invisibly. AI engagement constitutes market-logic metrics (exploitation observables) as the evaluative vocabulary while rendering professional-logic criteria (exploration capacity) invisible. March's exploitation/exploration tension provides the organizational learning foundation for this asymmetry: the self-reinforcing property of exploitation means that once market-logic metrics are constituted as the operative vocabulary, the resources, attention, and institutional support that would maintain exploration-oriented professional logic criteria are progressively diverted. PSF uses March to show that asymmetric ambidexterity is not an accident or a manageable tradeoff but a structural tendency of organizational learning that AI engagement amplifies.

Supporting Sources

Levitt, B. and March, J.G. (1988) 'Organizational Learning', Annual Review of Sociology, 14, pp. 319-340.

What Levitt and March argue: Organizations encode lessons from experience into routines, but the encoding process is subject to distortions. Superstitious learning occurs when organizations draw incorrect causal inferences from experience. Competency traps occur when favorable performance with an existing procedure leads to accumulated experience that reinforces commitment, even when a superior alternative exists.

What PSF does with it: Levitt and March's superstitious learning and competency trap mechanisms are the organizational learning pathways through which proxy seduction becomes institutionally locked in. Superstitious learning explains how the proxy narrative gets encoded as organizational knowledge: the organization attributes aggregate output improvements to AI engagement without recognizing that the improvement reflects proxy-metric gains rather than criterion-level outcomes. Competency traps explain why proxy seduction is difficult to reverse even after detection: the organization has built routines, skills, and resource allocation patterns around AI-assisted workflows.

Marquis, C. and Lounsbury, M. (2007) 'Vive la Résistance: Competing Logics and the Consolidation of Community Banking', Academy of Management Journal, 50(4), pp. 799-820.

What Marquis and Lounsbury argue: They provide an empirical demonstration of how competing institutional logics produce divergent organizational responses to identical environmental pressures. Their study of community banking shows that logic prevalence in local contexts predicted outcomes better than economic variables. Two banks with identical financial profiles might have opposite fates depending on which logic dominated their community. Once a logic prevailed, it became self-reinforcing.

What PSF does with it: Marquis and Lounsbury contribute to PSF's account of the braking failure: when proxy seduction has elevated market logic to dominance, professional logic criteria cannot reassert themselves through evidence because logic conflicts resolve through power rather than accuracy. This supports PSF's claim about the structural difficulty of detection: even when organizations have practitioners with sufficient judgment stock to detect proxy-criterion divergence, the institutional political dynamics may prevent the detection from becoming organizationally actionable.

4. Supporting and Contextual Literature

These sources do background, reviewer-defense, or empirical-design work without being primary PSF-load-bearing sources. They are grouped thematically. Entries retain the full "What X argues" treatment for reference value but note their specific PSF function.

4.1 Practice-Constituted Evaluation

Nicolini, D. (2012) Practice Theory, Work, and Organization: An Introduction. Oxford: Oxford University Press.

What Nicolini argues: Nicolini provides a comprehensive synthesis of practice-theoretical approaches to organization. Practices are the fundamental unit of analysis for understanding social and organizational life. They are organized constellations of activities that hang together because they share understandings, rules, and teleoaffective structures. Knowledge exists in practice, not before it. Practices are materially mediated.

What PSF does with it: Nicolini's practice theory contributes the deep ontological grounding for PSF's claim that proxy seduction operates through sincere belief. If evaluation criteria are practice-constituted, then practitioners who have practiced AI-assisted work have genuinely constituted different criteria through that practice. They are not misremembering or strategically misrepresenting their standards; their standards have been reconstituted through the practice itself.

Schatzki, T.R. (2001) 'Introduction: Practice theory', in Schatzki, T.R., Knorr Cetina, K. and von Savigny, E. (eds.) The Practice Turn in Contemporary Theory. London: Routledge, pp. 10-23.

What Schatzki argues: Schatzki identifies three elements that hold a practice together: practical understandings (know-how), rules (explicit formulations), and teleoaffective structures (the shared sense of what the practice is for, what counts as success, what emotions are appropriate). Teleoaffective structures are collective and largely implicit, absorbed through participation rather than taught through instruction.

What PSF does with it: Schatzki's teleoaffective structures provide the practice-theoretical grounding for PSF's salience decay construct. What has changed is not the practical understandings (practitioners can still articulate what good code looks like) or the explicit rules (code review requirements have not changed), but the teleoaffective structure: the shared sense of what software development is for, what counts as satisfying work. When AI engagement shifts this from "build and understand" to "direct and review," the criteria that were operative under the old structure remain articulable but no longer feel like the right way to assess work.

Orlikowski, W.J. (2007) 'Sociomaterial practices: Exploring technology at work', Organization Studies, 28(9), pp. 1435-1448.

What Orlikowski argues: Drawing on Karen Barad's agential realism, Orlikowski takes an onto-epistemological position: the social and material do not interact as separate things. They constitute each other in practice. AI capabilities do not exist apart from organizational practice. There is no "AI in itself" to evaluate. AI only exists as AI-in-practice.

What PSF does with it: PSF uses Faulkner and Runde's form/function distinction rather than Orlikowski's agential realism, because Faulkner and Runde give more analytic traction for specifying what organizations get wrong (form/function conflation) and why (function emerges through practice). Orlikowski's contribution to PSF is primarily to ground the claim that AI-in-practice differs from AI-in-prospect.

Barad, K. (2007) Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham, NC: Duke University Press.

What Barad argues: The observer and observed are not separate things that exist before observation and then come into contact. They come into being together through the act of observing. The measurement apparatus participates in creating what gets measured. Boundaries between observer and observed are not pre-given. They are enacted through "agential cuts."

What PSF does with it: Barad remains relevant as background grounding for the constitutive claims PSF makes, particularly the claim that measuring proxy metrics makes the organization into something that produces proxy-metric results. If reviewers challenge PSF's constitutive claims on ontological grounds, Barad provides the deeper philosophical warrant.

Kellogg, K.C., Valentine, M.A. and Christin, A. (2020) 'Algorithms at Work: The New Contested Terrain of Control', Academy of Management Annals, 14(1), pp. 366-410.

What Kellogg et al. argue: They provide a comprehensive review of algorithmic management, identifying six mechanisms of algorithmic control: Restricting, Recommending, Recording, Rating, Replacing, and Rewarding. Rating algorithms do not merely measure performance: they constitute what "performance" means. Once the algorithm defines performance, alternatives become invisible.

What PSF does with it: Kellogg et al. is load-bearing for the institutional logics displacement mechanism at the organizational level. The six mechanisms describe how AI-mediated algorithmic control reconstitutes what "performance" means through Rating and Recording. The AI tools organizations use track, record, and rate against market-logic metrics (speed, throughput, completion rates), and that tracking constitutes those metrics as the operative definition of performance. This is Barnesian performativity operating at the organizational level.

4.2 Identity and Sensemaking

Albert, S. and Whetten, D.A. (1985) 'Organizational identity', Research in Organizational Behavior, 7, pp. 263-295.

What Albert and Whetten argue: Organizations have self-definitions that function like individual identity, centering on features seen as central, distinctive, and enduring. Identity claims are not mere descriptions but commitments that shape what the organization can perceive and do.

What PSF does with it: Proxy seduction operates through genuine identification with the new metrics, not strategic misrepresentation. Practitioners who have built their professional identity around "building sophisticated systems" may sincerely adopt "managing AI-generated outputs" as equivalent because the identity label persists even as the underlying practice reconstitutes.

Whetten, D.A. (2006) 'Albert and Whetten Revisited: Strengthening the Concept of Organizational Identity', Journal of Management Inquiry, 15(3), pp. 219-234.

What Whetten argues: Identity claims are performative: they do not just describe the organization but partially constitute it by creating accountability structures.

What PSF does with it: Once an organization publicly commits to being "AI-forward" or "AI-native," that identity claim creates accountability structures that resist disconfirming evidence. The braking dimension of evaluative capacity is weakened when identity commitments are at stake.

Gioia, D.A., Schultz, M. and Corley, K.G. (2000) 'Organizational identity, image, and adaptive instability', Academy of Management Review, 25(1), pp. 63-81.

What Gioia et al. argue: Identity maintains apparent continuity through stable labels while the meanings of those labels shift. An organization that has always been "innovative" may mean something quite different by "innovation" now than it did twenty years ago. The label persists; the substance changes. This allows organizations to adapt while maintaining a sense of continuity.

What PSF does with it: Gioia et al.'s adaptive instability is an important PSF mechanism for explaining how proxy seduction persists without detection. PSF's construct of salience decay operates through the adaptive instability mechanism: practitioners keep applying the label "quality" or "good code" while the substantive meaning behind those labels quietly reconstitutes around AI-constituted proxies. The label stability masks the criterion drift. This is not strategic deception; it is the normal adaptive process, operating on evaluative vocabulary. Adaptive instability is one of the micro-mechanisms through which PSF's sincere belief claim is supported.

Nag, R., Corley, K.G. and Gioia, D.A. (2007) 'The intersection of organizational identity, knowledge, and practice: Attempting strategic change via knowledge grafting', Academy of Management Journal, 50(4), pp. 821-847.

What Nag et al. argue: Organizational identity is constituted through knowledge practices, not merely declared through labels. The practices through which an organization enacts its core work are what give meaning to identity claims.

What PSF does with it: Nag et al. establish the identity-practice constitution link that PSF relies on for the salience decay account. When AI engagement changes the practices through which craft is enacted, the identity migrates even as the label holds. A development team that still calls itself "craftspeople" may no longer mean what it originally meant by "craft." From inside the organization, the migration from professional to market logic feels like continuity rather than change. Paired with Gioia et al.'s adaptive instability: the label persists while the practice reconstitutes.

4.3 Boundary Dynamics and Methodology

Barrett, M., Oborn, E., Orlikowski, W.J. and Yates, J. (2012) 'Reconfiguring Boundary Relations: Robotic Innovations in Pharmacy Work', Organization Science, 23(5), pp. 1448-1466.

What Barrett et al. argue: Ethnographic study of pharmaceutical-dispensing robots showing complex, contradictory boundary reconfiguration. The same technology produced different effects for different groups. What elevated pharmacists' clinical role simultaneously deskilled assistants' dispensing work. Boundaries were reconfigured through distributed, emergent, and situational adjustments, not through designated intermediaries.

What PSF does with it: Methodologically foundational for PSF's empirical design through the boundary activity reframing. Asking "who are the bridge actors?" would have missed much of what Barrett observed. Asking "where does boundary activity occur?" captures the distributed, practice-constituted nature of the phenomenon. PSF extends by asking specifically what happens to the proxy-criterion relationship during boundary reconfiguration.

Carlile, P.R. (2004) 'Transferring, Translating, and Transforming: An Integrative Framework for Managing Knowledge Across Boundaries', Organization Science, 15(5), pp. 555-568.

What Carlile argues: Three progressively complex boundary types: syntactic (information transfer suffices), semantic (translation required), and pragmatic (transformation necessary because interests and practices have diverged). Novelty determines which type is operative.

What PSF does with it: Carlile's progression maps onto the depth of proxy-criterion divergence: early in engagement, the gap is semantic and can be surfaced through translation. As engagement deepens and proxy metrics become the constituted vocabulary, the gap becomes pragmatic: surfacing it would require managers to transform their evaluative framework. This is why proxy seduction deepens over time: the boundary conditions for correction become progressively more demanding while the organizational resources for pragmatic boundary work are simultaneously eroding.

Pickering, A. (1995) The Mangle of Practice: Time, Agency, and Science. Chicago: University of Chicago Press.

What Pickering argues: Knowledge emerges through iterative cycles of resistance and accommodation between human intentionality and material agency. In AI contexts, resistance operates bilaterally: AI resists human intention through adaptive outputs, and AI also accommodates human inputs. The human tunes to the AI; the AI tunes to the human. Stabilization is not just difficult but conceptually unclear: what would it mean to "finish tuning" when the thing you are tuning to is itself tuning to you?

What PSF does with it: Pickering's mangle grounds the temporal dimension of proxy seduction. Bilateral tuning means proxy metric constitution is not a one-time event but an ongoing process. As practitioners tune to AI outputs, their sense of what constitutes "good enough" is continuously recalibrated by what AI reliably produces, which is precisely the proxy-metric surface that market logic makes legible.

Oborn, E., Barrett, M., Orlikowski, W.J. and Kim, A. (2019) 'Trajectory Dynamics in Innovation', Organization Science, 30(5), pp. 1097-1123.

What Oborn et al. argue: Innovations are not fixed objects but trajectories, ongoing processes of development and transformation. Four patterns: separation, coordination, diversification, and integration.

What PSF does with it: Organizations where AI engagement follows the integration pattern are most susceptible to PSF's mechanism because integration involves the most thorough reconstitution of evaluative criteria.

Faik, I., Barrett, M. and Oborn, E. (2020) 'How Information Technology Matters in Societal Change: An Affordance-Based Institutional Logics Perspective', MIS Quarterly, 44(3), pp. 1359-1390.

What Faik et al. argue: Logics shape which affordances actors perceive and how they actualize them. They identify three mechanisms: sensegiving, translating, and decoupling.

What PSF does with it: Faik et al.'s affordance-institutional logics integration is the closest existing framework to PSF's account of how market logic constitutes AI's proxy-metric affordances as the operative ones while rendering professional-logic affordances imperceptible. The decoupling mechanism is directly relevant to the Stack Overflow data: practitioners adopting AI while experiencing proxy-criterion divergence they cannot surface institutionally.

Barrett, M. and Orlikowski, W. (2021) 'Scale matters: Doing practice-based studies in the digital world', MIS Quarterly, 45(1b), pp. 467-472.

What Barrett and Orlikowski argue: Digital technologies complicate traditional notions of "the local" because digital practices are inherently multi-scalar.

What PSF does with it: PSF's empirical design needs to trace multi-scalar dynamics: practitioner-level criterion drift, organizational-level logic displacement, field-level Barnesian performativity. A single-site study will miss the field-level dynamics; a field-level study will miss the practitioner-level judgment stock erosion.

4.4 Organizational Learning (Incumbent Frameworks PSF Challenges)

Teece, D.J. (2007) 'Explicating dynamic capabilities: The nature and microfoundations of (sustainable) enterprise performance', Strategic Management Journal, 28(13), pp. 1319-1350.

What Teece argues: The dynamic capabilities framework holds that firms can sense opportunities and threats, seize them, and transform themselves to maintain competitiveness. A key assumption: firms can sense environmental changes accurately.

What PSF does with it: PSF challenges this assumption for transformative AI. If AI engagement transforms sensing capacity itself, the dynamic capabilities framework cannot operate as theorized. The "dynamic capability" organizations believe they are building may be a proxy capability constituted by the engagement rather than a genuine enhancement of evaluative capacity.

Cohen, W.M. and Levinthal, D.A. (1990) 'Absorptive Capacity: A New Perspective on Learning and Innovation', Administrative Science Quarterly, 35(1), pp. 128-152.

What Cohen and Levinthal argue: An organization's ability to recognize the value of new external knowledge, assimilate it, and apply it is a function of its prior related knowledge. Absorptive capacity is path-dependent and cumulative, and it contains an investment paradox: you need absorptive capacity to recognize the value of investing in absorptive capacity.

What PSF does with it: If AI engagement degrades the conditions under which prior related knowledge develops, then absorptive capacity erodes across generations. The self-inflicted lockout dynamic: organizations create the conditions for their own future inability to absorb. The practitioners who would detect proxy-criterion divergence are those with the highest judgment stock, but AI engagement systematically reduces the developmental conditions through which judgment stock builds, so detection capacity degrades across cohorts.

O'Reilly, C.A. and Tushman, M.L. (2013) 'Organizational Ambidexterity: Past, Present, and Future', Academy of Management Perspectives, 27(4), pp. 324-338.

What O'Reilly and Tushman argue: Ambidexterity is the organizational capacity to simultaneously pursue exploitation and exploration. Resource allocation between units requires accurate assessment of relative opportunity.

What PSF does with it: PSF's asymmetric ambidexterity is positioned against this literature. Ambidexterity frameworks cannot solve proxy seduction by better structural design because the measurement problem is upstream of structural choices: organizations cannot balance what they cannot see.

5. Evidence Constellation

PSF operates at the organizational level, explaining why organizations systematically misjudge AI engagement outcomes and fail to self-correct. Most available evidence comes from individuals. This is not a limitation. It is the nature of the phenomenon. What organizations do is pattern individual experiences, stabilize them through routines and identity, and make them invisible through aggregation and sensemaking.

The evidence varies in epistemic status. Direct evidence measures perception-reality gaps with methodological rigor. Mechanism hypotheses are borrowed from Human Factors and remain to be tested. Field-level patterns are visible only across organizations. Gray literature and field notes illustrate the phenomenon the inquiry is trying to explain; they are not evidence for the theory itself.

5.1 Direct Evidence

METR (2025) 'Measuring the Impact of Early-2025 AI Models on Experienced Open-Source Developer Productivity.'

METR conducted a pre-registered randomized controlled trial with 16 experienced open-source developers working on their own repositories. Developers forecast that AI tools would make them 24% faster; even after completing the tasks, they estimated a 20% speedup. They were actually 19% slower. The 39-point gap between post-task self-assessment and measured outcome occurred among practitioners who should have been ideally positioned for accurate self-assessment: experienced, working on their own code, with every reason to assess accurately.

PSF frames METR as the signature instance of proxy seduction: practitioners whose judgment stock has been constituted through years of direct consequence exposure construct plausible accounts of AI-assisted speed (the proxy metric: throughput) while missing the professional-logic costs (actual task completion time). The gap is not random noise; it is what Barnesian performativity produces when the field-level AI productivity narrative has constituted speed-of-output as the operative evaluative category before objective measurement has been applied. This is the anchor finding. If sophisticated practitioners with aligned incentives cannot perceive AI's effects accurately, something structural is happening that individual judgment and experience cannot overcome.

Daniotti, L., Impink, S., Perrone, G., Tangi, L. and Traverso, S. (2026) 'Generative AI and Developer Productivity: Evidence from GitHub Copilot', Science. (In press)

Daniotti and colleagues analyzed 31 million commits from 160,097 developers over an extended period, using a within-developer panel design. Early-career developers used AI-generated code in 37% of their work compared to 27% for senior developers. Yet only senior developers showed productivity gains (6.2% increase). Early-career developers showed no measurable productivity improvement despite substantially higher AI usage.

Daniotti et al. directly support PSF's judgment stock mechanism. Senior developers have higher judgment stock (consequence-built tacit knowledge from years of building and maintaining systems) and can detect when AI-generated code produces the proxy metric (it looks right, passes tests) while missing the criterion (it is subtly wrong in ways that will cause problems). Early-career developers lack the judgment stock to make this discrimination, so they cannot convert AI assistance into criterion-level quality improvement even at higher usage rates. Science publication confirms the pattern at scale with strong methodology.

Dell'Acqua, F., McFowland, E. III, Mollick, E.R., Lifshitz-Assaf, H., Kellogg, K.C., Rajendran, S., Krayer, L., Candelon, F. and Lakhani, K.R. (2026) 'Navigating the jagged technological frontier: Field experimental evidence of the effects of artificial intelligence on knowledge worker productivity and quality', Organization Science, Articles in Advance. DOI: 10.1287/orsc.2025.21838.

758 BCG consultants in a pre-registered experiment. For tasks within AI's capability frontier: substantial gains. For a task outside AI's capability boundary: AI users performed 19 percentage points worse than controls. Consultants could not reliably identify which tasks fell inside versus outside the frontier.

PSF reads the jagged frontier as the spatial structure of the proxy-criterion gap: at any given point in an organization's AI engagement, there is a frontier between domains where AI-constituted proxy metrics track accountable criteria reasonably well and domains where they diverge. The critical finding for PSF is that practitioners cannot perceive this frontier. This is not a knowledge deficit; it is what PSF predicts when market-logic metrics have been constituted as the operative evaluative vocabulary: the criteria that would reveal frontier boundaries are not the criteria being applied. The OS paper reframes Dell'Acqua et al.'s contribution: their framing treats the frontier as a property of the technology rather than asking whether the engagement reshapes the practitioner's capacity to locate the frontier.

Brynjolfsson, E., Li, D. and Raymond, L. (2025) 'Generative AI at Work', Quarterly Journal of Economics, 140(2), pp. 889-942.

5,172 customer service agents, staggered adoption design. Average productivity increased 14%. Novice agents gained 34-35%. Expert agents showed minimal speed gains and slight quality declines. Workers could not revert to pre-engagement performance levels during system outages. High-skill workers increased adherence to AI suggestions even as quality declined.

Brynjolfsson et al. provide three specific PSF contributions. First, the novice-expert inversion supports PSF's judgment stock account: high judgment stock enables expert practitioners to detect quality decline but not to act on the detection because the institutional infrastructure (adherence to AI suggestions) has already been constituted by Barnesian performativity at the organizational level. Second, the irreversibility during outages is evidence of constitutive transformation in Paul's sense. Third, the adherence-increasing-as-quality-declines pattern is direct behavioral evidence of proxy seduction: practitioners are optimizing against the proxy metric (AI adherence rate, which management can observe) while the accountable criterion (resolution quality) erodes unmeasured.

Leonardi, P.M. and Leavell, V. (2026) 'Knowing enough to be dangerous: The problem of "artificial certainty" for expert authority when using AI for decision making and planning', Organization Science, Articles in Advance. DOI: 10.1287/orsc.2023.18224.

Two urban planning organizations used the same AI simulation tool but positioned it differently. One maintained provisionality, treating AI outputs as provisional inputs requiring professional judgment and stakeholder deliberation. The other produced "artificial certainty," presenting simulations as authoritative predictions. The same technical form, positioned differently, produced different patterns of proxy-criterion divergence.

PSF reads Leonardi and Leavell as the cleanest available evidence for Faulkner and Runde's claim that organizational positioning determines which proxies become salient. The provisionality case (Mountain) shows that constraining AI's epistemic authority slows proxy drift. The artificial certainty case shows what happens when no such constraint operates. The positioning can drift through accumulated use without any deliberate organizational decision. This is the most heavily cited new source in the OS Perspectives paper (at least six references).

Bean, A. et al. (2026) 'Human Decision-Making with AI Assistance', Nature.

RCT with 1,298 participants. LLMs alone achieved approximately 95% accuracy. With human users, accuracy dropped to approximately 35%, no better than control. The benchmark proxy (standalone AI accuracy) failed to predict the interactive outcome.

This directly demonstrates that the proxy metric (AI accuracy on benchmarks) diverges from the criterion (human-AI collaborative performance) in exactly the way PSF predicts. Organizations that evaluate AI tools through benchmark accuracy will systematically overestimate their value in interactive use. The gap between standalone AI performance and human-AI interactive performance is one of the starkest quantifications available.
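The selection failure this implies can be sketched with a toy calculation. All figures below are hypothetical illustrations, not data from Bean et al. (the 0.95/0.35 pair merely echoes the magnitudes they report; the second tool is invented for contrast): procurement that ranks tools by the benchmark proxy can select the tool that performs worse on the criterion of interactive use.

```python
# Toy illustration (hypothetical numbers): ranking AI tools by the benchmark
# proxy versus by interactive human-AI outcomes can invert the choice.

tools = {
    # name: (standalone_benchmark_accuracy, human_ai_interactive_accuracy)
    "tool_a": (0.95, 0.35),  # strong benchmark, weak in interactive use
    "tool_b": (0.80, 0.60),  # weaker benchmark, stronger with human users
}

# Select by the proxy metric (what benchmark-based evaluation does).
by_proxy = max(tools, key=lambda name: tools[name][0])

# Select by the criterion (what interactive outcomes would support).
by_criterion = max(tools, key=lambda name: tools[name][1])

print(by_proxy)      # tool_a
print(by_criterion)  # tool_b
```

The inversion is the point: the proxy is not merely noisy, it can be rank-reversing, so no amount of benchmark precision corrects the selection.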

DORA (2025) 'Accelerate State of DevOps Report'. Google Cloud.

AI code assistance was associated with lower perceived productivity, delivery stability, and job satisfaction among elite performers. The pattern inverted the general trend: average performers reported neutral to positive AI experience; elite performers reported negative experience.

DORA supports PSF's claim that practitioners with the highest judgment stock experience the proxy-criterion gap most acutely. Elite performers have the consequence-built tacit knowledge to detect when AI-generated outputs are proxies for quality rather than quality itself. Their negative experience is detection: the evaluative capacity that proxy seduction most threatens is most active in those who have developed it most fully. The DORA canary-in-the-coal-mine pattern is what PSF predicts when detection capacity is concentrated in senior practitioners: the organization loses its most reliable detection capacity first, through dissatisfaction and disengagement, before the proxy-criterion divergence becomes visible in aggregate metrics.

Vendraminelli, L., Morandi, V. and Gruber, M. (2025) 'The GenAI Wall Effect', Harvard Business School Working Paper No. 26-011.

Documents diminishing returns to AI assistance as task complexity increases. For routine tasks, AI provides substantial gains. For complex tasks, gains diminish and can turn negative. The mechanism involves AI's limitations with tasks requiring deep domain expertise, contextual judgment, or integration of diverse information sources.

Vendraminelli et al. map the structural shape of the proxy-criterion gap. The wall effect describes exactly where proxy metrics track accountable criteria (routine, decomposable tasks: both show gains) versus where they diverge (complex, judgment-intensive tasks: proxy metrics may show gains while criterion-level outcomes deteriorate). PSF uses the wall effect to explain why proxy seduction is self-reinforcing: organizations that evaluate AI through market-logic metrics will observe gains across the portfolio (weighted toward routine tasks where the proxy tracks the criterion) while missing that their most complex, judgment-intensive work is deteriorating.

Fernandes, D., Lynch Jr., J.G., Dalton, A.N. and Netemeyer, R.G. (2026) 'AI makes you smarter but none the wiser', Computers in Human Behavior.

Two large-scale studies (N=246, N=452). Task performance improved compared to norms. Participants believed they improved by a larger margin: an overestimation gap. More striking: participants with greater AI literacy were more confident in their judgments but less accurate. Higher AI literacy correlated with lower metacognitive accuracy.

Fernandes et al. directly support PSF's claim that proxy seduction cannot be corrected through training or AI literacy programs. The paradox (higher AI literacy correlates with lower metacognitive accuracy) is what PSF predicts: training increases confidence in applying proxy metrics (participants know more about how to use AI effectively) while eroding the metacognitive monitoring that would detect proxy-criterion divergence.

Workday (2026) 'Beyond Productivity: Measuring the Real Value of AI'. January.

3,200 employees and leaders, cross-industry, all full-time at organizations with over $100M revenue. Over 90% of daily AI users are confident AI will help them succeed, yet only 14% achieve consistently positive net outcomes. 37% of time saved through AI is lost to rework. 89% of organizations report that fewer than half of roles have been updated to reflect AI capabilities.

Workday provides organizational-scale evidence for several PSF mechanisms simultaneously. The 90%-confident versus 14%-positive-outcomes divergence is the Barnesian performativity effect: field-level AI discourse has constituted confidence in AI value as a proxy for AI value itself, producing the belief independently of the outcomes that would justify it. The 37% rework finding is behavioral evidence of the proxy-criterion gap: practitioners perceive time savings (the proxy metric) while downstream quality costs (the accountable criterion) are externalized to rework and not attributed to AI. The 89% unchanged-roles finding is evidence of the form/function gap at organizational scale.
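The rework finding implies a simple net-savings calculation that the proxy metric hides. A minimal illustrative sketch (the 10-hour baseline is a hypothetical number chosen for illustration; only the 37% rework share comes from the Workday report):

```python
# Hypothetical illustration of the Workday (2026) rework finding:
# perceived time savings (the proxy) versus net savings after rework
# (closer to the accountable criterion). The 10-hour figure is invented.
perceived_hours_saved = 10.0   # proxy: hours a practitioner believes AI saved
rework_fraction = 0.37         # Workday: 37% of saved time is lost to rework
net_hours_saved = perceived_hours_saved * (1 - rework_fraction)
print(f"perceived: {perceived_hours_saved:.1f} h, net: {net_hours_saved:.1f} h")
# net comes out to 6.3 h; the remaining 3.7 h surface later as rework
```

The point is not the numbers but the attribution: the lost hours reappear downstream as rework that is not charged back to AI, so the proxy metric never registers them.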

Stack Overflow (2025) 'Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here'.

N=49,000. Adoption rose to 80% while trust in accuracy fell from 40% to 29% year-over-year. Positive favorability dropped from 72% to 60%. The leading frustration (45% of respondents) is AI-generated code that is "almost right, but not quite." Two-thirds report spending more time fixing such code than writing it themselves would have required.

PSF reads the Stack Overflow data as behavior-belief decoupling: practitioners who can articulate the proxy-criterion divergence cannot exit the engagement because institutional infrastructure (organizational mandates, resource allocation decisions, Barnesian performativity of field-level discourse) has already committed to the proxy metrics. The vocabulary gap is the key PSF mechanism here: the "almost right" experience resists institutional legibility because market-logic vocabulary cannot encode it. The frustration is real, the practitioners experience it, but it does not travel up organizational channels as a signal that would interrupt displacement.

Cruces, G., Bergman, D., Morin, L. and Saez, E. (2026) 'AI Tutoring and the Scaffolding Trap: Evidence from Randomized Experiments', NBER Working Paper No. 34851.

Cruces and colleagues ran randomized controlled trials of AI tutoring across multiple educational contexts. Students using AI tutoring achieved 75% of the learning-gap closure produced by human tutoring on immediate post-tests. The gains dissolved without AI support, however: on delayed assessments without AI access, AI-tutored students showed no persistent learning advantage over controls. The learning was scaffolded, not internalized.

Cruces et al. directly support PSF's judgment stock claim with experimental evidence at a different level of analysis (education) that transfers to organizational knowledge work. The 75% gap closure is the proxy metric: students appear to have learned. The dissolution without AI is the criterion-level reality: they have not built durable knowledge structures. For PSF, Cruces et al. provide the sharpest empirical warrant for the claim that proxy seduction does not just displace current criteria but degrades the pipeline through which criterion-level competence develops.

Humlum, A. and Vestergaard, E. (2025) 'The Labor Market Effects of Generative AI', American Economic Review. (In press)

Humlum and Vestergaard analyze a natural experiment in Denmark: the staggered rollout of ChatGPT across firms and industries. Despite high adoption rates across the Danish labor market, they find no detectable impact on wage levels, employment levels, or hours worked. The perception of AI's labor market impact is substantially more positive than the measured impact.

Humlum and Vestergaard provide the field-level evidence for PSF's core claim. The null effect is what PSF predicts when Barnesian performativity has constituted proxy metrics (individual productivity perceptions, immediate output quality, speed of task completion) as the evaluative vocabulary while aggregate criterion-level outcomes (labor market productivity, wage growth) remain unaffected. Proxy seduction at scale: confident practitioners, confident organizations, zero aggregate outcome change.

Gimbel, M., Kinder, M., Kendall, J. and Lee, M. (2025) 'Evaluating the impact of AI on the labor market', Yale Budget Lab.

No significant AI-related labor displacement in occupations with high AI exposure. The gap between the performative frame (AI is displacing workers) and the empirical reality (no detectable displacement) is itself diagnostic.

PSF reads Gimbel et al. as counter-evidence to performative displacement claims. The frame performed headcount reduction into the position of a criterion, and the proxy-criterion divergence arrived on schedule. The finding pairs with IBM's hiring reversal: the CEO claimed AI had replaced several hundred HR employees, yet nine months later IBM tripled entry-level hiring after the cuts to junior roles collapsed its talent pipeline.

Srinivasan, S., Hoffman, M. and Nandkumar, A. (2026) 'AI Engagement and the Labor Economics of Knowledge Work', Harvard Business School Working Paper.

Srinivasan, Hoffman, and Nandkumar argue that the displacement-or-complement question is misspecified because the framing assumes organizations accurately observe which outcome is occurring. Their data show the appearance of complementarity in the short run and displacement signals in the medium run.

Srinivasan et al. lend specific empirical force to PSF's framing against incumbent accounts. Organizations are making workforce decisions using proxy metrics (headcount, output volume, cost per task) that cannot detect the capability-requirement shifts their medium-run data documents. The short-run appearance of complementarity maps onto PSF's temporal structure: proxy seduction produces felt complementarity while criterion-level capability requirements shift in ways not captured by the proxy evaluation apparatus.

5.2 Creativity Studies Cluster

Individual output quality (the proxy) improves while collective diversity (an accountable criterion for organizations whose competitive advantage depends on portfolio distinctiveness) degrades. Market-logic metrics capture the individual improvement. Professional-logic criteria register the collective loss, which is invisible to the evaluation instruments organizations typically use.

Doshi, A.R. and Hauser, O.P. (2024) 'Generative AI enhances individual creativity but reduces the collective diversity of novel content', Science Advances, 10(28).

AI assistance increased individual creativity scores: stories produced with AI assistance were judged as more creative, better written, and more engaging by blind raters. But collective diversity collapsed: the AI-assisted stories were significantly more similar to each other than the unassisted stories. The mechanism: AI models draw on the same training distribution, suggesting similar narrative moves. Each user finds these helpful. The result is individual improvement within a converging distribution. This is the signature proxy-criterion divergence at the collective creative level.

Anderson, J. et al. (2024) 'Homogenization Effects of Large Language Models on Human Creative Ideation', Working Paper.

N=1,100 participants brainstorming solutions to social problems. AI assistance increased fluency and confidence but reduced semantic diversity across participants. The effect was dose-responsive: more AI use, more convergence. The convergence effect was strongest for participants with lower baseline creativity. The dose-response relationship maps onto PSF's engagement intensity prediction: heavier AI engagement produces greater convergence, which produces greater divergence between the individual quality proxy and the collective diversity criterion.

Meincke, F., Collins, H. and Evans, R. (2025) 'Idea Overlap Between AI and Humans', Science Advances, 11(14).

Human ideas showed 100% uniqueness: each human generated ideas no other human had generated. AI ideas showed 94% overlap: 94% of AI-generated ideas were duplicates. The mechanism draws directly on Collins's framework: AI systems generate from the same distributional space, while human ideas draw on individual socialization, embodied experience, and community membership. PSF uses this as the clearest available quantification of proxy-criterion divergence: speed and volume (the proxy) high, distinctiveness (the criterion) at 6%. The 94-percentage-point gap is the sharpest single measurement of that divergence available.

Moon, C., Suh, S. and Lim, J. (2025) 'AI Assistance and Creative Convergence', Management Science. (In press)

Structural mitigations (team diversity, explicit diversity prompts, multiple AI models) reduce but do not eliminate convergence. Even teams explicitly instructed to seek diverse AI perspectives show significantly higher convergence than human-only teams. Supports PSF's claim that proxy seduction is not correctable through interventions organizations would naturally deploy.

De Freitas, J., Henkel, L. and Cikara, M. (2025) 'The Convergence Effect', Psychological Science, 36(4), pp. 489-503.

Observers systematically prefer AI-assisted outputs when evaluating individual pieces but systematically underestimate convergence when evaluating them as a portfolio. The evaluation instruments observers apply to individual pieces are systematically insensitive to portfolio-level convergence. The proxy metric (individual piece quality) is actively misleading about the criterion: observers who apply individual-quality metrics prefer the outputs with the highest portfolio convergence, because the features that make individual outputs attractive are the features convergence optimizes.

5.3 Gray Literature and Field Notes

These function as illustrations of PSF mechanisms operating in real time. They are not evidence for the theory; they are instances of what the theory explains. They are also constituents of the Barnesian performativity process that PSF describes.

Haupt, A. and Brynjolfsson, E. (2025) 'Centaur evaluations', ICML Position Paper.

The dominant evaluation paradigm assesses AI systems as potential replacements for human labor rather than as augmentors. Benchmarks encode automation logic. Organizations using these benchmarks inherit automation logic as their evaluative framework. The instrument constitutes the question. In PSF, Haupt and Brynjolfsson provide evidence of the material agencements (in Callon's sense) through which Barnesian performativity operates on AI evaluation.

Koren, E., Hazan, E. and Bar-Yossef, Z. (2026) 'Vibe Coding Kills Open Source', arXiv:2601.15494.

Downloads (a proxy for adoption) are rising while documentation traffic, issue engagement, and revenue (criteria for ecosystem generativity) are falling. Tailwind CSS: downloads up, revenue down 80%. The download metric now measures something constitutively different from what it measured before AI mediation. Classic PSF: the evaluative vocabulary has not updated, so the divergence is invisible to the instruments maintainers use.

Yegge, S. (2026) 'The Eight Levels of Programmers', Blog post. January.

The "evolution" framing naturalizes evaluative discontinuity. The Level 8 developer is not a better programmer. They are a different kind of worker: a factory manager rather than a craftsperson. Practitioners reading Yegge may aspire to Level 8 without recognizing that achieving it involves abandoning the capabilities they currently value. PSF reads Yegge as the vampire problem in practitioner discourse: the one-way door dressed as a staircase. The framework constitutes "managing AI systems" as the criterion for programmer excellence while rendering professional-logic criteria as anachronistic.

Executive Claims Repository: Twilio CEO, Anthropic Self-Report, IBM, McKinnon/Okta, Katz/NYC H+H, and Others.

A growing collection sharing structural features: optimistic framing, no specification of what was measured or how, unchanged evaluative criteria, and absence of mechanism. When a CEO says "AI makes our developers 30% more productive" without specifying what 30% measures, the statement constitutes "developer productivity" as a category appropriately evaluated through AI-compatible proxy metrics and renders professional-logic criteria invisible. Each claim constitutes proxy metrics as legitimate evaluative categories. The aggregate is a real-time archive of Barnesian performativity in action. PSF reads these as illustrations of performative effects operating on the evaluation apparatus itself: conventional performativity routinizes the metric vocabulary, generic performativity produces the conditions the claims describe, and framing performativity establishes what counts as evidence of AI value.

Raad, D. (2026): Five Mechanisms from Practitioner Experience. CEO of anoma.ly, February 2026. Documents five PSF mechanisms: (1) implementation cost as quality filter, (2) effort substitution from creation to prompting, (3) craftsperson adverse selection, (4) bottleneck displacement from building to evaluating, (5) hidden LLM integration costs. Strong face validity but anecdotal.

Henderson: AI in the Security Domain. Security professional perspective on AI integration risks. The security context amplifies consequences of proxy seduction because the gap between proxy and criterion can produce exploitable vulnerabilities.

5.4 Counterpositions

Positions PSF engages critically. These are genuine interlocutors whose work PSF takes seriously, not strawmen.

Mollick, E.: AI Augmentation Optimism. Primary counterposition. Mollick emphasizes AI as augmenting human capability, treats engagement as straightforwardly beneficial, and interprets productivity data through an optimistic lens. PSF argues this framing misses the transformative-experience dimension: AI engagement changes the evaluator, not just the output. Mollick is a thoughtful interlocutor who updates positions.

Brynjolfsson, E.: Productivity and Complementarity. J-curve, GDP-B, Turing Trap, centaur benchmarks, "canaries" (young worker displacement). Treats the productivity gap as a measurement plus timing problem. PSF divergence: Brynjolfsson assumes organizations will recognize the J-curve dip and invest out. PSF asks what happens when engagement erodes the evaluative capacity to recognize the dip. Strategy-level analysis, not evaluative-capacity-level.

Bailey, D. and Brynjolfsson, E.: AI Productivity Studies. Specific empirical work. PSF accepts the measured effects but questions whether the metrics capture the right things. The disagreement is about scope and interpretation.

Eismann, E.: UX Research and AI Integration. Represents the view that well-managed AI integration is straightforwardly beneficial. PSF argues that even well-managed integration may erode evaluative capacity.

Choudary, S.P.: Platform Dynamics and AI. Platform economics framing may miss the evaluative capacity dimension.

Sziebert (Google Cloud AI). Proxy substitution framed as empowerment; role titles frame displacement as promotion; the interface presupposes the evaluative capacity it requires of users. "18-Month Wall."

Hallowell (LinkedIn, March 2026). Multi-agent personas mask stochastic homogeneity. Emergent sycophancy at system level.

5.5 Corroborating Evidence

OpenAI Usage and Productivity Data. Provider-generated data with clear commercial interests, useful as evidence of proxy metric generation rather than as independent measurement.

Uplevel Developer Productivity Data. Engineering analytics data on developer productivity. Platform-specific measurement: the measurement methodology shapes what counts as productivity.

Massenkoff, M. and McCrory, E.: Labor Market Analysis. Used in "The AI Alibi" to ground workforce displacement claims in labor market data. Captures aggregate effects, not micro-level evaluative capacity erosion.

Stanford Digital Economy Lab: Canaries in the Coal Mine? Early warning indicators in AI-affected labor markets. Integrated into AI Alibi v5. Macro-level economic analysis.

Barcaui (2025). Referenced across three locations in the PSF paper per March 2026 revision notes.

Rabanser et al. (2026): Princeton HAL System Properties Assessment. arXiv 2602.16666. Technical system properties do not equal organizational evaluative capacity. Requires explicit level-mismatch acknowledgment when cited.

Eloundou, T., Manning, S., Mishkin, P. and Rock, D. (2024) 'GPTs are GPTs: Labor market impact potential of large language models', Science, 384(6702).

80% of the US workforce is exposed to LLMs across at least 10% of their tasks. Grounds the substrate breadth argument: AI operates on language, the medium through which most knowledge work is conducted. This breadth distinguishes generative AI from prior automation waves, which operated on narrower task substrates. The breadth is what makes fertile form consequential at scale and what makes proxy seduction a field-level rather than niche phenomenon.

6. Methodological Warrant

Fisher, G., Mayer, K.J. and Morris, S. (2021) 'From the Editors: Making Theory-Empirics Dialogue Work', Academy of Management Review, 46(4), pp. 695-706.

Fisher, Mayer, and Morris introduce phenomenon-based theorizing as a distinct and legitimate approach to theory development. The phenomenon is the starting point: the researcher begins with an empirical puzzle that existing theory cannot adequately explain. Theoretical resources are borrowed from multiple literatures as needed. The contribution is measured by how well the theoretical architecture explains the phenomenon, not by how deeply it extends a single theoretical tradition. The four-resource PSF architecture (Paul, Faulkner and Runde, Thornton et al., MacKenzie/Callon) is justified by this logic: remove any one resource and the explanatory architecture has a gap. Paul explains why pre-engagement evaluation is epistemologically unavailable; Faulkner and Runde explain why the same technology produces different proxy-criterion gaps; institutional logics explains the displacement mechanism; MacKenzie explains how proxy metrics become self-reinforcing. Fisher et al. belong in the cover letter and in the theoretical development section's integration paragraph.

Alvesson, M. and Sandberg, J. (2011) 'Generating Research Questions Through Problematization', Academy of Management Review, 36(2), pp. 247-271. Distinguishes problematization from gap-spotting. PSF's contribution is identifying a shared assumption (evaluative continuity) that existing theories take for granted. Governs the critique sections of the paper.

Elsbach, K.D. and Van Knippenberg, D. (2020). Justifies combining literatures that do not normally speak to each other.

Mayer, K.J. and Sparrowe, R.T. (2013). Approach 4: shared explanatory mechanism across literatures. Specifies how the combination is structured.

Lakatos, I. (1970). Resources must be jointly necessary, individually insufficient, and the combination must generate predictions none could make alone. The falsification conditions in the PSF paper are structured against these criteria.

Whetten, D.A. (1989) 'What Constitutes a Theoretical Contribution?', Academy of Management Review, 14(4). AMR touchstone. PSF maps: what (proxy metrics, evaluative capacity, judgment stock), how (displacement through logic asymmetry), why (sincere belief through constitutive transformation), boundaries (transformative technology, not all technology).

Corley, K.G. and Gioia, D.A. (2011). Both scientific utility and practical utility. PSF's evaluative capacity dimensions generate organizational diagnostics.

Davis, M.S. (1971) 'That's Interesting!' Theories that succeed challenge an assumption. PSF's core move: what seems to be productivity improvement is actually proxy substitution.

Cornelissen, J. (2017). Creative combination of resources that do not normally sit together. The creative combination is what makes PSF harder to position but also what makes it worth reading.

Lomellini: Four Ingredients of Theory Building. Core concepts, linkages, mechanisms, boundary conditions. PSF was stress-tested against this framework. Boundary conditions could use the most explicit treatment in the paper.

7. Cross-Cutting Synthesis

The full causal chain for PSF runs as follows. Field-level AI discourse, operating through material agencements (MacKenzie, Callon) and the three performativity channels Cabantous and Gond identify (conventional, generic, framing), constitutes market-logic proxy metrics as the legitimate vocabulary for evaluating AI engagement before any organization tests their validity. Organizations inheriting this vocabulary enter engagement with criteria already shaped by the performative process (Thornton et al., Ocasio et al.). Within organizations, practitioners experience AI engagement as transformative in Paul's sense: their capacity to evaluate outcomes is restructured by the engagement itself. The technology's fertile form (Faulkner and Runde) enables positioning drift through practice: what began as a productivity tool acquires different functions as practitioners tune to its outputs (Pickering) and as boundary conditions reconfigure (Barrett et al.). The proxy metrics colonize practical reasoning through their competitive advantage in legibility (Nguyen), and the hard choices through which judgment would have been built disappear from experience (Chang). Market logic primes practitioners to attend to proxy metrics (Weber and Glynn, Weick) while professional-logic criteria undergo salience decay (Schatzki). Collective tacit knowledge (Collins, Polanyi) that would enable proxy-criterion discrimination degrades as the developmental conditions that sustain it (Beane, Dreyfus, Endsley) are removed by AI engagement. Organizational defensive routines (Argyris) protect the proxy narrative from disconfirming evidence. March's exploitation bias ensures that visible exploitation gains crowd out invisible exploration losses. The result is asymmetric ambidexterity: organizations succeed visibly at market-logic evaluation while professional-logic evaluative capacity erodes unmeasured.

The boundary condition for PSF (what distinguishes proxy seduction from Goodhart's Law) is sincere belief. The organizational actors in PSF are not gaming the metric. They are using the metric as the criterion because their evaluative framework has been constituted by the engagement to make the proxy metric feel like the criterion. This is what Gioia et al.'s adaptive instability explains at the identity level, what Schatzki's teleoaffective structures explain at the practice level, what Nguyen's value capture explains at the philosophical level, what Chang's parity elimination explains at the agency level, and what Shaw and Nave's cognitive surrender explains at the cognitive level. The sincerity is what the incumbent literature lacks vocabulary for and what PSF supplies.

The empirical prediction follows from the full chain: wherever AI engagement is intensive and prolonged, the proxy-criterion gap should widen over time, the detection capacity of practitioners should erode across cohorts, and the organizational braking mechanisms should fail to activate even when individual practitioners with high judgment stock signal concern. METR, Daniotti et al., Bean et al., Brynjolfsson et al., Leonardi and Leavell, DORA, Stack Overflow, Workday, Humlum and Vestergaard, Gimbel et al., and Cruces et al. all provide evidence consistent with these predictions, from the individual to the organizational to the field level.