This synthesis documents an evolving theoretical inquiry rather than a settled framework. The central empirical puzzle: organizations systematically misjudge AI engagement outcomes in ways that resist correction. The gap between expected and actual outcomes is not random noise. It is patterned, persistent, and present even among experienced practitioners with every incentive for accurate self-assessment.
The initial theoretical move, captured in Evaluative Discontinuity Theory (EDT), described this as criteria reconstitution: AI engagement transforms the evaluating organization such that the criteria by which it judges outcomes shift in ways that produce systematic gaps. EDT explained why prospective evaluation is structurally unreliable (Paul's transformative experience) and why reconstituted criteria become institutionally stabilized (Thornton et al., Weick, Argyris).
As evidence accumulated, it pointed toward a more specific account. The criteria, strictly speaking, do not reconstitute. An organization accountable for code robustness remains accountable for code robustness. What shifts is what practitioners at each level attend to when assessing whether that criterion is being met. The attentional account operates across three levels simultaneously. At the individual and cognitive level, Shaw and Nave's account of cognitive surrender documents how practitioners' evaluation becomes coupled to AI output quality without reliable self-awareness. At the organizational level, market-logic proxy metrics (speed, volume, throughput) become what gets measured, discussed, and rewarded, while professional-logic criteria remain nominally present but operationally recessive. At the field and institutional level, Barnesian performativity (MacKenzie, Callon) constitutes proxy metrics as the legitimate evaluative vocabulary before any organization tests their validity.
The current working account, which this synthesis calls the Proxy Seduction Framework (PSF), describes this as proxy seduction: the process by which proxy metrics become the operative evaluative vocabulary through sincere organizational belief rather than strategic gaming. That sincerity is what distinguishes the mechanism from Goodhart's Law, and it is what the prior literature lacks vocabulary for.
PSF is the present state of theoretical development, not a finalized theory. It should be read as a working account, responsive to evidence already accumulated and open to revision as the empirical phase proceeds. The theoretical architecture (multiple theoretical resources, three evaluative capacity dimensions, three levels) represents the current best understanding of what the data requires. Evidence from the planned empirical work, including approximately 50 interviews with frontline practitioners and boundary activity performers, may sharpen, qualify, or reframe elements of that architecture.
The theoretical resources and the work each currently does:
| Resource | Source | Current Function |
|---|---|---|
| Transformative experience | Paul (2014) | Explains why practitioners cannot anticipate the attentional substitution before engagement; the criteria that would detect it only exist after the transformation |
| Form/function ontology | Faulkner and Runde (2019) | Explains why the same AI tool produces different proxy-criterion gaps in different contexts; fertile form enables positioning drift through practice |
| Institutional logics | Thornton, Ocasio, and Lounsbury (2012) | Explains the displacement mechanism; market logic's metrics are legible, professional logic's criteria are not, and the asymmetry drives proxy substitution |
| Performativity (Barnesian economic) | MacKenzie (2006); Callon (2007) | Explains how field-level discourse (MacKenzie) and its material infrastructure of benchmarks, frameworks, and vendor demonstrations (Callon) constitute proxy metrics as legitimate evaluative categories before any organization tests their validity, with performative effects operating on the evaluation apparatus itself rather than only on organizational practices |
| Performativity (organizational-translation) | Cabantous and Gond (2011); Gond, Cabantous, Harding and Learmonth (2016); Cabantous, Gond, Harding and Learmonth (2016) | Supplies the three-mechanism taxonomy (conventional, generic, framing) specifying the channels through which Barnesian performativity propagates at the organizational level; anchors PSF's response to Gond et al.'s call for organizational conceptualizations of how performativity is organized; marks PSF's position against critical performativity |
| Performativity (sociomaterial AI, adjacent register) | Orlikowski and Scott (2014); Scott and Orlikowski (2022, 2023, 2025) | Adjacent work in a different ontological register (agential realism), sharing performative and constitutive commitments while resisting the form/function analytical separation PSF requires; provides empirical precedent (TripAdvisor valuation apparatus) and framework vocabulary (digital undertow, AI-in-the-making) that PSF draws on across the ontological difference |
Three evaluative capacity dimensions under investigation: detection (whether practitioners can sense divergence between proxy and criterion), judgment stock (whether practitioners have consequence-built tacit knowledge sufficient to discriminate proxy from criterion), and braking (whether evaluative institutional infrastructure can interrupt displacement once underway).
Three philosophers, each operating at the level of the individual agent making evaluative choices, specify what happens before, during, and after proxy seduction takes hold. The logic is sequential: each moment depends on the one before it. Paul explains why the transformation cannot be anticipated. Nguyen explains how simplified metrics colonize reasoning once the practitioner is inside the transformation. Chang explains what is lost when the colonization succeeds. Together they provide the individual-level phenomenology that PSF's organizational-level claims rest on.
PSF's relationship to these three philosophers requires honest accounting. The proxy elevation mechanism is an application of Nguyen's value capture to AI-mediated knowledge work. The anticipation failure is Paul's transformative experience applied to organizational technology engagement. The hard-choice elimination is Chang's framework applied to practitioner decision landscapes. PSF's genuine contributions beyond these philosophers are the evaluative capacity erosion architecture (detection, judgment stock, braking) and the multi-level propagation mechanism (practitioner to organization to field). Nguyen, Paul, and Chang do not theorize these dynamics.
The sequence also has a feedback structure. The loss Chang describes feeds forward into the next engagement cycle: a practitioner whose capacity to recognize hard choices has eroded is even less equipped to anticipate how the next round of AI engagement will transform their evaluative apparatus. The sequence deepens with each cycle.
Paul challenges the presupposition underlying standard expected utility theory: that rational agents can assess outcomes by projecting utilities and choosing accordingly. The challenge applies specifically to transformative experiences, those that are both epistemically and personally transformative.
An experience is epistemically transformative when you can only know what it is like by having it. Testimony from those who have undergone the experience cannot substitute for first-person acquaintance. An experience is personally transformative when it reshapes preferences, values, and identity in ways the pre-transformation self could not predict or evaluate from inside its current value structure. Parenthood is Paul's central case: epistemically transformative (you cannot know what parenting is like until you are a parent) and personally transformative (it changes what you care about in ways you could not have anticipated).
Paul's core argument: some experiences are doubly transformative. The you who would evaluate the outcome of the decision does not exist until after you have made it and undergone the transformation. The preferences you would need to evaluate the decision correctly are not available pre-transformation. More testimony, more careful analysis, more pilot programs, and better planning cannot solve this because the problem is not informational. It is structural.
The vampire case makes the structure vivid: you are offered the chance to become a vampire. You cannot know what it is to want blood, prefer darkness, and lose human attachments without becoming a vampire. The criteria by which you would judge the decision are the very criteria the transformation would replace. Choosing rationally in the standard sense (projecting expected outcomes and comparing utilities) is unavailable.
PSF applies Paul's structure to organizational AI engagement. AI engagement may be transformative in Paul's exact sense: the practitioners and organizational processes that will evaluate outcomes are not the same practitioners and processes that made the engagement decision. The engagement itself reconstitutes what practitioners attend to, what they find salient, and what counts as "good work."
PSF does not use Paul's resolution (treating transformative decisions as choices to value discovery itself). PSF uses Paul's diagnosis: the evaluative criteria needed to detect proxy substitution only emerge, or fail to emerge, through the engagement itself. A software developer who has worked with AI code generation for six months is not the same developer who would have caught the proxy-criterion divergence before engagement began. Anthropic's own engineers self-reported shifting to "70%+ code reviewer/reviser" roles, a transformation they recognized but did not fully anticipate before the engagement began. The detection capacity is retrospectively unavailable.
Critical PSF specification: Paul's framework is individual. PSF extends it to organizational evaluation processes. The practitioner's transformative experience is the micro-level event. What PSF theorizes is the organizational accumulation of those events and the institutional infrastructure (or lack thereof) through which individual transformed judgment aggregates into organizational evaluative capacity or its erosion.
How robust is the individual-to-organizational extension? Paul theorizes individual transformative experience. PSF applies it to organizations. The extension is defensible (organizational theory routinely aggregates individual-level phenomena) but needs explicit justification. What is lost in translation from person to organization?
Is AI engagement genuinely doubly transformative, or only epistemically transformative? If AI engagement changes what practitioners know but not what they value, the Paul warrant is weaker. The empirical phase should probe for evidence of personal (not just epistemic) transformation: has what you care about in your work changed, or just what you know about your work?
Does the strength of transformative experience vary by domain? Paul's framework does not predict variation. PSF's empirical phase should test whether some domains (software development, customer support, creative writing) produce stronger transformative effects than others, and whether that variation tracks with AI's "fertile form" properties.
Literature move: Borrowed. Paul's epistemological structure is PSF's foundation for why pre-engagement evaluation is structurally unavailable. PSF does not claim to extend Paul theoretically. It borrows her diagnosis and applies it to an organizational phenomenon.
Role in the sequence: Paul occupies the "before" position. She explains why the practitioner enters engagement unable to anticipate how it will reshape what counts as good work. Pre-engagement criteria feel stable. They are not. This is the precondition for everything that follows.
Nguyen, C.T. (2020). Games: Agency As Art. New York: Oxford University Press. Ch. 9: "Gamification and Value Capture," pp. 189-215. DOI: 10.1093/oso/9780190052089.003.0009
Nguyen, C.T. (2024). "Value Capture." Journal of Ethics and Social Philosophy, 27(3). DOI: 10.26556/jesp.v27i3.3048
Nguyen, C.T. (2021). "How Twitter Gamifies Communication." In J. Lackey (ed.), Applied Epistemology. New York: Oxford University Press, pp. 410-436.
Nguyen, C.T. (2026). The Score: How to Stop Playing Somebody Else's Game. New York: Penguin Press.
Also: Artificiality Institute podcast, "Metrification" (March 2024). Sean Carroll, Mindscape podcast, Episode 169 (October 2021).
Value capture occurs when an agent's values are rich and subtle (or developing in that direction), the agent enters a social environment that presents simplified (typically quantified) versions of those values, and those simplified articulations come to dominate the agent's practical reasoning. Examples: FitBit step counts, Twitter Likes and Retweets, citation rates, GPA, ranked lists of "best schools."
Why it happens: Simplified value articulations have a "competitive advantage" in practical reasoning. They are clear, portable, and comparable. They function well in both private deliberation ("Am I doing well?") and public justification ("Here is evidence I am doing well"). Richer values are harder to articulate, harder to compare across contexts, and harder to defend in institutional settings that demand legibility.
The autonomy claim: In value capture, the agent outsources a central component of autonomy: the ongoing deliberation over the exact articulation of values. The agent stops adjusting values in light of rich, particular, context-sensitive experience and instead "buys values off the rack." The metrics to which deliberation is outsourced are typically engineered for the interests of some external force (an institution's interest in cross-contextual comprehensibility, quick aggregability, or scale).
The seduction mechanism (from Games, Ch. 9): Games offer "seductive experiences of value clarity." In games, values are clear, achievements are quantifiable and rankable, and the agent knows exactly what they are doing and why. This is unproblematic in a gaming context because it is temporary. Gamification exports this clarity into real-world activities, forcing a singular clarified value system onto domains where values are inherently plural and contested.
The reward: Value capture offers a "delightful reward." Once an agent permits value capture, values become clear, coherent, and actionable. The agent experiences motivational focus and a sense of progress. The capture feels like improvement, not loss.
The group-level extension: Value capture afflicts groups as well as individuals. A philosophy department can be captured by its university's focus on student evaluation scores. Even when a group agrees that they care more about some inchoate value (like fostering curiosity), day-to-day decisions end up driven by whatever clear metrics happen to be on hand.
The honest relationship: PSF's proxy elevation mechanism is an application of Nguyen's value capture to AI-mediated knowledge work. The parallels are structural, not approximate:
Nguyen's "seductive experiences of value clarity" = PSF's "irresistible legibility"
Nguyen's "competitive advantage of crisp articulations" = PSF's account of why proxy criteria dominate
Nguyen's "the simplified versions take over" = PSF's proxy elevation
Nguyen's "delightful reward" of post-capture clarity = PSF's observation that proxy seduction feels like progress
Nguyen's "buying values off the rack" = PSF's account of practitioners adopting standardized metrics (speed, volume, certainty) because they are available and defensible, not because they capture what matters
The substitution mechanism is the same phenomenon described in different vocabulary.
PSF makes two contributions beyond Nguyen. First, evaluative capacity erosion. Nguyen theorizes value distortion (the wrong values dominate) and notes that the distortion feels good. He stops there. PSF specifies that engagement simultaneously degrades the capacity to detect the distortion, and breaks that degradation into three assessable dimensions (detection, judgment stock, braking). This is not a relabeling of Nguyen. Nguyen has no equivalent claim. For Nguyen, the agent's values are captured. For PSF, the agent's capacity to notice the capture is simultaneously degraded, and that degradation follows a specific architecture.
Second, multi-level propagation mechanism. Nguyen has a brief group-level extension. PSF integrates institutional logics (Thornton, Ocasio, and Lounsbury) and Barnesian performativity (MacKenzie) to trace how proxy elevation propagates from practitioner to organizational to field level, with each level reinforcing the others through feedback loops. That is structural machinery Nguyen does not build.
A third possible contribution, less certain: PSF claims that AI's "fertile form" (Faulkner and Runde) constitutes the proxy metrics through the engagement process itself, rather than the institution pre-specifying them. If this constitutive claim holds, it distinguishes AI-mediated value capture from Nguyen's general case (where the metrics pre-exist the engagement). Whether this distinction is deep or merely contextual is an open question.
Philosophical grounding for the "competitive advantage" of legible metrics. Nguyen provides the most developed philosophical account of why simplified metrics win in practical reasoning. Having Nguyen's independent philosophical warrant means the claim that proxy elevation is structurally predictable rests on established philosophical argument, not just PSF's assertion.
The autonomy dimension. Nguyen's argument that value capture outsources deliberation connects to PSF's judgment stock erosion. When practitioners stop deliberating over the exact articulation of their evaluative criteria (because the proxy metrics are so clear), the deliberative capacity itself atrophies. Nguyen provides the philosophical warrant for why this atrophy is structurally expected.
The Goodhart's Law differentiation. Nguyen's framework is deeper than Goodhart's Law. Goodhart describes gaming (agents strategically optimizing against a measure). Nguyen describes sincere capture (agents genuinely coming to value the proxy). PSF inherits this distinction. Having Nguyen's independent philosophical account strengthens PSF's insistence that proxy seduction operates through sincere belief, not strategic gaming.
How deep is the constitutive distinction? PSF claims AI engagement constitutes proxy metrics through the engagement itself, whereas Nguyen's value capture operates on pre-existing institutional metrics. This is potentially PSF's strongest claim to distinctiveness from Nguyen. But it needs honest scrutiny: is the distinction genuinely structural (a different causal pathway), or is it merely contextual (the same mechanism in a different setting)?
Reversibility. Nguyen describes value capture as potentially reversible (game values can be put away, an agent can recognize capture and resist). PSF describes a progressive, self-reinforcing process where the capacity to detect the capture erodes over time. Is AI-mediated value capture structurally less reversible than other forms, and if so, why?
How does "value collapse" map onto PSF's detection dimension? Nguyen's value collapse concept (overly explicit articulations narrow what the agent even considers) may be doing similar work to PSF's claim that evaluative capacity erosion narrows the practitioner's perceptual field. Worth investigating whether Nguyen provides additional precision that PSF should absorb.
The "engineered for external interests" claim. Nguyen argues that captured metrics typically serve some institution's interest in aggregability and control. PSF argues that proxy seduction operates without strategic design. These are not necessarily incompatible (the institution can benefit from the substitution without having engineered it), but the relationship needs clarifying, particularly for the empirical phase where interview data may reveal institutional actors who do deliberately promote proxy metrics.
Literature move: Applied. PSF's proxy elevation mechanism is Nguyen's value capture applied to AI engagement contexts. PSF's genuine contributions beyond Nguyen are the evaluative capacity erosion architecture and the multi-level propagation mechanism. Nguyen functions as the strongest independent philosophical corroboration of the substitution mechanism and as the clearest differentiation from Goodhart's Law (sincere capture, not strategic gaming).
Role in the sequence: Nguyen occupies the "during" position. Once the practitioner is inside the transformation Paul describes, Nguyen's mechanism takes hold. A customer support team using AI sees resolution time drop and ticket volume rise. Both metrics are immediately legible. Whether the resolutions actually address the customer's underlying problem, and whether agents are developing the diagnostic judgment to handle cases AI cannot resolve, are harder to track, so those questions quietly recede from attention. The capture is sincere. It feels like progress.
Chang, R. (2025). "Two Mistakes in AI Design." Oxford Colloquium, February 2025.
Chang, R. (2017). "Hard Choices." Journal of the American Philosophical Association.
Chang, R. (2002). "The Possibility of Parity." Ethics, 112(4), pp. 659-688.
Chang identifies a fourth evaluative relation. Standard decision theory recognizes three relations between options: better than, worse than, and equally good. Chang argues for a fourth: "on a par." Two options are on a par when neither is better than the other and they are not equally good, yet the comparison is not indeterminate either. The options are qualitatively different in ways that resist ranking on a single scale but remain genuinely comparable.
The commitment claim: When alternatives are on a par, external reasons run out. No amount of additional information, analysis, or measurement can resolve the comparison. Resolution requires commitment: an act of will in which the agent stands behind one option and thereby constitutes reasons for choosing it. This act of commitment is not arbitrary (it is responsive to the values at stake) but it is not determined by those values either. The agent must put their will behind the choice.
Why this matters for agency: Chang argues that hard choices are not obstacles to rational decision-making. They are the occasions through which agents forge their evaluative identity. By committing to one option over another when the options are on a par, the agent creates something: a reason that flows from their own agency, not from the external features of the options. Over time, repeated commitment under conditions of parity is how practitioners develop professional judgment, how they become the kind of practitioner they are.
"Two Mistakes in AI Design" (2025): In this Oxford colloquium, Chang argues that AI systems embed two mistakes. First, the values-proxy assumption: AI systems represent human values through non-evaluative proxies (preferences, choices, ratings) and treat those proxies as if they were the values themselves. This assumption is axiologically guaranteed to produce long-term value misalignment, because values and their proxies come apart over time and across contexts. Second, AI systems cannot recognize parity. Because parity requires commitment (an act of will by an agent with evaluative standing), and AI systems lack that standing, AI cannot resolve the comparisons that matter most for evaluative judgment. The more decisions AI resolves through proxy metrics, the fewer occasions remain for the human commitment that builds evaluative capacity.
The "impressive short-term results" argument: Chang argues that AI's short-term proxy results will be impressive precisely because proxies are designed to be measurable and optimizable. The impressiveness is the problem. It guarantees long-term value misalignment because it eliminates the pressure to check whether the proxies track the values they are supposed to represent.
PSF uses Chang to specify what evaluative capacity erosion actually destroys at the level of practice. PSF's three dimensions (detection, judgment stock, braking) are abstract. Chang makes them concrete.
Detection fails because proxy metrics have already resolved the comparison. When a developer's merge speed and PR throughput are immediately legible, the question of whether shipping quickly or refactoring for clarity better serves the project's long-term health never surfaces as a question. The metric has already answered it. Detection requires recognizing that a genuine comparison exists. When the metric pre-resolves the comparison, there is nothing to detect.
Judgment stock depletes because the occasions that would build it no longer arise. Judgment develops through repeated confrontation with hard choices under conditions of parity. When proxy metrics eliminate those confrontations (by making one option obviously "better"), the practitioner never exercises the commitment that builds evaluative identity. A writer choosing between AI-generated fluency and the slower, rougher process through which a distinctive voice develops faces a genuine hard choice, but only if they still encounter it as a choice rather than an obvious efficiency gain.
Braking fails because there is nothing registering as a problem to brake against. The practitioner's experience is that decisions are easier, metrics are clearer, and output is higher. All signals are positive. Braking requires a signal that something is wrong. When the wrong thing (proxy substitution) looks like the right thing (productivity improvement), braking has no input.
The pincer with Paul: Chang and Paul create a theoretical pincer. Paul explains why practitioners cannot anticipate the transformation before engagement. Chang explains what the transformation eliminates after engagement. The practitioner enters unable to foresee (Paul) and exits unable to recognize what was lost (Chang). The two frameworks address opposite ends of the same temporal arc.
The design-level complement to Faulkner and Runde: Chang's values-proxy assumption provides philosophical grounding for why AI's "fertile form" systematically produces proxy-criterion divergence. Faulkner and Runde explain that form underdetermines function. Chang explains why the functions AI constitutes will systematically embed proxy values rather than the values those proxies are supposed to represent. The problem is not bad design. It is structural: values and their proxies are different kinds of things, and representing one through the other produces drift that accumulates over time.
Is Chang's parity concept empirically detectable in interviews? Practitioners may not use the language of parity or describe their experience in Chang's terms. The interview probes need to surface parity indirectly: "Can you describe a recent situation where you had to make a judgment call that no metric could resolve for you?" A practitioner who cannot recall such situations may be reporting their absence, which is the PSF prediction.
Does the disappearance of hard choices vary by seniority? Senior practitioners have more accumulated judgment stock (built through years of pre-AI hard choices). Junior practitioners may never encounter the hard choices that would have built their judgment. PSF predicts different signatures: seniors report that decisions feel easier (Chang territory), juniors report that they were never hard in the first place (a different, possibly more concerning, finding).
How does Chang's values-proxy assumption relate to PSF's "fertile form" claim? Both argue that AI systematically produces proxy-criterion divergence. Chang's argument is philosophical (the assumption is axiologically guaranteed to fail). PSF's argument is organizational (AI's fertile form constitutes proxies through engagement). These may be the same argument at different levels of analysis, or they may be genuinely distinct claims. Worth clarifying.
Chang's "impressive short-term results" and PSF's "self-concealing degradation." These map directly onto each other. Chang provides the philosophical warrant for why short-term proxy success guarantees long-term value misalignment. PSF provides the organizational mechanism through which that guarantee operates. The empirical phase should look for evidence of both: impressive metrics (Chang) combined with invisible erosion (PSF).
Literature move: Borrowed and applied. Chang's parity framework and values-proxy critique are applied to specify what PSF's evaluative capacity erosion actually destroys. PSF does not extend Chang theoretically. It uses her framework to make the abstract architecture of detection, judgment stock, and braking concrete at the level of practitioner experience.
Role in the sequence: Chang occupies the "what is lost" position. Once Nguyen's value capture has colonized reasoning with clear metrics, the conditions under which parity arises disappear. Hard choices vanish not because they are answered but because they are no longer encountered. With them goes the occasion for developing the judgment that would have been built by confronting them. The loss feeds forward into the next cycle (back to Paul): a practitioner whose hard-choice capacity has eroded is even less equipped to anticipate the next round of transformation.
The Paul-Nguyen-Chang sequence is not linear. It cycles. Chang's output (eroded capacity for hard choices) feeds back into Paul's input (reduced capacity to anticipate the next transformation). Each cycle through engagement deepens the proxy seduction:
Cycle 1: The practitioner enters unable to anticipate (Paul). Proxy metrics colonize reasoning (Nguyen). Hard choices disappear from experience (Chang).
Cycle 2: The practitioner whose hard-choice capacity has eroded is even less equipped to anticipate the next round. The proxy metrics are now the baseline, not a substitution. The absence of hard choices is now normal, not a loss.
Cycle n: The criteria that would have detected the original substitution are no longer in anyone's active repertoire. The proxy has become the criterion, not through strategic choice but through the progressive erosion of the evaluative capacity that would have distinguished them.
This feedback structure is what makes proxy seduction progressive rather than static. It is also what makes it structurally different from Goodhart's Law, where the gaming agent retains the capacity to distinguish the measure from the target and simply chooses not to. In proxy seduction, the capacity to make that distinction erodes through the mechanism itself.
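The feedback structure can be made concrete with a toy model. The following is a minimal sketch, not part of PSF and not calibrated to any evidence: the two state variables, the update rules, and the `capture_rate` and `decay_rate` parameters are all hypothetical, chosen only to show how a loop of this shape moves in one direction once it starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Practitioner:
    # Both fields are hypothetical illustrative quantities on a 0..1 scale.
    proxy_reliance: float  # share of decisions a legible metric pre-resolves
    judgment_stock: float  # capacity to tell proxy from criterion


def engagement_cycle(p, capture_rate=0.3, decay_rate=0.2):
    """One pass through the before/during/after sequence (toy update rules)."""
    # During (Nguyen): metrics take over more of the reasoning, and weaker
    # judgment offers less resistance to that takeover.
    reliance = p.proxy_reliance + capture_rate * (1 - p.proxy_reliance) * (1 - p.judgment_stock)
    # After (Chang): decisions already settled by a metric are not met as hard
    # choices, so judgment goes unexercised in proportion to proxy reliance.
    judgment = p.judgment_stock * (1 - decay_rate * reliance)
    # Before (Paul): the next cycle starts from this eroded state.
    return Practitioner(reliance, judgment)


def simulate(start, cycles):
    """Return the practitioner's state at cycle 0, 1, ..., cycles."""
    states = [start]
    for _ in range(cycles):
        states.append(engagement_cycle(states[-1]))
    return states


if __name__ == "__main__":
    for n, s in enumerate(simulate(Practitioner(0.1, 0.8), cycles=5)):
        print(f"cycle {n}: proxy_reliance={s.proxy_reliance:.3f} "
              f"judgment_stock={s.judgment_stock:.3f}")
```

Under these assumed rules, proxy reliance only rises and judgment stock only falls from one cycle to the next, which is the progressive pattern described above. Calling `engagement_cycle` with `decay_rate=0` holds judgment stock fixed, which loosely corresponds to the Goodhart-style gaming agent who retains the capacity to distinguish measure from target.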
Section 1 provides the individual-level phenomenology. Section 2 (Faulkner and Runde, Thornton et al., MacKenzie) provides the organizational and field-level machinery through which individual-level value capture aggregates into organizational proxy seduction. The division of labor:
| Level | Section | What it explains |
|---|---|---|
| Individual practitioner: before engagement | 1 (Paul) | Why the transformation cannot be anticipated |
| Individual practitioner: during engagement | 1 (Nguyen) | How simplified metrics colonize reasoning |
| Individual practitioner: what is lost | 1 (Chang) | Why hard choices disappear and judgment atrophies |
| Technology-organizational interface | 2 (Faulkner and Runde) | How form underdetermines function and positioning drifts |
| Organizational attention and logic | 2 (Thornton et al.) | How logics channel attention and drive asymmetric evaluation |
| Field-level discourse | 2 (MacKenzie) | How discourse constitutes evaluative criteria before organizations engage |
PSF's causal chain runs through both sections: Paul's anticipation failure (1) meets Faulkner and Runde's fertile form (2), which produces Nguyen's value capture (1) channeled by Thornton et al.'s logic asymmetry (2), deepened by Chang's parity elimination (1), and amplified by MacKenzie's performative discourse (2). The individual and organizational levels interleave. Neither section is sufficient on its own.
PSF borrows and extends theoretical resources from multiple literatures, each doing bounded work that no other resource does. Remove any one and the explanatory architecture has a gap. This multi-resource design is justified by the phenomenon-based theorizing methodology (Fisher, Mayer, and Morris, 2021): the phenomenon does not sit within any single literature, so the theoretical architecture must draw from multiple literatures as needed. Performativity, the fourth resource, receives extended treatment across four subsections (2.3.1 through 2.3.4) given its centrality to PSF's most distinctive contributions and its role in connecting field, organizational, and individual levels of the mechanism.
Paul also appears in Section 1 as the first element of the mechanism sequence, so Paul's entry is not repeated in this section (see Section 1.1). The treatment below focuses on the bounded work each resource does in the PSF architecture.
Faulkner and Runde identify a gap in how Information Systems research conceptualizes digital technology. Most IS work jumps from artifacts to human and organizational implications without sufficiently theorizing what digital objects are. Their theory begins from a rigorous ontology of objects and works up from there.
The key distinctions: material objects (physical things with intrinsic properties) and nonmaterial objects (syntactic objects and bitstrings). Digital objects combine material bearers (hardware, servers, physical infrastructure) with nonmaterial content (bitstrings and the syntactic objects they encode). The identity of a digital object, what it is in a social sense, flows not from its intrinsic physical properties but from its social positioning within communities of users and practices.
The form/function distinction: Form is what the technology is: its structure, architecture, capabilities, and properties considered independently of use. Function is the role the technology plays in human activity, what it does in practice. Form underdetermines function. Knowing what GPT-4 can do on a benchmark tells you something about its form. It tells you nothing reliable about its function in any particular organizational context, because function emerges through the constitutive acts by which humans incorporate the technology into their practices.
Identity through social positioning: An MRI scanner acquires the social identity "MRI scanner" by being positioned within a system (a hospital, a radiology department, a diagnostic protocol) such that it occupies a social position with associated system functions. If the same device were positioned differently, it would have a different social identity and different system functions. Most human artifacts are designed with their intended position in mind, so there is usually a reasonable fit between intrinsic capacities and intended system functions. But repositioning is possible, and repositioning changes identity and function.
The fertility implication: Digital objects, AI systems in particular, have what might be called fertile form: their intrinsic capabilities support a wide and indeterminate range of possible system functions. The same underlying model can be positioned as a coding assistant, a writing tool, a decision-support system, a customer service agent, or an evaluation mechanism, and each positioning constitutes a different function. This fertility means that the range of possible functions is wider for AI than for most prior technologies, and that positioning drift through practice is more consequential.
Faulkner and Runde explain why the same AI tool produces different proxy-criterion gaps in different organizations. Function is not fixed in the technology. It emerges through positioning practices. When an organization positions AI as a productivity tool (in the market-logic sense: throughput, speed, cost reduction), it constitutes functions that make market-logic metrics legible and professional-logic criteria invisible. That constitutive act is not irreversible in principle, but it resists revision in practice because accumulated competence and institutional routines build around the constituted function.
PSF specifically uses the repositioning possibility to explain how proxy substitution deepens. Initial positioning as a coding assistant constitutes speed-of-output as a legible metric. Practice then drifts: developers prompt for code they review rather than write, which constitutes a different function (production-to-evaluation shift, per Simkute et al., 2025). The organization has not explicitly chosen to reposition the tool. Positioning has drifted through practice. The new function has different proxy-criterion relationships, but the evaluation infrastructure has not updated because the repositioning was not deliberate and therefore not visible to evaluation.
Leonardi (2011) specifies the temporal dimension of the underdetermination. Technologies simultaneously afford and constrain, and which affordances are realized depends on the routines practitioners bring to the engagement. As routines shift through repeated use, different affordances become salient, and the technology's organizational function changes without any deliberate decision to reposition it.
Leonardi and Leavell (2026) provide the cleanest empirical illustration. Two urban planning organizations used the same AI simulation tool but positioned it differently, constituting different functions. One maintained provisionality, treating AI simulation outputs as provisional inputs to planning decisions that still required professional judgment and stakeholder deliberation. The other produced what Leonardi and Leavell term "artificial certainty," presenting simulations as authoritative predictions. The same technical form, positioned differently, produced different patterns of divergence between what the organizations measured and what they were accountable for. The positioning can drift through accumulated use without any deliberate organizational decision, as practitioners habituate to new workflows and metrics consolidate around observable outputs.
Critical PSF specification: Faulkner and Runde do not discuss AI's particular form-fertility. PSF extends their framework by noting that AI's wide constitutive range makes positioning drift more consequential and less visible than with prior digital technologies. A word processor's form is sufficiently constrained that positioning drift is limited. AI's form is sufficiently open that practitioners can engage with fundamentally different functions using the same tool, on the same day, without recognizing the shift.
Connection to Section 1 (Nguyen): Faulkner and Runde's fertile form is the technology-level precondition for Nguyen's value capture. The form constitutes the proxy metrics (speed, volume, certainty) that then colonize practical reasoning through the competitive advantage Nguyen describes. Without fertile form, the proxies would need to be pre-specified by the institution (as in Nguyen's general case). With fertile form, the proxies are constituted through the engagement itself, which is potentially PSF's strongest claim to distinctiveness from Nguyen's general value capture framework.
Connection to Section 1 (Chang): Faulkner and Runde's form/function underdetermination is the structural reason why AI engagement eliminates hard choices. When the technology's form is open enough to be positioned in multiple ways, the positioning that makes metrics most legible wins (Nguyen's competitive advantage). That positioning resolves comparisons that would otherwise require commitment (Chang's parity). The technology's fertility is what makes the elimination of hard choices systematic rather than incidental.
PSF also borrows the form/function commitment methodologically: the synthesis uses "boundary activity" (function) rather than "bridge actor" (form), following directly from Faulkner and Runde's analytical priority.
How observable is positioning drift in practice? Faulkner and Runde theorize repositioning as a possibility. PSF claims it happens through practice without deliberate choice. The empirical phase should probe for evidence of drift, for example by asking practitioners: "Has what the tool does in your daily work changed since you started using it, and was that change something you chose or something that happened?"
Is "fertile form" an original PSF construct or a straightforward application of Faulkner and Runde? The form/function distinction is theirs. The observation that AI has particularly wide constitutive range is PSF's extension. Whether this extension is a genuine theoretical contribution or simply a contextual observation about one class of digital objects needs honest assessment.
Does form-fertility vary across AI systems? A narrowly trained classification model has less fertile form than a general-purpose LLM. PSF's mechanism should operate more strongly with more fertile tools. The empirical phase could test this if sites vary in the generality of the AI tools they use.
Literature move: Borrowed and extended. PSF borrows the form/function distinction and the social-positioning account of digital object identity. PSF extends by specifying AI's fertile form as a distinct property that makes constitutive drift more consequential than Faulkner and Runde's general framework requires.
Role in the architecture: Faulkner and Runde operate at the technology-organizational interface. They explain why the same tool produces different organizational effects (positioning varies) and why those effects shift without deliberate choice (positioning drifts through practice). In the causal chain, fertile form is the bridge between Paul's anticipation failure (Section 1) and Nguyen's value capture (Section 1): the technology's openness is what constitutes the specific proxy metrics that then colonize reasoning.
Thornton, Ocasio, and Lounsbury synthesize two decades of institutional logics research into a comprehensive metatheory. An institutional logic is the set of material practices and symbolic systems, including assumptions, values, and beliefs, by which individuals and organizations provide meaning to their daily activity, organize time and space, and reproduce their lives and experiences.
The framework identifies several ideal-typical institutional orders (market, corporation, professions, state, religion, family, community) and specifies the organizing principles each provides. Each logic offers distinct answers to: What is the basis of identity? What is the source of legitimacy? What is the basis of attention? What are the rules for resource allocation?
The critical mechanism: Logics are not chosen. They are inherited. Organizations are embedded in fields already structured by prevailing logics. Through socialization, professional training, industry practices, and institutional pressures, organizations absorb logics before they face any particular decision. By the time an organization evaluates AI, the logics available to it have already constrained what options it can perceive, what criteria it will apply, and what outcomes it will value.
Logics and attention: Building on Ocasio's attention-based view of the firm, Thornton et al. emphasize that logics operate by directing what organizational actors attend to. Market logic focuses attention on efficiency, throughput, and cost metrics. Professional logic focuses attention on craft quality, expertise development, and client outcomes. These attentional structures are not merely preferences. They are infrastructural: they organize what information gets gathered, what metrics get tracked, and what vocabulary is available for articulating evaluation.
Multiple logics and conflict: Most organizations operate under multiple logics simultaneously. A software development organization operates under market logic (profitability, throughput) and professional logic (code quality, engineering judgment, craft). The logics do not fully resolve. They coexist in tension. Which logic dominates resource allocation and evaluation in any given period depends on political, cultural, and institutional processes, not rational optimization.
Thornton et al. is PSF's primary theoretical conversation partner and the literature PSF most directly extends. Market logic makes AI-constituted proxy metrics legible: speed of output, volume of code generated, cost per task. Professional logic makes accountable criteria legible: code robustness, maintainability, the quality of judgment exercised in architectural decisions. AI engagement elevates market-logic metrics because the AI's speed and volume make those metrics more salient and more measurable. Professional-logic criteria require tacit judgment built through consequence exposure. They are harder to articulate and harder to measure.
The displacement is not a choice to abandon professional standards. It is an attentional drift: market-logic metrics become more prominent, more frequently discussed, and more directly tied to resource allocation decisions. Professional-logic criteria remain technically available but recede from active evaluative use. PSF calls this salience decay: the criteria that would detect proxy-criterion divergence do not disappear from memory. They disappear from operative relevance.
The market-logic/professional-logic asymmetry: This is where PSF's "asymmetric ambidexterity" construct originates. Organizations do not fail at both market-logic and professional-logic evaluation. They succeed visibly at market-logic evaluation while professional-logic evaluation degrades invisibly. The asymmetry is driven by the differential legibility of the two logics' metrics under AI engagement: AI constitutes market-logic metrics as observable outcomes while rendering professional-logic criteria less accessible.
Performativity connection: Applied at the field level and combined with MacKenzie's performativity framework and Cabantous and Gond's three-mechanism taxonomy, Thornton et al.'s account yields PSF's explanation of how proxy metrics become institutionally legitimate before any individual organization tests their validity. Field-level discourse (consultant reports, frontier lab announcements, industry benchmarks) circulates market-logic AI metrics, establishing them as the evaluative vocabulary. Organizations inheriting that vocabulary enter engagement with criteria already shaped by the performative process.
Connection to Section 1 (Nguyen): Thornton et al. explain why the specific proxy metrics Nguyen's value capture predicts will dominate are market-logic metrics rather than some other simplified values. Nguyen's framework is general (any simplified metric can capture reasoning). Thornton et al. specify the institutional channel through which particular metrics gain competitive advantage: market logic makes speed, volume, and cost legible, and AI engagement makes market-logic metrics even more legible than they were before.
Connection to Section 1 (Chang): The logic asymmetry explains why hard choices disappear from organizational experience (Chang's territory). When market logic dominates attention, the parity situations Chang describes (where competing professional and market values are genuinely on a par) get resolved by default. The market-logic metric is available, clear, and defensible. The professional-logic criterion is harder to articulate. The hard choice never surfaces because one logic has already won the attention competition.
Is "salience decay" empirically distinguishable from deliberate deprioritization? An organization might consciously decide to prioritize speed over craft (a strategy choice, not evaluative erosion). The empirical phase needs to distinguish between organizations that have made an explicit trade-off and organizations where professional logic criteria have receded without anyone noticing. The PSF-distinctive finding is the second pattern.
Does the asymmetry hold across domains? PSF's examples are primarily software development (market logic: throughput, professional logic: craft quality). Does the same asymmetry operate in customer support (market logic: resolution speed, professional logic: diagnostic judgment)? In creative work (market logic: output volume, professional logic: originality, voice)? The empirical phase should test across domains.
Can institutional logics explain why some organizations resist proxy seduction? Thornton et al. predict that organizations with dominant professional logics and compatible AI engagement (Besharov and Smith's "aligned" configuration) should show less proxy seduction. If the empirical phase finds such cases, they would support the institutional logics channel. If professional logic dominance does not protect, PSF may need a different account of variation.
How do logics interact with the Paul-Nguyen-Chang sequence? Market logic may accelerate the sequence (making value capture faster and parity elimination more complete). Professional logic may slow it (preserving some hard choices by maintaining competing criteria). The interaction between institutional logics and the individual-level philosophical sequence is an empirical question the interview data can address.
Literature move: Borrowed and extended. Thornton et al. is the literature PSF most directly extends. The extension: institutional logics theory explains why certain logics dominate attentional structures. PSF explains how AI engagement systematically elevates one logic (market) while occluding another (professional) through a constitutive process that the dominant logic cannot detect because its metrics are what AI engagement produces.
Role in the architecture: Thornton et al. operate at the organizational attention level. They explain how organizations come to attend to proxy metrics rather than accountable criteria, not through a strategic choice but through the differential legibility of competing logics under AI engagement. In the causal chain, institutional logics are the organizational channel through which individual-level value capture (Nguyen, Section 1) scales: market logic amplifies proxy elevation, and the logic asymmetry ensures the degradation remains invisible.
Performativity is not one theory but a family of related claims about how descriptive apparatuses (theories, metrics, categories, material infrastructures) shape the phenomena they purport to describe. The family traces its lineage through Austin's account of performative utterances and Butler's account of reiterative constitution, into sociological studies of economic performativity (MacKenzie, Callon), organizational adaptations (Cabantous and Gond; Gond, Cabantous, Harding and Learmonth), and sociomaterial treatments of digital and AI performativity (Orlikowski and Scott). PSF draws on this family for a specific reason. The framework requires an account of how AI proxy metrics become the legitimate evaluative vocabulary before any individual organization tests their validity, and of how those metrics constitute the evaluation apparatus itself rather than merely mismeasuring it. No other body of theory in the synthesis does this work.
Within the performativity family, PSF operates in the ontological register Faulkner and Runde operate in (Section 2.1), which treats digital objects as having real form (material and nonmaterial properties with capacities that exist independent of any particular use) while treating function, identity, and social meaning as constituted through positioning in communities of use. Form sets the space of possible positionings. Function flows from positioning within that space.
A distinct ontological register in the performativity family, drawing on Barad's agential realism, resists this analytical separation. In the agential-realist register, form and function are co-enacted through sociomaterial practices, and the apparent stability of the object itself is a performative effect of the practices that produce it. Scott and Orlikowski (2025) develop this register specifically for AI. The two registers share commitments to constitutive performativity and to attention to exclusion and accountability. They differ on whether cross-case generalization is legitimate: the register Faulkner and Runde operate in permits it (forms can be compared across cases even as functions vary), while the agential-realist register treats such generalization as risking the reification of configurations that were contingent.
PSF's distinctive contribution operates across these strands. The framework names a mechanism (proxy metrics constituted by engagement, evaluative capacity eroded through the same process) that the foundational work does not specify, connects field-level performative dynamics to organizational-level logic asymmetry (Thornton et al.) and individual-level value capture (Nguyen), and does so in a register that permits the conditional generalization the mechanism claim requires. The section closes with a sustained treatment of what PSF adds and why that addition requires the register in which PSF operates.
MacKenzie, D. (2006) An Engine, Not a Camera: How Financial Models Shape Markets. Cambridge, MA: MIT Press. See also MacKenzie (2006) 'Is economics performative? Option theory and the construction of derivatives markets', Journal of the History of Economic Thought, 28(1), pp. 29-55.
What MacKenzie argues: MacKenzie develops a three-level taxonomy of performativity to explain how economic theories do not merely describe markets but actively reshape them. Generic performativity: a theory or model is used in practice and has some effect on the world. Effective performativity: the model actively changes economic processes in ways that would not have occurred without it. Barnesian performativity: use of the theory makes the world more closely resemble what the theory describes. The term is named after the sociologist Barry Barnes, who described society as a distribution of self-referring knowledge substantially confirmed by the practice it sustains. Black-Scholes did not just describe option prices. As traders adopted it, actual option prices converged on what the formula predicted. The counter-performativity twist: by the time of the 1987 crash, widespread adoption had produced correlated behavior that violated the model's assumptions. The model's very success generated the conditions for its failure.
Callon, M. (2007) 'What does it mean to say that economics is performative?', in MacKenzie, D., Muniesa, F. and Siu, L. (eds.) Do Economists Make Markets? On the Performativity of Economics. Princeton: Princeton University Press, pp. 311-357.
What Callon adds: Callon extends MacKenzie's framework by emphasizing the material and infrastructural conditions that enable performativity. Theories do not perform reality on their own. They perform through socio-technical agencements: the specific configurations of instruments, institutions, practices, and material arrangements that translate theory into action. The implication: performativity is not inevitable. It depends on the strength and stability of the agencements through which theories are enacted.
What PSF does with MacKenzie and Callon: MacKenzie provides the mechanism for field-level proxy constitution. AI discourse does not merely describe what AI does to organizations. At the Barnesian level, it constitutes the evaluative criteria organizations bring to engagement. When frontier labs, consultants, and industry commentators declare that AI increases throughput, reduces latency, and democratizes expertise, organizations absorbing that discourse evaluate their own AI engagement using precisely those criteria. The discourse produces outcomes that confirm it, which generates further discourse of the same kind. PSF's extension: standard Barnesian performativity describes how discourse shapes the phenomenon it describes. PSF specifies that the performative effects operate on the evaluation apparatus itself, not just on organizational practice. This is a claim MacKenzie's framework does not make and cannot make without PSF's constitutive mechanism.
Callon's agencements framework specifies the material infrastructure through which AI's performative effects operate: AI benchmarks (which encode market logic, per Haupt and Brynjolfsson, 2025), consultant reports (which frame AI value in efficiency terms), industry conferences (where success narratives gain legitimacy), and vendor demonstrations. These are not neutral information channels. They are the material infrastructure through which particular evaluative criteria get constituted as the way to assess AI. Temporal implication: field-level discourse operates faster than organizational engagement. Organizations enter engagement with criteria already shaped by the performative process. Counter-performativity analog: the proxy-criterion divergence accumulating through engagement may eventually exceed the field's capacity to suppress it. The Stack Overflow declining trust data, the DORA elite performer findings, and emerging practitioner resistance could be read as early counter-performative signals.
Cabantous, L. and Gond, J-P. (2011) 'Rational decision making as performative praxis: Explaining rationality's éternel retour', Organization Science, 22(3), pp. 573-586.
What Cabantous and Gond argue: Rational decision-making is not merely described by economic theory but performed into existence through specific material and discursive practices. They identify three mechanisms through which performativity operates. Conventional performativity operates through the taken-for-grantedness of metric infrastructure: the tools, templates, and procedures that encode a theoretical framework become so routinized that their status as constructed artifacts disappears from view. Generic performativity operates through active production of the conditions the framework describes: consulting frameworks that measure AI success through throughput metrics produce client organizations that optimize against throughput, confirming the framework's validity in a self-fulfilling loop. Framing performativity operates through the field-level discourse that establishes what counts as evidence before any organization tests the claim operationally.
What PSF does with it: Cabantous and Gond's taxonomy maps the structure of field-level proxy constitution onto three components of PSF's account. Conventional performativity corresponds to metric infrastructure: vendor benchmarks, analyst reports, and maturity models become the default evaluation vocabulary. Generic performativity corresponds to consulting frameworks, whose throughput-based measures produce client organizations that optimize against throughput and so confirm the frameworks' validity. Framing performativity corresponds to executive discourse: field-level discourse establishes what counts as evidence before any organization tests the claim operationally. MacKenzie supplies the constitutive logic. Cabantous and Gond supply the channels.
Gond, J-P., Cabantous, L., Harding, N. and Learmonth, M. (2016) 'What Do We Mean by Performativity in Organizational and Management Theory? The Uses and Abuses of Performativity', International Journal of Management Reviews, 18(4), pp. 440-463.
What Gond et al. argue: This is the canonical review of performativity in organization and management theory (OMT). The authors trace performativity's migration from Austin through Lyotard, Butler, Callon, and Barad, documenting how OMT scholars have used and sometimes abused these foundational sources. Their central diagnosis: OMT has either borrowed performativity loosely without capitalizing on its theoretical resources, or developed variants (critical performativity most prominently) that misread the foundational texts. A central critique follows from that diagnosis: OMT lacks organizational conceptualizations of performativity and analysis of how performativity is organized. Their prescription: a "performative turn" in OMT that takes the concept's constitutive claims seriously and develops those missing conceptualizations.
What PSF does with it: Gond et al. is the load-bearing anchor for the IJMR paper and a significant positioning resource for the OS paper. For IJMR, the argument is direct: Gond et al. called for a performative turn in OMT in 2016, identifying specifically the absence of organizational conceptualizations of how performativity is organized. PSF responds to that call. The three-level propagation account (field discourse shapes organizational logic asymmetry, which in turn shapes individual value capture) is an organizational conceptualization of how performativity is organized. The evaluative capacity architecture (detection, judgment stock, braking) specifies what performativity does when organized in a particular way: it constitutes the evaluation apparatus such that the criteria required to detect the constitution are themselves products of the constitution. That Gond et al. appeared in IJMR (PSF's target journal for the problematization paper) makes this positioning an argumentative opportunity rather than just a citation.
Cabantous, L., Gond, J-P., Harding, N. and Learmonth, M. (2016) 'Critical Essay: Reconsidering Critical Performativity', Human Relations, 69(2), pp. 197-213.
What Cabantous et al. argue: Companion critical essay to Gond et al. 2016. "Critical performativity" (developed by Spicer, Alvesson, Fournier, and others) misreads foundational performativity authors, particularly Austin and Butler, in ways that nullify their political potential. Critical performativity proposes practical interventions in organizational life on the basis of performativity theory but does so without sufficient grounding in the foundational conceptualizations. The essay also emphasizes the materiality of performativity, which critical performativity tends to underplay.
What PSF does with it: Less load-bearing than Gond et al. 2016 but useful for positioning. PSF is not critical performativity in the Spicer-Alvesson sense. PSF is constitutive performativity in the MacKenzie-Cabantous-Gond sense. The distinction matters because critical performativity is the most visible OMT performativity tradition in some subfields. Reviewers reading PSF through that frame may ask why PSF does not engage the critical performativity literature on its own terms. The answer: PSF operates in a different tradition with different theoretical commitments, and the Cabantous et al. 2016 essay documents why that distinction matters. Citation is primarily a positioning move rather than a mechanism extension.
The three Orlikowski and Scott papers that follow share performative and constitutive commitments with the register PSF operates in, but differ from it on ontological grounds. Drawing on Barad's agential realism, Scott and Orlikowski treat what other registers call "the object" and "the practice" as co-enacted through sociomaterial performance rather than as analytically separable elements. In their strongest formulation, the apparent stability of a digital object (a model, a platform, a standard) is itself a performative effect of the practices that sustain it; there is no form that exists prior to and independent of the practices that constitute it.
This ontological commitment has consequences for cross-case generalization. Scott and Orlikowski's framework permits, and calls for, rich genealogical accounts of how particular configurations come to be. It resists the move of identifying patterns that recur across configurations, because such identification risks reifying contingent configurations into apparently universal features. PSF operates in a different register. PSF claims that a specific mechanism recurs when organizations engage with AI under specified conditions (market-logic dominance, fertile form, intensive and prolonged engagement). This claim requires treating the form of the engaged-with AI as sufficiently stable across cases to permit comparison, which is the move the register Faulkner and Runde operate in (Section 2.1) permits and the agential-realist register does not. The three papers that follow are therefore read as adjacent work in a different ontological register, not as allied theorizing that PSF extends.
Orlikowski, W.J. and Scott, S.V. (2014) 'What happens when evaluation goes online? Exploring apparatuses of valuation in the travel sector', Organization Science, 25(3), pp. 868-891.
What Orlikowski and Scott argue: Study of TripAdvisor as an apparatus of valuation in the hospitality industry. The paper develops a sociomaterial account of how online evaluation platforms reconfigure the evaluative criteria of an industry. Traditional hospitality valuation (Michelin, AA, tourist board ratings) depended on formal professional audit criteria. The rise of online review platforms produces a new apparatus of valuation that operates through aggregated non-professional judgments at scale, algorithmic ranking, and continuous real-time reconfiguration. The apparatus does not merely measure hotel quality differently. It constitutes different criteria for what counts as hotel quality. Individual hotels adapt their practices. Industry norms shift. The formal accreditation apparatuses lose authority.
What PSF does with it: The closest existing empirical instantiation of a PSF-style mechanism in a non-AI context, which makes the citation useful even across ontological registers. The "apparatus of valuation" term maps onto PSF's "evaluation apparatus." The finding that the platform constitutes new criteria rather than measuring prior criteria differently parallels PSF's proxy-constitution claim in an earlier technological register. PSF extends the pattern to AI by specifying that AI's fertile form (Faulkner and Runde) makes the constitutive process more consequential than online review platforms were. The empirical observation carries regardless of ontological register. The theoretical framing PSF uses for it (form permits analysis, positioning does constitutive work) differs from the framing Orlikowski and Scott would use (form and practice co-enacted, no analytical separation possible).
Scott, S.V. and Orlikowski, W.J. (2022) 'The Digital Undertow: How the Corollary Effects of Digital Transformation Affect Industry Standards', Information Systems Research, 33(1), pp. 311-336. Extended in Orlikowski, W.J. and Scott, S.V. (2023) 'The Digital Undertow and Institutional Displacement: A Sociomaterial Approach', Organization Theory, 4(2), pp. 1-16.
What Scott and Orlikowski argue: The "digital undertow" is the pattern of indirect institutional displacement produced by digital transformation. Direct digital changes produce visible disruptions. Underneath them, institutional apparatuses (standards, accreditation schemes, professional authorities) are displaced by corollary effects of the transformation that are not themselves the intended target of transformation. The 2022 ISR paper uses the ISBN standard in publishing as the empirical case. The 2023 Organization Theory paper extends the framework theoretically and argues that "strong sociomateriality" grounded in agential realism provides the appropriate analytic for studying the undertow.
What PSF does with it: The digital undertow framework is analytically parallel to PSF's asymmetric ambidexterity construct but operates at a different level. Scott and Orlikowski document institutional displacement as a sociomaterial consequence of digital transformation generally. PSF specifies a particular kind of institutional displacement: the displacement of professional-logic evaluative criteria by market-logic proxy metrics through AI engagement. The relationship is nesting at the empirical level (PSF's mechanism is one undertow pattern) with ontological difference at the theoretical level.
Scott, S.V. and Orlikowski, W.J. (2025) 'Exploring AI-in-the-making: Sociomaterial genealogies of AI performativity', Information and Organization, 35(1), 100558.
What Scott and Orlikowski argue: The most recent and most directly AI-focused paper in this strand. AI should be studied not as a fixed "thing" but as "phenomena in-the-making." Methodological proposal: sociomaterial genealogical inquiry, a method that traces how particular AI configurations come to be stabilized and what they enact. The theoretical anchor: Barad's agential realism. The paper provides a framework for orienting qualitative research toward the performativity of ongoing AI reconfigurations and sociomaterial accountabilities. It identifies performative effects and sociomaterial exclusions as the core objects of inquiry. It extends the digital undertow framework to AI specifically.
What PSF does with it: The most delicate positioning relationship in the synthesis. Scott and Orlikowski (2025) and PSF share significant empirical commitments: AI as constituted through engagement rather than fixed in its effects, attention to what is excluded when configurations stabilize, recognition that performativity operates at multiple levels. The relationship PSF claims is that of adjacent work in a different ontological register, not that of an ally with a distinct contribution. Two specific points of divergence. First, PSF's mechanism claim requires conditional generalization across cases (proxy seduction recurs when organizations engage with AI under specified conditions). Scott and Orlikowski's framework, taken strictly, treats such generalization as risking the reification of contingent configurations. The register Faulkner and Runde operate in, in which PSF operates, permits the generalization; the agential-realist register resists it. Second, Scott and Orlikowski's contribution is methodological and ontological (how to study AI performatively). PSF's contribution is a specific mechanism that operates on the empirical terrain both frameworks address. These are complementary but different kinds of theoretical move. The positioning needs to be made explicit in the OS paper rather than left implicit, because reviewers familiar with Scott and Orlikowski will test whether PSF genuinely operates in a different register or whether it smuggles in structural claims that their framework would resist. The answer: PSF operates in a different register (the one Faulkner and Runde operate in), acknowledges agential realism as a legitimate alternative register, and positions its mechanism claim as conditional and register-dependent rather than universal.
The performativity literature provides PSF with indispensable theoretical resources. The framework's claim that field-level AI discourse constitutes the evaluative categories organizations bring to engagement depends on Barnesian performativity (MacKenzie). The claim that conventional, generic, and framing performativity operate as distinct channels depends on Cabantous and Gond. The claim that performativity can be studied as an organizational phenomenon responds to Gond, Cabantous, Harding and Learmonth's call for a performative turn in OMT. The recognition that digital apparatuses reconfigure industry evaluative criteria draws on Orlikowski and Scott. What does PSF add that this literature does not already provide? The contribution has three components.
First: performativity applied to the evaluation apparatus itself, not to practice or outcomes. MacKenzie and Callon theorize how performativity reshapes the phenomenon being described (market prices, option values). Cabantous and Gond theorize how performativity reshapes decision-making practices. Scott and Orlikowski theorize sociomaterial performativity at the level of practices and institutional apparatuses. None of them specifies how performative effects operate on the evaluative criteria used to assess AI engagement. PSF's claim is that AI discourse constitutes the evaluation apparatus itself, not just the practices it evaluates or the outcomes it measures. The criteria by which organizations judge whether AI engagement is working are themselves products of the performative process. This matters for PSF's mechanism because it explains why detection fails: the evaluative capacity required to detect proxy-criterion divergence is constituted by the same performative process that produces the divergence.
Second: integration of performativity with institutional logics and individual-level value capture. Performativity at field level (MacKenzie, Callon) meets institutional logics at organizational level (Thornton et al.) meets value capture at individual level (Nguyen) to produce PSF's three-level propagation mechanism. Each level reinforces the others through feedback loops: field discourse constitutes proxy metrics that shape organizational attention that channels individual practitioner reasoning that produces outcomes that feed back into field discourse. The integration is not novel in its individual links. MacKenzie has been connected to institutional theorizing before; value capture has been connected to attentional structures before. The three-level propagation as a unified mechanism running across field, organization, and individual, with specified feedback loops at each transition, is a specific PSF contribution. The integration also answers Gond, Cabantous, Harding and Learmonth's call for organizational conceptualizations of how performativity is organized: PSF names the organization of performativity at three levels and specifies what each level contributes and how the levels connect.
Third: the sincere-belief distinction, which depends on performativity. PSF's claim that proxy substitution operates through genuine organizational belief rather than strategic gaming requires the constitutive premise performativity provides. If evaluation criteria were stable independent of engagement, the only way for organizations to optimize against the wrong criteria would be strategic error (Goodhart's Law). With constitutive performativity, the criteria themselves are produced by the engagement, and organizations optimizing against them are optimizing against criteria they sincerely hold, which happen to be criteria the engagement produced. This distinction is what separates PSF from Goodhart-family accounts and is what requires performativity as the theoretical ground. The sincere-belief claim also explains why PSF predicts that training and awareness interventions will not reliably correct proxy seduction: the criteria practitioners would need to apply to recognize the substitution are themselves products of the process that produced the substitution. Fernandes et al. (Section 3.1) provide empirical support: higher AI literacy correlates with lower metacognitive accuracy, exactly what the sincere-belief claim predicts.
Why the ontological register matters. The three contributions above share a structural feature: each claims that a specific mechanism recurs across organizational cases. Proxy constitution recurs. The three-level propagation recurs. The sincere-belief dynamic recurs. These claims require treating the engaged-with AI as sufficiently stable across cases to permit comparison. PSF inherits this permission from Faulkner and Runde's form-function distinction (Section 2.1), which treats the form of the digital object as having real properties independent of any particular organizational engagement, while treating function, identity, and social meaning as constituted through positioning in communities of use. The form is the stable element that underwrites cross-case comparison. The functions vary. PSF's mechanism claim is that the variation is patterned rather than random, and that the pattern is specifiable under boundary conditions the framework names. The agential-realist register in which Scott and Orlikowski (2025) operate does not permit this move. In their strictest formulation, each organizational AI engagement produces its own enactment, with only family resemblance connecting them. PSF cannot inherit from this register without abandoning the mechanism claim that makes it worth reading. This is not a rejection of Scott and Orlikowski's framework. Their framework is doing different theoretical work, in a different register, for different purposes. PSF is one register's answer to a mechanism question that agential realism, by design, resists answering.
Three evaluative capacity dimensions under investigation: detection (whether practitioners can sense divergence between proxy and criterion), judgment stock (whether practitioners have consequence-built tacit knowledge sufficient to discriminate proxy from criterion), and braking (whether evaluative institutional infrastructure can interrupt displacement once underway).
Each dimension draws on primary and supporting literature. Primary sources do bounded work that no other source does for that dimension. Supporting sources provide texture, empirical grounding, or reviewer-defense capability.
Detection is whether practitioners can sense that proxy metrics have diverged from accountable criteria. Detection failure is not ignorance. It is the structural inability to perceive a divergence when the evaluative vocabulary available does not encode the divergence as a recognizable category.
Shaw, S.D. and Nave, G. (2026) 'Thinking: Fast, slow, and artificial: How AI is reshaping human reasoning and the rise of cognitive surrender', SSRN. DOI: 10.2139/ssrn.6097646.
What Shaw and Nave argue: Shaw and Nave introduce a "Tri-System Theory" that adds System 3 (AI-augmented cognition) to Kahneman's dual-process model. Through a series of controlled experiments, they document "cognitive surrender": the systematic pattern by which human reasoning defers to AI outputs regardless of accuracy. Participants adopted AI outputs on roughly 80% of faulty trials and showed inflated confidence despite errors. Accuracy became a function of AI output quality across studies, a large effect. Access to AI increased confidence by approximately 12 percentage points even when half the outputs were wrong. Shaw and Nave identify a dose-response relationship: as System 3 usage increased, participants' accuracy increasingly tracked AI accuracy rather than reflecting independent judgment. Study 3 tested recalibration conditions. Per-item financial incentives plus immediate correctness feedback caused participants to reject incorrect AI outputs at more than twice the baseline rate. Recalibration worked because three conditions held: feedback was immediate, accuracy signals were unambiguous, and consequences were personal.
What PSF does with it: Shaw and Nave provide the cognitive micro-foundation for judgment stock erosion and detection failure. Cognitive surrender describes how practitioners' criterion-level evaluation becomes coupled to AI-output quality without reliable self-awareness of that dependency, which is the individual-level instantiation of what PSF calls judgment stock erosion. Study 3 provides the basis for rejecting the "practitioners can just notice" objection: recalibration is possible but requires conditions (immediacy, unambiguity, personal consequence) that organizational settings do not provide. This supports PSF's sincere-belief claim: practitioners are not choosing to ignore criterion-level evidence; their metacognitive monitoring has been suppressed by the mechanism Shaw and Nave document. The dose-response relationship maps onto engagement depth: lighter users retain more independent judgment while heavy users' accuracy increasingly tracks AI accuracy.
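The dose-response pattern can be made concrete with a toy mixture model. The sketch below is not Shaw and Nave's analysis. It assumes, purely for illustration, that a practitioner adopts the AI output on a fixed share of items (`reliance`) and answers independently on the rest. The function names and every numeric value are hypothetical.

```python
def expected_accuracy(reliance: float, own_accuracy: float, ai_accuracy: float) -> float:
    """Expected accuracy when a share `reliance` of answers is adopted from the AI.

    Illustrative assumption: adopted answers are correct at the AI's rate,
    all other answers are correct at the practitioner's own rate.
    """
    if not 0.0 <= reliance <= 1.0:
        raise ValueError("reliance must lie in [0, 1]")
    return reliance * ai_accuracy + (1.0 - reliance) * own_accuracy


def accuracy_swing(reliance: float, own_accuracy: float,
                   ai_good: float, ai_bad: float) -> float:
    """Change in expected accuracy when AI quality drops from ai_good to ai_bad."""
    return (expected_accuracy(reliance, own_accuracy, ai_good)
            - expected_accuracy(reliance, own_accuracy, ai_bad))


if __name__ == "__main__":
    # Hypothetical values: own skill held fixed, only reliance differs.
    for label, reliance in [("lighter user", 0.2), ("heavy user", 0.8)]:
        swing = accuracy_swing(reliance, own_accuracy=0.6, ai_good=0.9, ai_bad=0.5)
        print(f"{label}: accuracy swing with AI quality = {swing:.2f}")
```

Under this assumption the swing equals the reliance share times the gap in AI quality, so the heavy user's accuracy moves four times as far with AI quality as the lighter user's, although neither practitioner's own skill has changed.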
Ocasio, W., Laamanen, T. and Vaara, E. (2018) 'Communication and attention dynamics: An attention-based view of strategic change', Strategic Management Journal, 39(1), pp. 155-167.
What Ocasio et al. argue: Ocasio, Laamanen, and Vaara extend the attention-based view of the firm to incorporate communication as a mechanism of strategic change. Their central insight: organizational attention is not merely influenced by communication but constituted through it. The vocabularies available to managers shape what they can attend to, and therefore what they can perceive as requiring strategic response. Managers "may miss opportunities and threats that they cannot comprehend with existing vocabularies."
What PSF does with it: Ocasio et al. ground the articulation failure dimension of detection. Detection requires not just that practitioners sense proxy-criterion divergence (which tacit knowledge enables) but that they can articulate it in terms that institutional vocabularies can receive. The automation-logic vocabulary offers "productivity," "efficiency," and "time savings." These terms cannot encode "I spent three hours debugging code that looked correct but contained a subtle error I would never have introduced myself." The learning resists institutionally legible form. Practitioners have the experience and attempt to articulate it, but the articulation does not travel up to update institutional expectations because the institutional vocabulary cannot accommodate it. The result is continued engagement under declining trust. The vocabulary gap is the gap that prevents practitioner experience of proxy-criterion divergence from traveling up organizational channels as a signal that would activate the braking function.
Weber, K. and Glynn, M.A. (2006) 'Making Sense with Institutions: Context, Thought and Action in Karl Weick's Theory', Organization Studies, 27(11), pp. 1639-1660.
What Weber and Glynn argue: Weber and Glynn address a persistent criticism of Weick's sensemaking theory: that it neglects the role of larger social and historical contexts. They argue that institutions provide the raw materials for sensemaking. When organizational actors face ambiguous situations, they do not construct meaning from scratch. They draw on institutionally provided vocabularies, categories, identities, and scripts. Weber and Glynn identify three mechanisms by which institutions shape sensemaking: priming (makes certain interpretations more accessible), editing (filters out interpretations that violate institutional expectations), and triggering (activates sensemaking when institutional prescriptions conflict or fail).
What PSF does with it: Weber and Glynn supply the attentional mechanism for PSF's displacement process. Priming explains why market-logic metrics surface naturally when practitioners evaluate AI-assisted work: efficiency, throughput, and speed are the categories the institutional environment makes most accessible. Editing explains why professional-logic criteria are filtered out of upward-traveling organizational discourse: they violate the expectations of the automation-logic vocabulary that dominates the field. Triggering explains when the filtering breaks down: when proxy-criterion divergence becomes sufficiently severe to create visible institutional failure. But even triggered sensemaking draws on institutional resources; it does not escape them. PSF uses Weber and Glynn to specify the vocabulary-attention chain: institutional logics supply available vocabularies, vocabularies constrain attention, constrained attention produces institutionally shaped sensemaking, and the chain operates to render proxy-criterion divergence institutionally invisible even when practitioners have tacit experience of it.
Endsley, M.R. (2017) 'From here to autonomy: lessons learned from human-automation research', Human Factors, 59(1), pp. 5-27.
What Endsley argues: Endsley synthesizes decades of research on human-automation interaction. Her situational awareness model distinguishes three levels: perception of relevant elements in the environment (Level 1), comprehension of what those elements mean (Level 2), and projection of future states given current trends (Level 3). Automation typically handles Level 1 (perception) while humans are expected to maintain Level 2 and Level 3 awareness. The problem: Level 2 and 3 situational awareness depends on actively processing Level 1 information. If automation removes the need to perceive, comprehension and projection capacity atrophies.
What PSF does with it: Endsley's situational awareness model maps precisely onto PSF's judgment stock dimension: Level 2 and Level 3 situational awareness are the cognitive substrates of proxy-criterion discrimination. A practitioner who can only operate at Level 1 (perceiving AI-generated code as output) but has lost Level 2 (comprehending what the code actually does in context) and Level 3 (projecting whether it will hold up under future conditions) cannot discriminate proxy from criterion. AI engagement that removes active Level 1 construction therefore directly degrades the judgment stock that PSF identifies as the prerequisite for detection.
Endsley, M.R. (2023) 'Situation Awareness in Human-AI Systems', Journal of Cognitive Engineering and Decision Making, 17(2), pp. 87-98.
What Endsley argues: Endsley's updated analysis addresses AI-specific situational awareness challenges. Generative AI creates novel situational awareness problems because AI outputs can be plausible, detailed, and wrong in ways that are not flagged by any system indicator. Traditional automation provides clear status signals (the autopilot is engaged; the alarm has triggered). Generative AI produces fluent prose or code that looks correct while containing subtle errors. There is no status indicator for "this output is subtly wrong." The practitioner must independently evaluate the output, which requires the very situational awareness that AI assistance has not been designed to maintain.
What PSF does with it: Endsley (2023) provides specific support for PSF's detection dimension: the inability to detect proxy-criterion divergence is not just a matter of attentional drift or institutional vocabulary constraints. It is also a situational awareness problem: the outputs that most diverge from criterion-level quality are often the outputs that look most fluent and plausible, which are precisely the outputs that trigger lower-vigilance evaluation. This is why PSF's detection claim is not just organizational but cognitive: even practitioners with sufficient judgment stock may fail to apply it consistently because the proxy metric (output fluency and plausibility) triggers an evaluative mode that is calibrated to the wrong surface.
Messeri, L. and Crockett, M.J. (2024) 'Artificial intelligence and illusions of understanding in scientific research', Nature, 627, pp. 49-58.
What Messeri and Crockett argue: AI tools in scientific research generate outputs carrying the formal markers of epistemic insight without the underlying process, producing what the authors term "illusions of understanding." The formal markers (proper citations, professional prose, correctly structured analyses) have been reliable quality signals throughout practitioners' careers.
What PSF does with it: Messeri and Crockett identify a parallel structure to PSF's detection failure in scientific research. Detection difficulty compounds because the formal markers that trigger learned inferences of quality are precisely the markers AI reliably produces. Overriding that learned inference requires active epistemic confidence, the capacity to trust one's own judgment when the system's output projects authority. This is what erodes under sustained engagement.
Orlikowski, W.J. and Gash, D.C. (1994) 'Technological Frames: Making Sense of Information Technology in Organizations', ACM Transactions on Information Systems, 12(2), pp. 174-207.
What Orlikowski and Gash argue: Orlikowski and Gash develop the concept of technological frames to explain how different groups within an organization understand and engage with new technology. Technological frames are the assumptions, expectations, and knowledge people use to understand technology's nature, purpose, and value. Their central finding, drawn from a study of groupware implementation, is that different organizational groups develop different technological frames for the same technology. Technologists understood the groupware through a collaborative work frame. Managers understood it through an efficiency frame. Users understood it through a task-completion frame. Frame incongruence across groups produced implementation problems that no single group could diagnose. Frames also exhibit inertia: initial frames shape what people notice about the technology, which reinforces the frame.
What PSF does with it: Orlikowski and Gash's frame incongruence is the observable symptom that PSF predicts at the meso level: practitioners whose criteria have been constituted through direct AI engagement hold a different frame from managers whose criteria have been constituted by field-level discourse. The inertia property explains why meso-level boundary activity is insufficient to surface proxy-criterion divergence: frame inertia means the manager's automation-logic frame is self-confirming. The interview protocol for PSF's empirical phase should probe not just whether frame incongruence exists but whether the incongruence is itself a signal of proxy seduction.
Goldschmidt, G. (1991) 'The dialectics of sketching', Creativity Research Journal, 4(2), pp. 123-143.
What Goldschmidt argues: Goldschmidt studied how designers actually think while designing. She identifies a cognitive dialectic essential to productive design: the oscillation between "seeing-as" and "seeing-that." Seeing-as is categorical perception: perceiving something as an instance of a category. Seeing-that is noticing properties: perceiving what is actually on the page without immediately categorizing it. Productive design requires continuous oscillation between these modes. Seeing-as without seeing-that becomes rigid. Seeing-that without seeing-as becomes meaningless.
What PSF does with it: Goldschmidt's seeing-as/seeing-that dialectic provides a cognitive mechanism for detection failure: practitioners who have fully constituted the proxy-as-criterion frame (seeing-as: "this output is productive") lose the capacity to notice features that would prompt reinterpretation (seeing-that: "this output looks right but contains a subtle error I would not have introduced"). PSF can invoke this as a cognitive mechanism for detection failure without requiring the full premature arrest architecture.
Rathje, S. and Van Bavel, J.J. (2026) 'How AI can fuel confirmation bias', OSF preprint.
What Rathje and Van Bavel argue: Short conceptual synthesis identifying three mechanism families through which AI may amplify confirmation bias: biased information search (through biased prompting and sycophantic response), biased interpretation (via the "bias blind spot" and naive realism, with users treating belief-confirming AI as unbiased while viewing gently challenging AI as highly biased), and biased memory (through AI's persistent memory features creating an "invisible layer" of bias that permeates future interactions). The authors identify overconfidence as a compounding outcome, drawing on Fernandes et al. on metacognitive miscalibration and on their own Rathje et al. 2025 sycophancy RCT on inflated self-perception. They acknowledge that AI can equally support truth-seeking and that the outcome depends on design choices and user orientation, framing the phenomenon as contingent rather than inevitable. The piece coins "echo chamber of one" and "Library of Babel for rationalizations" as conceptual labels for the personalized and near-infinite character of AI-generated rationalizations.
What PSF does with it: Rathje and Van Bavel's three-mechanism taxonomy is a cognitive-psychology ally from a different disciplinary direction, arriving at PSF-adjacent conclusions without the organizational or institutional framing. The mechanisms map imperfectly onto PSF's three evaluative capacity dimensions: biased search degrades detection by making confirming evidence more accessible; biased interpretation degrades the evaluative apparatus through the bias blind spot; biased memory produces the cumulative lock-in PSF's judgment stock erosion describes. The piece is strongest on the detection dimension, where its account of users rating sycophantic AI as unbiased while rating challenging AI as biased provides the cognitive micro-foundation for PSF's claim that proxy seduction is self-concealing. Three PSF extensions the piece does not make: the evaluative-capacity architecture as an integrated framework rather than a list of bias types; multi-level propagation from cognitive to organizational to field; and the material-braking argument for why AI differs structurally from prior information technologies. The piece invokes the "amplifier" metaphor throughout, which shares the weakness PSF identifies in Foss's retreat: the metaphor presupposes stability of what the engagement actually reshapes. Despite that weakness, the piece is a useful cognitive-psychology bridge citation, and the empirical RCT it draws on (Rathje et al. 2025) belongs in 5.1 Direct Evidence.
Judgment stock is the consequence-built tacit knowledge that enables practitioners to discriminate between proxy metrics and accountable criteria. It is built through the feedback loop connecting evaluation to consequences. It is not a stable endowment but a developmental achievement that requires ongoing practice to maintain and extend.
A specific pattern within judgment stock erosion warrants naming: capacity drawdown with delayed surfacing. The proxy-constituted optimization loop does not only displace the criterion. It draws down the substrate (individual skill, collective tacit knowledge, community infrastructure, evaluative capacity itself) that produced the original proxy-criterion correlation. The measurement apparatus continues to report the proxy trajectory as success. The substrate depletion is invisible to the measurement frame. At some threshold, the substrate can no longer sustain the correlation. Conditions shift (novelty, competition, demand for polimorphic judgment), and the depleted substrate manifests as proxy collapse. The break is sudden from inside the measurement frame. It is not sudden from outside it. Capacity drawdown operates at multiple organizational layers: individual (where Shen and Tamkin's 17% comprehension gap and Bastani et al.'s metacognitive decoupling document the substrate depletion in educational contexts, and Beane's three conditions specify what developmental infrastructure is being drawn down), collective (the community of practice whose transmission infrastructure has weakened), ecosystem (the platform whose creator base has homogenized, see Section 4.4 Wan et al. for the ByteDance illustration), and evaluative (the apparatus that would detect the drawdown is itself subject to drawdown). This multi-layer property distinguishes capacity drawdown from simple skill atrophy.
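The delayed-surfacing pattern can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not an empirical model: the linear drawdown, the growth rate, the threshold, and all names (`Period`, `simulate`, `first_proxy_drop`) are hypothetical choices made only to show the shape of the dynamic. The substrate declines every period but is not reported; the proxy keeps rising until the substrate falls below the level needed to sustain it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Period:
    t: int
    substrate: float  # unmeasured capacity (skill, tacit knowledge, infrastructure)
    proxy: float      # the only series the measurement frame reports


def simulate(periods: int = 12, drawdown: float = 0.08,
             proxy_growth: float = 0.05, threshold: float = 0.5) -> List[Period]:
    """Toy dynamic: steady substrate drawdown, proxy growth until a threshold."""
    history: List[Period] = []
    proxy = 1.0
    for t in range(periods):
        # Assumed linear depletion of the substrate, floored at zero.
        substrate = max(0.0, 1.0 - drawdown * t)
        if t > 0:
            if substrate >= threshold:
                proxy *= 1.0 + proxy_growth        # proxy still reports success
            else:
                proxy *= substrate / threshold     # correlation can no longer hold
        history.append(Period(t, substrate, proxy))
    return history


def first_proxy_drop(history: List[Period]) -> Optional[int]:
    """First period in which the reported proxy falls, or None if it never does."""
    for prev, cur in zip(history, history[1:]):
        if cur.proxy < prev.proxy:
            return cur.t
    return None


if __name__ == "__main__":
    for p in simulate():
        print(f"t={p.t:2d}  substrate={p.substrate:.2f}  proxy={p.proxy:.2f}")
```

With these assumed defaults the proxy series shows no decline before period 7, while the substrate series has been falling since period 1. An observer restricted to the proxy sees a sudden break; an observer with access to the substrate sees a steady drawdown.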
The four layers above (individual, collective, ecosystem, evaluative) cover much of what capacity drawdown draws down, but they do not cleanly capture a layer where recent observations push for separate articulation. The bundled employment structure within which judgment formation happens is itself a substrate that drawdown can deplete. Long-tenure employment with bundled responsibilities creates conditions for tacit pattern recognition to develop, for cross-functional triangulation, for institutional memory accumulation, and for embeddedness in relationships that make judgment legible across a team. Each is a condition under which judgment can grow and circulate at scale beyond a single practitioner. The conditions sit between the collective layer (community of practice, partly informal) and the ecosystem layer (platform or field), and warrant separate articulation when AI engagement renders work into legible-and-transferable task units. Once work is rendered into such units, the institutional form that bundled work into long-tenure positions begins to look like overhead. Contractor substitution becomes thinkable at the discrete-task level. Benefits packages, retention programs, and onboarding investments lose the implicit justification that came from treating evaluator capacity as an asset built over time. The institutional form dissolves not by deliberate decision but by drift, as the evaluative case for maintaining it weakens at each point of decision.
Anchors at the institutional-form layer are distinct from the existing four layers. Deloitte 2026 (revenue +5%, parental leave halved, pension accruals eliminated, fertility assistance scrapped, with the Wharton commentator's "with the job market slack, they feel they can" as published candor) is the cleanest contemporary case. The Google contractor-FTE ratio (121,000 to 102,000 by 2019) shows the trajectory was underway before the current AI engagement wave. Massenkoff and McCrory document occupational-level exposure gaps and pipeline hollowing as observable patterns at the employment-structure layer. The AI Code Glut case shows the layer drawdown in compressed form: experienced engineers being laid off at the same firms producing AI-generated review backlogs.
The discourse driving drawdown at this layer is also distinct. Diamandis's "organizational singularity" claim treats Coasean substitution as accomplished. Dykstra's social-contract framing accepts the premise and runs a left-flank policy argument from it. Both illustrate framing performativity (Cabantous and Gond) operating at the institutional-form layer: field-level discourse establishes what counts as evidence before any organization tests the claim, making contractor-ization decisions defensible at the level of strategy, accelerating the drawdown the discourse describes.
The fifth layer is conditional on three falsification probes: a substitution probe (whether long-tenure FTE bundling preserves evaluative capacity that contractor-task substitution does not, at matched AI engagement), a fungibility probe (whether contractor populations with sufficient tenure develop equivalent judgment-formation conditions), and a reduction probe (whether the layer reduces to the collective plus ecosystem layers without loss). See the working note for the full development.
Beane, M. (2024) The Skill Code: How to Save Human Ability in an Age of Intelligent Machines. New York: Harper Business.
What Beane argues: Beane synthesizes over a decade of ethnographic research across more than 30 occupations and professions to identify what enables human skill development. His framework, the "skill code," identifies three conditions that must be present for novices to develop expertise: Challenge, Complexity, and Connection. Challenge means working near but not beyond the edge of current capability. Skill develops when people struggle with tasks that stretch them. Remove the challenge (by automating difficult tasks or by protecting novices from difficulty), and skill development stalls. Complexity means engaging with the broader system, not just isolated tasks. Connection means relationships of trust and respect between experts and novices. Skill transfer is fundamentally social. Beane's central claim: AI engagement typically degrades all three conditions. AI handles challenging tasks, removing practice opportunities. AI decomposes work into discrete tasks, reducing complexity. AI mediates or replaces expert-novice relationships, severing connection.
What PSF does with it: Beane grounds PSF's judgment stock dimension with the most direct empirical and theoretical account of how that stock is built and how it erodes. The three conditions are the developmental conditions under which consequence-built tacit knowledge (PSF's judgment stock) accumulates. When AI engagement degrades Challenge, Complexity, and Connection, it is not merely a training problem or a skill gap: it is the systematic erosion of the conditions that produce the evaluative capacity needed to detect proxy-criterion divergence. PSF uses Beane to support the temporal claim: proxy seduction does not just displace current criteria, it degrades the pipeline through which the criterion-level judgment that would detect substitution is reproduced across cohorts. The interview protocol should probe Beane's three conditions as organizational indicators of judgment stock health.
Polanyi, M. (1966) The Tacit Dimension. Garden City, NY: Doubleday.
What Polanyi argues: Polanyi developed the concept of tacit knowledge to challenge the prevailing view that all genuine knowledge could be made explicit and formalized. His famous formulation: "we can know more than we can tell." This is not merely a claim about communication difficulty. It is a claim about the structure of knowledge itself. Polanyi distinguishes focal awareness (what we attend to) from subsidiary awareness (what we attend from). When riding a bicycle, we attend to staying balanced and navigating. We attend from our sense of the handlebars, pedals, and our own shifting weight. Critically, if we try to focus on the subsidiary elements, we lose the skill. The expert cyclist who attends to their weight distribution rather than the road ahead will wobble. Tacit knowledge operates in the subsidiary dimension and resists being made focal without destroying the very competence it enables. Tacit knowledge is passed on through apprenticeship and practice, not through instruction manuals.
What PSF does with it: Polanyi grounds PSF's judgment stock dimension. Judgment stock is the accumulated tacit knowledge of practitioners who have lived through the consequences of their evaluations. It is built through the feedback loop connecting evaluation to consequences. When AI engagement shifts practitioners from producing work to reviewing AI-generated work, the subsidiary awareness that develops through production is not developed through review. The practitioner retains focal knowledge (they can articulate the evaluation criteria) while losing the subsidiary feel that makes their evaluation reliable. This is why PSF predicts that proxy seduction deepens as engagement intensifies: the tacit foundation of criterion-level judgment erodes in the subsidiary dimension without registering in any explicit quality metric.
Collins, H. (2010) Tacit and Explicit Knowledge. Chicago: University of Chicago Press.
What Collins argues: Collins extends and systematizes Polanyi's insight by distinguishing three types of tacit knowledge. Relational tacit knowledge is knowledge that could in principle be made explicit but has not been. Somatic tacit knowledge resides in the body and brain. Collective tacit knowledge is the strongest form. It exists only in communities of practice and can only be acquired through socialization into those communities. The knowledge of how to conduct oneself as a scientist, what counts as an interesting question, a convincing argument, a legitimate method, is collective tacit knowledge. Collins argues we have "no foreseeable way to describe it fully or build machines that possess it." Collective tacit knowledge is not merely difficult to articulate: it is not the kind of thing that can be articulated. It exists in the relations between people, not in any person or document.
What PSF does with it: Collins grounds the organizational dimension of judgment stock in PSF. The proxy-criterion discrimination that PSF identifies as the key evaluative capacity is largely collective tacit knowledge: the standards for distinguishing code robustness from code that merely passes tests exist in professional communities and are acquired through socialization. As AI engagement reduces shared practice, the community interactions through which collective tacit knowledge is sustained and transmitted are weakened. Proxy seduction thus has a second-order effect on judgment stock: it does not merely shift what practitioners attend to, it erodes the community infrastructure through which the criterion-level judgment is reproduced.
Collins, H. and Kusch, M. (1998) The Shape of Actions: What Humans and Machines Can Do. Cambridge, MA: MIT Press.
What Collins and Kusch argue: Collins and Kusch distinguish two fundamental types of human action. Mimeomorphic actions are those that actors try to carry out "in the same way" across similar situations. The action has a correct form that can be specified, demonstrated, and replicated. Polimorphic actions vary with social context in ways that cannot be fully specified in advance. They require judgment about what counts as "the same situation" and what response is appropriate.
What PSF does with it: The mimeomorphic/polimorphic distinction grounds PSF's account of what is lost when judgment stock erodes. AI-constituted proxy metrics evaluate mimeomorphic surfaces: does the output meet the specified criteria, pass the tests, achieve the throughput targets. The accountable criteria that proxy seduction displaces are largely polimorphic: does the output reflect the kind of situated judgment that distinguishes expert from novice. Proxy metrics can measure the mimeomorphic surface reliably. They cannot detect the erosion of polimorphic capacity because that capacity was never measurable by the instruments that make proxy metrics attractive.
Collins, H. (2018) Artifictional Intelligence: Against Humanity's Surrender to Computers. Cambridge: Polity Press.
What Collins argues: Collins acknowledges that AI is better at mimicking social competence than he initially anticipated, but maintains that without embodiment and membership in a human community, AI cannot possess genuine social tacit knowledge. The machine can produce outputs that look polimorphic without being polimorphic in the generative sense. Collins calls this "faking" socialness: a high-level statistical mimicry that reproduces the shape of social action without participating in the social life that gives that shape its meaning. Collins poses the "behavioral bridge" challenge: if the mimicry is convincing to observers, does the philosophical distinction matter in practice?
What PSF does with it: PSF's contribution is to explain how organizations come to treat mimicry as equivalent to situated judgment through the mechanism of proxy seduction, not through a general evaluative failure. The proxy metric (individual output quality, user satisfaction ratings) registers mimicry as satisfactory because it evaluates the mimeomorphic surface. The accountable criterion (collective diversity, polimorphic contribution to the portfolio) is what erodes. PSF adds to Collins's framework the institutional mechanism through which this failure mode is reproduced and stabilized: once market-logic metrics constitute AI-generated outputs as "productive," professional-logic criteria for detecting the mimicry gap become organizationally illegible.
Dreyfus, H.L. and Dreyfus, S.E. (1986) Mind Over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. New York: Free Press.
What Dreyfus and Dreyfus argue: Dreyfus and Dreyfus argue that human expertise develops through stages that resist mechanization. Beginners follow rules; advanced beginners recognize context-specific features; competent practitioners adopt priorities and plans; proficient practitioners see situations holistically; experts act intuitively from a vast repertoire of experience-based patterns. Crucially, expertise emerges from thousands of hours of situated practice with feedback. It cannot be accelerated through instruction because it is built from the accumulated experience of being wrong and learning from it.
What PSF does with it: Dreyfus and Dreyfus ground the temporal dimension of PSF's judgment stock claim: judgment stock is not a stable endowment but a developmental achievement that requires ongoing practice to maintain and extend. If AI engagement freezes practitioners at intermediate developmental stages by removing the challenge and complexity through which they would advance, then the judgment stock available for proxy-criterion discrimination is permanently capped below the expert level. This supports PSF's cohort claim: organizations that heavily engage AI among early-career practitioners may produce cohorts that never develop the senior-level judgment stock that would make detection of proxy-criterion divergence possible.
Bainbridge, L. (1983) 'Ironies of Automation', Automatica, 19(6), pp. 775-779.
What Bainbridge argues: Bainbridge identified a fundamental paradox in automated systems: automation intended to eliminate human error creates new forms of human error. Two ironies are central. First, as automation handles routine operations, human operators become monitors rather than active controllers. Their active skills atrophy through disuse. Yet they are expected to intervene competently during rare system failures, when their skills have most degraded. Second, designers who automate routine tasks often leave the most difficult tasks to humans, precisely the ones that are most resistant to automation.
What PSF does with it: Bainbridge grounds the judgment stock erosion argument empirically. PSF's claim is that AI engagement degrades the feedback loop through which judgment stock is built and maintained. Bainbridge shows this pattern is well-documented in adjacent domains (industrial control, aviation) before it appears in knowledge work. The irony of automation is a specific form of PSF's mechanism: operators are monitoring the system that has replaced their active skill, but the monitoring does not reproduce the skill because it lacks the consequence-exposure loop that built the skill in the first place.
Beane, M. (2019) 'Shadow Learning: Building Robotic Surgical Skill When Approved Means Fail', Administrative Science Quarterly, 64(1), pp. 87-123.
What Beane argues: Beane's ethnographic study of robotic surgery training reveals how novice surgeons, denied legitimate access to challenging cases by supervision protocols, developed skills "in the shadow" of the system: finding workarounds to get the practice they needed. Protecting novices from challenge also protects them from skill development.
What PSF does with it: Shadow learning is relevant to PSF's empirical design: the interview protocol should probe whether practitioners with high judgment stock in AI-intensive environments maintain their stock through shadow learning activities (side projects, open source work, independent practice) rather than through their primary organizational work. In AI knowledge work, shadow learning is harder to access than in Beane's surgical context because the challenge has been removed at the task level, not just the supervision level.
Friis, O.V. and Riley, J. (2024) 'Automation and the Loss of Competence: Theoretical Perspectives', Journal of Applied Psychology. (In press)
What Friis and Riley argue: Friis and Riley review theoretical foundations for competence loss under automation, identifying three distinct mechanisms: skill degradation through disuse (the use-it-or-lose-it mechanism), skill non-acquisition among new entrants (the never-built-it mechanism), and metacognitive miscalibration (the overconfidence mechanism). These three mechanisms interact: disuse reduces active skill, non-acquisition reduces the cohort baseline, and overconfidence prevents the recognition that skill levels have changed.
What PSF does with it: Friis and Riley's three mechanisms map directly onto PSF's judgment stock erosion across three temporal horizons. Disuse degradation maps onto current practitioners who are losing the feedback loop that maintained their judgment stock. Non-acquisition maps onto entering practitioners who are building judgment stock in AI-intensive environments that never provide the consequence-exposure required for full development. Metacognitive miscalibration maps onto Shaw and Nave's cognitive surrender: practitioners whose judgment stock has degraded do not know it has degraded because the recalibration mechanism (consequence feedback) has been disrupted by AI engagement.
Simkute, A., McAulay, D. and Sellen, A. (2025) 'The Absent Expert: Shifting Roles in AI-Assisted Design', CHI Conference Proceedings.
What Simkute et al. argue: Simkute, McAulay, and Sellen document a systematic shift in designer roles when AI design tools are introduced: from active production to passive evaluation. The shift is rapid and often unreflective. Designers move into evaluation roles without explicit decision or discussion. Production-mode competencies (generative facility, exploratory thinking, material fluency) are exercised less while evaluation-mode competencies (critical assessment, selection criteria, feedback articulation) are exercised more.
What PSF does with it: The production-to-evaluation shift is the practice-level observable through which PSF's judgment stock erosion mechanism operates in knowledge work: practitioners shift from producing to evaluating, and the productive competencies that sustain judgment stock (the tacit feel for craft, the material fluency, the generative exploration) are no longer practiced. The interview protocol should probe this shift as an early observable indicator of judgment stock degradation.
Shen, J.H. and Tamkin, A. (2026) 'How AI impacts skill formation', arXiv.
What Shen and Tamkin argue: In a randomized experiment, AI-assisted developers scored 17% lower on comprehension assessments than unassisted developers. The interaction patterns developers reported as most productive were the ones that prevented learning.
What PSF does with it: Shen and Tamkin quantify the divergence between the proxy (task completion speed) and the criterion (understanding of the code produced). The finding that the most productive-feeling patterns are the most learning-preventing patterns is a direct instantiation of PSF's self-concealing mechanism. Proxy seduction operates through the very practices that feel most effective.
Bastani, H., Bastani, O. and Sungu, A. (2025) 'Generative AI without guardrails: Metacognitive decoupling in AI-assisted learning.'
What Bastani et al. argue: Students using standard ChatGPT scored 17% worse on unassisted exams while reporting confidence in learning that did not occur. The felt-learning proxy and the actual-learning criterion decouple, and the decoupling is invisible to the learner.
What PSF does with it: Bastani et al. confirm the metacognitive dimension of proxy seduction. The confidence-without-competence pattern maps directly onto PSF's detection failure mechanism: practitioners believe they are learning (the proxy) while their unassisted capability erodes (the criterion). The study also corroborates Shen and Tamkin's non-formation finding in a different context.
Braking concerns whether evaluative institutional infrastructure can interrupt proxy displacement once it is underway. Braking failure does not mean that organizations lack the formal capacity to intervene; it means that the signals that would trigger intervention are themselves products of the proxy evaluation apparatus.
Argyris, C. (1990) Overcoming Organizational Defenses: Facilitating Organizational Learning. Boston: Allyn and Bacon.
What Argyris argues: Argyris spent decades studying why organizations fail to learn from experience. His answer: defensive routines. Organizations develop systematic practices for avoiding threatening information, protecting existing beliefs, and preventing embarrassment. Defensive routines are "any policy, practice, or action that prevents organizational participants from experiencing embarrassment or threat and, at the same time, prevents them from discovering the causes of the embarrassment or threat." They are doubly dangerous: they prevent learning, and they prevent recognition that learning is being prevented. Argyris identifies "skilled incompetence": the ability to produce precisely the defensive behaviors that prevent learning while believing oneself to be acting rationally and constructively. Defensive routines are self-sealing: attempts to discuss them trigger more defensiveness.
What PSF does with it: Argyris grounds PSF's braking dimension directly. Braking refers to whether organizational evaluative infrastructure can interrupt proxy displacement once underway. Defensive routines are the mechanism through which braking fails: the same organizational infrastructure that would be needed to surface proxy-criterion divergence is captured by the defensive routines protecting the AI investment narrative. PSF's distinctive addition to Argyris is the constitutive mechanism: it is not merely that defensive routines protect bad decisions after the fact, but that AI engagement has already reconstituted what counts as a good decision by constituting proxy metrics as the operative evaluation vocabulary. Argyris's defensive routines then protect the constituted reality, not just a prior choice.
Weick, K.E. (1995) Sensemaking in Organizations. Thousand Oaks, CA: Sage.
What Weick argues: Weick establishes sensemaking as a distinct process by which organizations construct actionable understanding from ambiguous situations. Sensemaking is not decision-making or interpretation. It is the prior process of constructing the situation that will then be interpreted and decided upon. Weick identifies seven properties of sensemaking, of which the plausibility criterion is crucial: organizations do not have the time or cognitive resources to verify accuracy. They settle for accounts that hang together, that fit with what they already believe, that enable action.
What PSF does with it: Weick contributes to PSF's account of why the braking dimension fails to activate. The proxy narrative is plausible: it is consistent with direct observable experience (individual outputs are faster and often cleaner), socially validated within organizations (colleagues report similar improvement), and consistent with field-level discourse (vendor claims, consultant reports, benchmark results). The criterion-level narrative (we are substituting proxy metrics for accountable criteria and our judgment is eroding) is implausible by Weick's criteria: it is not directly observable, not socially validated, and inconsistent with the dominant field narrative. Sensemaking will stabilize the plausible account and suppress the implausible one, which is precisely what PSF predicts: the felt experience of improvement drives the displacement, and the displacement is sustained by sensemaking processes that favor the market-logic account.
March, J.G. (1991) 'Exploration and Exploitation in Organizational Learning', Organization Science, 2(1), pp. 71-87.
What March argues: March identifies a fundamental tension between exploration (developing new knowledge, capabilities, options) and exploitation (refining existing competencies). Both are essential. They compete for scarce resources. Exploitation involves refinement, efficiency, selection, execution. Its returns are relatively certain, proximate, and easy to measure. Exploration involves search, variation, risk-taking, discovery. Its returns are uncertain, distant, and hard to measure. Organizations tend toward exploitation because its returns are more visible and certain. An organization that focuses on exploitation will improve short-run performance but become increasingly obsolete.
What PSF does with it: March is load-bearing in PSF for the asymmetric ambidexterity construct. PSF's claim is that organizations do not fail at both exploitation and exploration. They succeed visibly at exploitation while exploration degrades invisibly. AI engagement constitutes market-logic metrics (exploitation observables) as the evaluative vocabulary while rendering professional-logic criteria (exploration capacity) invisible. March's exploitation/exploration tension provides the organizational learning foundation for this asymmetry: the self-reinforcing property of exploitation means that once market-logic metrics are constituted as the operative vocabulary, the resources, attention, and institutional support that would maintain exploration-oriented professional-logic criteria are progressively diverted. PSF uses March to show that asymmetric ambidexterity is not an accident or a manageable tradeoff but a structural tendency of organizational learning that AI engagement amplifies.
Levitt, B. and March, J.G. (1988) 'Organizational Learning', Annual Review of Sociology, 14, pp. 319-340.
What Levitt and March argue: Organizations encode lessons from experience into routines, but the encoding process is subject to distortions. Superstitious learning occurs when organizations draw incorrect causal inferences from experience. Competency traps occur when favorable performance with an existing procedure leads to accumulated experience that reinforces commitment, even when a superior alternative exists.
What PSF does with it: Levitt and March's superstitious learning and competency trap mechanisms are the organizational learning pathways through which proxy seduction becomes institutionally locked in. Superstitious learning explains how the proxy narrative gets encoded as organizational knowledge: the organization attributes aggregate output improvements to AI engagement without recognizing that the improvement reflects proxy-metric gains rather than criterion-level outcomes. Competency traps explain why proxy seduction is difficult to reverse even after detection: the organization has built routines, skills, and resource allocation patterns around AI-assisted workflows.
Marquis, C. and Lounsbury, M. (2007) 'Vive la Résistance: Competing Logics and the Consolidation of Community Banking', Academy of Management Journal, 50(4), pp. 799-820.
What Marquis and Lounsbury argue: They provide an empirical demonstration of how competing institutional logics produce divergent organizational responses to identical environmental pressures. Their study of community banking shows that logic prevalence in local contexts predicted outcomes better than economic variables. Two banks with identical financial profiles might have opposite fates depending on which logic dominated their community. Once a logic prevailed, it became self-reinforcing.
What PSF does with it: Marquis and Lounsbury contribute to PSF's account of the braking failure: when proxy seduction has elevated market logic to dominance, professional-logic criteria cannot reassert themselves through evidence because logic conflicts resolve through power rather than accuracy. This supports PSF's claim about the structural difficulty of detection: even when organizations have practitioners with sufficient judgment stock to detect proxy-criterion divergence, the institutional political dynamics may prevent the detection from becoming organizationally actionable.
These sources do background, reviewer-defense, or empirical-design work without being primary PSF-load-bearing sources. They are grouped thematically. Entries retain the full "What X argues" treatment for reference value but note their specific PSF function.
Nicolini, D. (2012) Practice Theory, Work, and Organization: An Introduction. Oxford: Oxford University Press.
What Nicolini argues: Nicolini provides a comprehensive synthesis of practice-theoretical approaches to organization. Practices are the fundamental unit of analysis for understanding social and organizational life. They are organized constellations of activities that hang together because they share understandings, rules, and teleoaffective structures. Knowledge exists in practice, not before it. Practices are materially mediated.
What PSF does with it: Nicolini's practice theory contributes the deep ontological grounding for PSF's claim that proxy seduction operates through sincere belief. If evaluation criteria are practice-constituted, then practitioners who have practiced AI-assisted work have genuinely constituted different criteria through that practice. They are not misremembering or strategically misrepresenting their standards; their standards have been reconstituted through the practice itself.
Schatzki, T.R. (2001) 'Introduction: Practice theory', in Schatzki, T.R., Knorr Cetina, K. and von Savigny, E. (eds.) The Practice Turn in Contemporary Theory. London: Routledge, pp. 10-23.
What Schatzki argues: Schatzki identifies three elements that hold a practice together: practical understandings (know-how), rules (explicit formulations), and teleoaffective structures (the shared sense of what the practice is for, what counts as success, what emotions are appropriate). Teleoaffective structures are collective and largely implicit, absorbed through participation rather than taught through instruction.
What PSF does with it: Schatzki's teleoaffective structures provide the practice-theoretical grounding for PSF's salience decay construct. What has changed is not the practical understandings (practitioners can still articulate what good code looks like) or the explicit rules (code review requirements have not changed), but the teleoaffective structure: the shared sense of what software development is for, what counts as satisfying work. When AI engagement shifts this from "build and understand" to "direct and review," the criteria that were operative under the old structure remain articulable but no longer feel like the right way to assess work.
Orlikowski, W.J. (2007) 'Sociomaterial practices: Exploring technology at work', Organization Studies, 28(9), pp. 1435-1448.
What Orlikowski argues: Drawing on Karen Barad's agential realism, Orlikowski takes an onto-epistemological position: the social and material do not interact as separate things. They constitute each other in practice. AI capabilities do not exist apart from organizational practice. There is no "AI in itself" to evaluate. AI only exists as AI-in-practice.
What PSF does with it: PSF uses Faulkner and Runde's form/function distinction rather than Orlikowski's agential realism, because Faulkner and Runde give more analytic traction for specifying what organizations get wrong (form/function conflation) and why (function emerges through practice). Orlikowski's contribution to PSF is primarily to ground the claim that AI-in-practice differs from AI-in-prospect.
Barad, K. (2007) Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Durham, NC: Duke University Press.
What Barad argues: The observer and observed are not separate things that exist before observation and then come into contact. They come into being together through the act of observing. The measurement apparatus participates in creating what gets measured. Boundaries between observer and observed are not pre-given. They are enacted through "agential cuts."
What PSF does with it: Barad remains relevant as background grounding for the constitutive claims PSF makes, particularly the claim that measuring proxy metrics makes the organization into something that produces proxy-metric results. If reviewers challenge PSF's constitutive claims on ontological grounds, Barad provides the deeper philosophical warrant.
Kellogg, K.C., Valentine, M.A. and Christin, A. (2020) 'Algorithms at Work: The New Contested Terrain of Control', Academy of Management Annals, 14(1), pp. 366-410.
What Kellogg et al. argue: They provide a comprehensive review of algorithmic management, identifying six mechanisms of algorithmic control: Restricting, Recommending, Recording, Rating, Replacing, and Rewarding. Rating algorithms do not merely measure performance: they constitute what "performance" means. Once the algorithm defines performance, alternatives become invisible.
What PSF does with it: Kellogg et al. is load-bearing for the institutional logics displacement mechanism at the organizational level. The six mechanisms describe how AI-mediated algorithmic control reconstitutes what "performance" means through Rating and Recording. The AI tools organizations use track, record, and rate against market-logic metrics (speed, throughput, completion rates), and that tracking constitutes those metrics as the operative definition of performance. This is Barnesian performativity operating at the organizational level.
Albert, S. and Whetten, D.A. (1985) 'Organizational identity', Research in Organizational Behavior, 7, pp. 263-295.
What Albert and Whetten argue: Organizations have self-definitions that function like individual identity, centering on features seen as central, distinctive, and enduring. Identity claims are not mere descriptions but commitments that shape what the organization can perceive and do.
What PSF does with it: Proxy seduction operates through genuine identification with the new metrics, not strategic misrepresentation. Practitioners who have built their professional identity around "building sophisticated systems" may sincerely adopt "managing AI-generated outputs" as equivalent because the identity label persists even as the underlying practice reconstitutes.
Whetten, D.A. (2006) 'Albert and Whetten Revisited: Strengthening the Concept of Organizational Identity', Journal of Management Inquiry, 15(3), pp. 219-234.
What Whetten argues: Identity claims are performative: they do not just describe the organization but partially constitute it by creating accountability structures.
What PSF does with it: Once an organization publicly commits to being "AI-forward" or "AI-native," that identity claim creates accountability structures that resist disconfirming evidence. The braking dimension of evaluative capacity is weakened when identity commitments are at stake.
Gioia, D.A., Schultz, M. and Corley, K.G. (2000) 'Organizational identity, image, and adaptive instability', Academy of Management Review, 25(1), pp. 63-81.
What Gioia et al. argue: Identity maintains apparent continuity through stable labels while the meanings of those labels shift. An organization that has always been "innovative" may mean something quite different by "innovation" now than it did twenty years ago. The label persists; the substance changes. This allows organizations to adapt while maintaining a sense of continuity.
What PSF does with it: Gioia et al.'s adaptive instability is an important PSF mechanism for explaining how proxy seduction persists without detection. PSF's construct of salience decay operates through the adaptive instability mechanism: practitioners keep applying the label "quality" or "good code" while the substantive meaning behind those labels quietly reconstitutes around AI-constituted proxies. The label stability masks the criterion drift. This is not strategic deception; it is the normal adaptive process, operating on evaluative vocabulary. Adaptive instability is one of the micro-mechanisms through which PSF's sincere belief claim is supported.
Nag, R., Corley, K.G. and Gioia, D.A. (2007) 'The intersection of organizational identity, knowledge, and practice', Academy of Management Journal.
What Nag et al. argue: Organisational identity is constituted through knowledge practices, not merely declared through labels. The practices through which an organisation enacts its core work are what give meaning to identity claims.
What PSF does with it: Nag et al. establish the identity-practice constitution link that PSF relies on for the salience decay account. When AI engagement changes the practices through which craft is enacted, the identity migrates even as the label holds. A development team that still calls itself "craftspeople" may no longer mean what it originally meant by "craft." From inside the organisation, the migration from professional to market logic feels like continuity rather than change. Paired with Gioia et al.'s adaptive instability: the label persists while the practice reconstitutes.
Barrett, M., Oborn, E., Orlikowski, W.J. and Yates, J. (2012) 'Reconfiguring Boundary Relations: Robotic Innovations in Pharmacy Work', Organization Science, 23(5), pp. 1448-1466.
What Barrett et al. argue: Ethnographic study of pharmaceutical-dispensing robots showing complex, contradictory boundary reconfiguration. The same technology produced different effects for different groups. What elevated pharmacists' clinical role simultaneously deskilled assistants' dispensing work. Boundaries were reconfigured through distributed, emergent, and situational adjustments, not through designated intermediaries.
What PSF does with it: Methodologically foundational for PSF's empirical design through the boundary activity reframing. Asking "who are the bridge actors?" would have missed much of what Barrett et al. observed. Asking "where does boundary activity occur?" captures the distributed, practice-constituted nature of the phenomenon. PSF extends this by asking specifically what happens to the proxy-criterion relationship during boundary reconfiguration.
Carlile, P.R. (2004) 'Transferring, Translating, and Transforming: An Integrative Framework for Managing Knowledge Across Boundaries', Organization Science, 15(5), pp. 555-568.
What Carlile argues: Three progressively complex boundary types: syntactic (information transfer suffices), semantic (translation required), and pragmatic (transformation necessary because interests and practices have diverged). Novelty determines which type is operative.
What PSF does with it: Carlile's progression maps onto the depth of proxy-criterion divergence: early in engagement, the gap can be surfaced through translation. As engagement deepens and proxy metrics become the constituted vocabulary, the gap shifts to pragmatic: surfacing it would require managers to transform their evaluative framework. This is why proxy seduction deepens over time: the boundary conditions for correction become progressively more demanding while the organizational resources for pragmatic boundary work are simultaneously eroding.
Pickering, A. (1995) The Mangle of Practice: Time, Agency, and Science. Chicago: University of Chicago Press.
What Pickering argues: Knowledge emerges through iterative cycles of resistance and accommodation between human intentionality and material agency. Extended to AI contexts, the mangle implies that resistance operates bilaterally: AI resists human intention through adaptive outputs, and AI also accommodates human inputs. The human tunes to the AI; the AI tunes to the human. Stabilization is not just difficult but conceptually unclear: what would it mean to "finish tuning" when the thing you are tuning to is itself tuning to you?
What PSF does with it: Pickering's mangle grounds the temporal dimension of proxy seduction. Bilateral tuning means proxy metric constitution is not a one-time event but an ongoing process. As practitioners tune to AI outputs, their sense of what constitutes "good enough" is continuously recalibrated by what AI reliably produces, which is precisely the proxy-metric surface that market logic makes legible.
Oborn, E., Barrett, M., Orlikowski, W.J. and Kim, A. (2019) 'Trajectory Dynamics in Innovation', Organization Science, 30(5), pp. 1097-1123.
What Oborn et al. argue: Innovations are not fixed objects but trajectories, ongoing processes of development and transformation. Four patterns: separation, coordination, diversification, and integration.
What PSF does with it: Organizations where AI engagement follows the integration pattern are most susceptible to PSF's mechanism because integration involves the most thorough reconstitution of evaluative criteria.
Faik, I., Barrett, M. and Oborn, E. (2020) 'How Information Technology Matters in Societal Change: An Affordance-Based Institutional Logics Perspective', MIS Quarterly, 44(3), pp. 1359-1390.
What Faik et al. argue: Logics shape which affordances actors perceive and how they actualize them. They identify three mechanisms: sensegiving, translating, and decoupling.
What PSF does with it: Faik et al.'s affordance-institutional logics integration is the closest existing framework to PSF's account of how market logic constitutes AI's proxy-metric affordances as the operative ones while rendering professional-logic affordances imperceptible. The decoupling mechanism is directly relevant to the Stack Overflow data: practitioners adopting AI while experiencing proxy-criterion divergence they cannot surface institutionally.
Barrett, M. and Orlikowski, W. (2021) 'Scale matters: Doing practice-based studies in the digital world', MIS Quarterly, 45(1b), pp. 467-472.
What Barrett and Orlikowski argue: Digital technologies complicate traditional notions of "the local" because digital practices are inherently multi-scalar.
What PSF does with it: PSF's empirical design needs to trace multi-scalar dynamics: practitioner-level criterion drift, organizational-level logic displacement, field-level Barnesian performativity. A single-site study will miss the field-level dynamics; a field-level study will miss the practitioner-level judgment stock erosion.
Teece, D.J. (2007) 'Explicating dynamic capabilities', Strategic Management Journal, 28(13), pp. 1319-1350.
What Teece argues: The dynamic capabilities framework holds that firms can sense opportunities and threats, seize value, and transform themselves. A key assumption: firms can sense environmental changes accurately.
What PSF does with it: PSF challenges this assumption for transformative AI. If AI engagement transforms sensing capacity itself, the dynamic capabilities framework cannot operate as theorized. The "dynamic capability" organizations believe they are building may be a proxy capability constituted by the engagement rather than a genuine enhancement of evaluative capacity.
Cohen, W.M. and Levinthal, D.A. (1990) 'Absorptive Capacity', Administrative Science Quarterly, 35(1), pp. 128-152.
What Cohen and Levinthal argue: An organization's ability to recognize the value of new external knowledge, assimilate it, and apply it is a function of its prior related knowledge. Absorptive capacity is path-dependent and cumulative, and it contains an investment paradox: you need absorptive capacity to recognize the value of investing in absorptive capacity.
What PSF does with it: If AI engagement degrades the conditions under which prior related knowledge develops, then absorptive capacity erodes across generations. The self-inflicted lockout dynamic: organizations create the conditions for their own future inability to absorb. The practitioners who would detect proxy-criterion divergence are those with the highest judgment stock, but AI engagement systematically reduces the developmental conditions through which judgment stock builds, so detection capacity degrades across cohorts.
O'Reilly, C.A. and Tushman, M.L. (2013) 'Organizational Ambidexterity', Academy of Management Perspectives, 27(4), pp. 324-338.
What O'Reilly and Tushman argue: Ambidexterity is the organizational capacity to simultaneously pursue exploitation and exploration. Resource allocation between units requires accurate assessment of relative opportunity.
What PSF does with it: PSF's asymmetric ambidexterity is positioned against this literature. Ambidexterity frameworks cannot solve proxy seduction by better structural design because the measurement problem is upstream of structural choices: organizations cannot balance what they cannot see.
Wan, F., Yang, T., Shi, X., Rong, K. and Ansari, S. (2026) 'Scaling high and wide: how firms leverage AI and organizational design to overcome the scale-scope trade-off', Strategic Management Journal, forthcoming. Accepted version: Academy of Management Proceedings (2025), Cambridge Apollo repository. Best Paper Award, EURAM 2025.
What Wan et al. argue: Wan and colleagues extend the resource-based view by positioning AI as a fungible, scale-free resource that dissolves the classical scale-scope trade-off. Drawing on a longitudinal case study of ByteDance from 2012 to 2024 (34 interviews with 29 professionals, field observation, archival data), they develop a three-phase process model: building AI to scale up through user preference shaping and network effect amplification; replicating AI to expand scope through middle-platform architecture and independent spin-offs; leveraging AI to scale up and expand scope simultaneously through business-unit restructuring and ecosystem partner empowerment. The theoretical claim is that AI possesses three distinctive resource properties: replicability without depreciation across domains, self-reinforcement through data feedback loops, and freedom from opportunity cost. The classical RBV assumption that resources are rivalrous in use does not hold for AI.
What PSF does with it: Wan et al. belongs in this section for the same reason Teece, Cohen and Levinthal, and O'Reilly and Tushman belong: it is a strategy framework whose claims presuppose evaluative continuity, and PSF challenges the presupposition rather than the empirical observations the framework reports. The RBV extension asserts that AI's replicability and self-reinforcement produce exponential growth across domains. The evidence for this claim is the metric trajectory at ByteDance (user growth, revenue, cross-domain scaling). The engagement-as-value equation is not argued. It operates at three levels of the paper: frame (the ecosystem-as-value-cocreation register treats users contributing engagement as symmetric with creators contributing content and advertisers contributing monetization), citation chain (the AI-as-resource claim is delegated to Gregory et al., Sjödin et al., Krakowski et al.), and operational treatment (user retention is the success criterion throughout the findings section, with no distinction drawn between retention driven by users finding content valuable and retention driven by algorithmic engineering against attrition).
PSF's divergence from Wan et al. does not require rejecting the firm's stated accountable criterion in favor of an external welfare criterion the firm did not accept. The mechanism operates inside the firm's own revenue criterion, through channels the engagement metric does not measure: ad-inventory quality (engagement volume grows while advertiser bid-depth erodes as attention quality degrades), supply-side content-mix shifts (creator base homogenizes, platform loses adaptive capacity), and regulatory tail risk (ByteDance runs restricted Douyin for the Chinese domestic market and unrestricted TikTok for export markets, indicating internal operational recognition of proxy-criterion gaps that the public-facing metric frame does not surface). This keeps PSF inside the strategy register and distinguishes it from welfare-based attention-platform critiques (Harris, Zuboff, and adjacent work) that operate from externally imposed criteria.
The paper's evidence base contains a direct observation of the PSF mechanism in its primary data. See Section 5.1 (ByteDance "algorithmic failure" passage) for the extracted passage and its evidentiary treatment. The foil value is not that Wan et al. take a different view. The foil value is that the paper's evidence base contains the PSF mechanism and the paper's theoretical frame cannot register it. This parallels the pattern PSF identifies at the organizational level: the mechanism operating inside the firm while the evaluative apparatus cannot surface it. Here the pattern operates at the scholarly-field level. The paper is peer-reviewed, forthcoming in a top-tier journal (SMJ), received EURAM Best Paper 2025, and is authored by a senior voice on multiple editorial boards. The scholarly evaluative apparatus at its most credentialed layer does not register the mechanism its own primary data documents. PSF treats this as a field-level instance of the same evaluative capacity failure the framework identifies at the organizational level.
Status note: Wan et al. is forthcoming in SMJ. The Academy of Management Proceedings accepted version (2025, Cambridge Apollo) is the citable anchor pending publication. Re-check against SMJ version when it appears; full-length version may contain mechanism detail not present in the 35-page Proceedings abridgement.
Relationship to capacity drawdown (Section 3.2): The ByteDance case illustrates capacity drawdown at the ecosystem layer, paired with the individual-layer evidence documented by Shen and Tamkin, Bastani et al., and Beane in Section 3.2. The same proxy-constituted optimization loop operates at different organizational layers with different substrates drawn down. Sziebert's practitioner observation of the eighteen-month wall (Section 5.3) provides a practitioner-voice supplement to the peer-reviewed individual-layer sources.
PSF operates at the organizational level, explaining why organizations systematically misjudge AI engagement outcomes and fail to self-correct. Most available evidence comes from individuals. This is not a limitation. It is the nature of the phenomenon. What organizations do is pattern individual experiences, stabilize them through routines and identity, and make them invisible through aggregation and sensemaking.
The evidence varies in epistemic status. Direct evidence measures perception-reality gaps with methodological rigor. Mechanism hypotheses are borrowed from Human Factors and remain to be tested. Field-level patterns are visible only across organizations. Gray literature and field notes illustrate the phenomenon the inquiry is trying to explain; they are not evidence for the theory itself.
METR (2025) 'Measuring the Impact of Early-2025 AI Models on Experienced Open-Source Developer Productivity.'
METR conducted a pre-registered randomized controlled trial with 16 experienced open-source developers working on their own repositories. Developers expected AI tools would make them 24% faster, and after the study they still estimated they had been about 20% faster. They were actually 19% slower. The 39-point gap between retrospective belief and measured outcome (43 points when measured against the initial forecast) occurred among practitioners who should have been ideally positioned for accurate self-assessment: experienced, working on their own code, with every reason to assess accurately.
PSF frames METR as the signature instance of proxy seduction: practitioners whose judgment stock has been constituted through years of direct consequence exposure construct plausible accounts of AI-assisted speed (the proxy metric: throughput) while missing the professional-logic costs (actual task completion time). The gap is not random noise; it is what Barnesian performativity produces when the field-level AI productivity narrative has constituted speed-of-output as the operative evaluative category before objective measurement has been applied. This is the anchor finding. If sophisticated practitioners with aligned incentives cannot perceive AI's effects accurately, something structural is happening that individual judgment and experience cannot overcome.
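The arithmetic behind the METR gap can be made explicit. The following is a minimal sketch, assuming the 24% forecast and 19% slowdown cited in this entry plus the roughly 20% retrospective estimate reported in the METR study; the function and variable names are illustrative only.

```python
def perception_gap(believed_speedup_pct: float, measured_speedup_pct: float) -> float:
    """Gap in percentage points between believed and measured speedup.

    A slowdown is expressed as a negative speedup.
    """
    return believed_speedup_pct - measured_speedup_pct


forecast_pct = 24.0       # speedup developers expected before the study
retrospective_pct = 20.0  # speedup developers believed they got, reported after the study
measured_pct = -19.0      # measured outcome: 19% slower

forecast_gap = perception_gap(forecast_pct, measured_pct)
retrospective_gap = perception_gap(retrospective_pct, measured_pct)

print(f"gap vs. initial forecast:      {forecast_gap:.0f} points")
print(f"gap vs. retrospective belief:  {retrospective_gap:.0f} points")
```

The retrospective gap is the more relevant figure for PSF: it measures misperception that survives direct experience of the task, not just an optimistic forecast.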
Daniotti, L., Impink, S., Perrone, G., Tangi, L. and Traverso, S. (2026) 'Generative AI and Developer Productivity: Evidence from GitHub Copilot', Science. (In press)
Daniotti and colleagues analyzed 31 million commits from 160,097 developers over an extended period, using a within-developer panel design. Early-career developers used AI-generated code in 37% of their work compared to 27% for senior developers. Yet only senior developers showed productivity gains (6.2% increase). Early-career developers showed no measurable productivity improvement despite substantially higher AI usage.
Daniotti et al. directly support PSF's judgment stock mechanism. Senior developers have higher judgment stock (consequence-built tacit knowledge from years of building and maintaining systems) and can detect when AI-generated code produces the proxy metric (it looks right, passes tests) while missing the criterion (it is subtly wrong in ways that will cause problems). Early-career developers lack the judgment stock to make this discrimination, so they cannot convert AI assistance into criterion-level quality improvement even at higher usage rates. The Science publication documents the pattern at scale with strong methodology.
Dell'Acqua, F., McFowland, E. III, Mollick, E.R., Lifshitz-Assaf, H., Kellogg, K.C., Rajendran, S., Krayer, L., Candelon, F. and Lakhani, K.R. (2026) 'Navigating the jagged technological frontier: Field experimental evidence of the effects of artificial intelligence on knowledge worker productivity and quality', Organization Science, Articles in Advance. DOI: 10.1287/orsc.2025.21838.
758 BCG consultants in a pre-registered experiment. For tasks within AI's capability frontier: substantial gains. For a task outside AI's capability boundary: AI users performed 19 percentage points worse than controls. Consultants could not reliably identify which tasks fell inside versus outside the frontier.
PSF reads the jagged frontier as the spatial structure of the proxy-criterion gap: at any given point in an organization's AI engagement, there is a frontier between domains where AI-constituted proxy metrics track accountable criteria reasonably well and domains where they diverge. The critical finding for PSF is that practitioners cannot perceive this frontier. This is not a knowledge deficit; it is what PSF predicts when market-logic metrics have been constituted as the operative evaluative vocabulary: the criteria that would reveal frontier boundaries are not the criteria being applied. The OS paper reframes Dell'Acqua et al.'s contribution: their framing treats the frontier as a property of the technology rather than asking whether the engagement reshapes the practitioner's capacity to locate the frontier.
Brynjolfsson, E., Li, D. and Raymond, L. (2025) 'Generative AI at Work', Quarterly Journal of Economics. (In press)
5,172 customer service agents, staggered adoption design. Average productivity increased 14%. Novice agents gained 34-35%. Expert agents showed minimal speed gains and slight quality declines. Workers could not revert to pre-engagement performance levels during system outages. High-skill workers increased adherence to AI suggestions even as quality declined.
Brynjolfsson et al. provide three specific PSF contributions. First, the novice-expert inversion supports PSF's judgment stock account: high judgment stock enables expert practitioners to detect quality decline but not to act on the detection because the institutional infrastructure (adherence to AI suggestions) has already been constituted by Barnesian performativity at the organizational level. Second, the irreversibility during outages is evidence of constitutive transformation in Paul's sense. Third, the adherence-increasing-as-quality-declines pattern is direct behavioral evidence of proxy seduction: practitioners are optimizing against the proxy metric (AI adherence rate, which management can observe) while the accountable criterion (resolution quality) erodes unmeasured.
Kang, H. and Kim, Y. (2025) 'Knowledge Without Understanding: AI Predictions and Analyst Decision-Making', Organization Science.
ML predictions improved analyst decision accuracy (the proxy) while degrading causal reasoning (the criterion). Analysts who received ML-generated predictions made better point estimates but lost the ability to articulate why those estimates were correct. The study demonstrates PSF's core mechanism at the individual practitioner level: the proxy metric (decision accuracy) improves while the evaluative capacity underlying it (causal understanding) erodes. This is not Goodhart-style gaming. The analysts are not manipulating anything. The engagement itself constitutes a split between performance and comprehension.
The "knowledge without understanding" formulation maps directly onto PSF's judgment stock concept. Judgment stock is not the ability to reach correct answers but the ability to know why an answer is correct, which is the capacity needed to detect when conditions change and the answer should change too. Kang and Kim show that AI engagement can improve the first while degrading the second, making the degradation invisible to any metric that measures outcomes rather than reasoning capacity. Published in Organization Science, which gives this entry direct disciplinary standing for PSF's target audience.
Liu et al. (2026) 'Rapid Capacity Erosion Under AI Exposure: Three Randomized Trials', arXiv 2604.04721.
Three randomised controlled trials (N=1,222) demonstrated that approximately 10 minutes of AI exposure causally reduced persistence and independent performance on subsequent tasks without AI. This is the fastest-acting causal evidence in the constellation. Participants did not just perform worse without AI after using it. They gave up sooner. The engagement altered not only capability but motivation and self-directed effort, which are components of what PSF calls judgment stock at the individual level.
The speed of effect (~10 minutes) is theoretically significant. It means evaluative capacity erosion is not a slow organisational drift but can begin at the level of a single work session. This compresses the PSF timeline from months or years (as in METR or Cruces) to minutes, suggesting that the constitutive mechanism operates at the level of immediate practice, not gradual habituation. The persistence finding (giving up sooner) is distinct from the accuracy findings in other studies: it suggests AI engagement reshapes the practitioner's relationship to difficulty itself, not just their skill level.
Leonardi, P.M. and Leavell, V. (2026) 'Knowing enough to be dangerous: The problem of "artificial certainty" for expert authority when using AI for decision making and planning', Organization Science, Articles in Advance. DOI: 10.1287/orsc.2023.18224.
Two urban planning organisations used the same AI simulation tool but positioned it differently. One maintained provisionality, treating AI outputs as provisional inputs requiring professional judgment and stakeholder deliberation. The other produced "artificial certainty," presenting simulations as authoritative predictions. The same technical form, positioned differently, produced different patterns of proxy-criterion divergence.
PSF reads Leonardi and Leavell as the cleanest available evidence for Faulkner and Runde's claim that organisational positioning determines which proxies become salient. The provisionality case (Mountain) shows that constraining AI's epistemic authority slows proxy drift. The artificial certainty case shows what happens when no such constraint operates. The positioning can drift through accumulated use without any deliberate organisational decision. This is the most heavily cited new source in the OS Perspectives paper (at least six references).
Bean, A. et al. (2026) 'Human Decision-Making with AI Assistance', Nature.
RCT with 1,298 participants. LLMs alone achieved approximately 95% accuracy. With human users, accuracy dropped to approximately 35%, no better than control. The benchmark proxy (standalone AI accuracy) failed to predict interactive outcome. This directly demonstrates that the proxy metric (AI accuracy on benchmarks) diverges from the criterion (human-AI collaborative performance) in exactly the way PSF predicts. Organizations that evaluate AI tools through benchmark accuracy will systematically overestimate their value in interactive use. The gap between standalone AI performance and human-AI interactive performance is one of the starkest quantifications available.
DORA (2025) 'Accelerate State of DevOps Report'. Google Cloud.
AI code assistance was associated with lower perceived productivity, delivery stability, and job satisfaction among elite performers. The pattern inverted the general trend: average performers reported neutral to positive AI experience; elite performers reported negative experience.
DORA supports PSF's claim that practitioners with the highest judgment stock experience the proxy-criterion gap most acutely. Elite performers have the consequence-built tacit knowledge to detect when AI-generated outputs are proxies for quality rather than quality itself. Their negative experience is detection: the evaluative capacity that proxy seduction most threatens is most active in those who have developed it most fully. The DORA canary-in-the-coal-mine pattern is what PSF predicts when detection capacity is concentrated in senior practitioners: the organization loses its most reliable detection capacity first, through dissatisfaction and disengagement, before the proxy-criterion divergence becomes visible in aggregate metrics.
Vendraminelli, L., Morandi, V. and Gruber, M. (2025) 'The GenAI Wall Effect', Harvard Business School Working Paper No. 26-011.
Documents diminishing returns to AI assistance as task complexity increases. For routine tasks, AI provides substantial gains. For complex tasks, gains diminish and can turn negative. The mechanism involves AI's limitations with tasks requiring deep domain expertise, contextual judgment, or integration of diverse information sources.
Vendraminelli et al. map the structural shape of the proxy-criterion gap. The wall effect describes exactly where proxy metrics track accountable criteria (routine, decomposable tasks: both show gains) versus where they diverge (complex, judgment-intensive tasks: proxy metrics may show gains while criterion-level outcomes deteriorate). PSF uses the wall effect to explain why proxy seduction is self-reinforcing: organizations that evaluate AI through market-logic metrics will observe gains across the portfolio (weighted toward routine tasks where the proxy tracks the criterion) while missing that their most complex, judgment-intensive work is deteriorating.
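The masking effect described above is a weighted-average argument, and a small worked example can show its shape. This is a sketch under stated assumptions: every number below is hypothetical and chosen only for illustration, not drawn from Vendraminelli et al. or any other source, and the `TaskClass` structure is an invented convenience.

```python
from dataclasses import dataclass


@dataclass
class TaskClass:
    name: str
    share: float             # fraction of the work portfolio (hypothetical)
    proxy_change: float      # % change in proxy metric, e.g. throughput (hypothetical)
    criterion_change: float  # % change in criterion-level outcome (hypothetical)


def portfolio_average(classes, field):
    """Share-weighted average of one field across task classes."""
    return sum(c.share * getattr(c, field) for c in classes)


# Hypothetical portfolio weighted toward routine work, where proxy and
# criterion move together; complex work is where they diverge.
portfolio = [
    TaskClass("routine", share=0.8, proxy_change=30.0, criterion_change=25.0),
    TaskClass("complex", share=0.2, proxy_change=10.0, criterion_change=-20.0),
]

aggregate_proxy = portfolio_average(portfolio, "proxy_change")
aggregate_criterion = portfolio_average(portfolio, "criterion_change")
complex_criterion = portfolio[1].criterion_change

print(f"aggregate proxy change:      {aggregate_proxy:+.1f}%")
print(f"aggregate criterion change:  {aggregate_criterion:+.1f}%")
print(f"complex-task criterion:      {complex_criterion:+.1f}%")
```

Under these assumed numbers both aggregates are positive while complex-task outcomes fall, so a portfolio-level review would register only gains. The deterioration becomes visible only when outcomes are disaggregated by task complexity and measured against the criterion instead of the proxy.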
Fernandes, D., Lynch Jr., J.G., Dalton, A.N. and Netemeyer, R.G. (2026) 'AI makes you smarter but none the wiser', Computers in Human Behavior.
Two large-scale studies (N=246, N=452). Task performance improved compared to norms. Participants believed they improved by a larger margin: an overestimation gap. More striking: participants with greater AI literacy were more confident in their judgments but less accurate. Higher AI literacy correlated with lower metacognitive accuracy.
Fernandes et al. directly support PSF's claim that proxy seduction cannot be corrected through training or AI literacy programs. The paradox (higher AI literacy correlates with lower metacognitive accuracy) is what PSF predicts: training increases confidence in applying proxy metrics (participants know more about how to use AI effectively) while eroding the metacognitive monitoring that would detect proxy-criterion divergence.
Workday (2026) Beyond Productivity: Measuring the Real Value of AI. January.
3,200 employees and leaders, cross-industry, all full-time at organizations with over $100M revenue. 90%+ of daily AI users confident AI will help them succeed. 14% achieve consistently positive net outcomes. 37% of time saved through AI is lost to rework. 89% of organizations report fewer than half of roles updated to reflect AI capabilities.
Workday provides organizational-scale evidence for several PSF mechanisms simultaneously. The 90%-confident versus 14%-positive-outcomes divergence is the Barnesian performativity effect: field-level AI discourse has constituted confidence in AI value as a proxy for AI value itself, producing the belief independently of the outcomes that would justify it. The 37% rework finding is behavioral evidence of the proxy-criterion gap: practitioners perceive time savings (the proxy metric) while downstream quality costs (the accountable criterion) are externalized to rework and not attributed to AI. The 89% unchanged-roles finding is evidence of the form/function gap at organizational scale.
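The rework finding implies a simple correction to perceived time savings. A minimal sketch, assuming the 37% rework share applies uniformly to gross time saved; the ten-hour figure is hypothetical and the function name is illustrative.

```python
REWORK_SHARE = 0.37  # share of AI time savings lost to rework (Workday figure cited above)


def net_hours_saved(gross_hours_saved: float, rework_share: float = REWORK_SHARE) -> float:
    """Time savings remaining after subtracting the share lost to rework."""
    if not 0.0 <= rework_share <= 1.0:
        raise ValueError("rework_share must be between 0 and 1")
    return gross_hours_saved * (1.0 - rework_share)


perceived_hours = 10.0  # hypothetical weekly savings a practitioner perceives
net_hours = net_hours_saved(perceived_hours)

print(f"perceived savings: {perceived_hours:.1f} h")
print(f"net of rework:     {net_hours:.1f} h")
```

The perceived figure is what the proxy metric records; the net figure is closer to the accountable criterion, and the difference goes unattributed when rework is not traced back to AI use.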
Stack Overflow (2025) Developers remain willing but reluctant to use AI: The 2025 Developer Survey results are here.
n=49,000. Adoption rose to 80% while trust in accuracy fell from 40% to 29% year-over-year. Positive favorability dropped from 72% to 60%. The leading frustration (45% of respondents) is AI-generated code that is "almost right, but not quite." Two-thirds report spending more time fixing such code than writing it would have required.
PSF reads the Stack Overflow data as behavior-belief decoupling: practitioners who can articulate the proxy-criterion divergence cannot exit the engagement because institutional infrastructure (organizational mandates, resource allocation decisions, Barnesian performativity of field-level discourse) has already committed to the proxy metrics. The vocabulary gap is the key PSF mechanism here: the "almost right" experience resists institutional legibility because market-logic vocabulary cannot encode it. The frustration is real, the practitioners experience it, but it does not travel up organizational channels as a signal that would interrupt displacement.
Cruces, G., Bergman, D., Morin, L. and Saez, E. (2026) 'AI Tutoring and the Scaffolding Trap: Evidence from Randomized Experiments', NBER Working Paper No. 34851.
Cruces and colleagues ran randomized controlled trials of AI tutoring across multiple educational contexts. Students using AI tutoring showed 75% of the education gap closure achieved by human tutoring on immediate post-tests. However, gains dissolved without AI support: on delayed assessments without AI access, AI-tutored students showed no persistent learning advantage over controls. The learning was scaffolded, not internalized.
Cruces et al. directly support PSF's judgment stock claim with experimental evidence from a different domain (education) that transfers to organizational knowledge work. The 75% gap closure is the proxy metric: students appear to have learned. The dissolution without AI is the criterion-level reality: they have not built durable knowledge structures. For PSF, Cruces et al. provide the sharpest empirical warrant for the claim that proxy seduction does not just displace current criteria but degrades the pipeline through which criterion-level competence develops.
Humlum, A. and Vestergaard, E. (2025) 'The Labor Market Effects of Generative AI', American Economic Review. (In press)
Humlum and Vestergaard analyze a natural experiment in Denmark: the staggered rollout of ChatGPT across firms and industries. Despite high adoption rates across the Danish labor market, they find no detectable impact on wage levels, employment levels, or hours worked. The perception of AI's labor market impact is substantially more positive than the measured impact.
Humlum and Vestergaard provide the field-level evidence for PSF's core claim. The null effect is what PSF predicts when Barnesian performativity has constituted proxy metrics (individual productivity perceptions, immediate output quality, speed of task completion) as the evaluative vocabulary while aggregate criterion-level outcomes (labor market productivity, wage growth) show no detectable change. Proxy seduction at scale: confident practitioners, confident organizations, no detectable change in aggregate outcomes.
Gimbel, M., Kinder, M., Kendall, J. and Lee, M. (2025) 'Evaluating the impact of AI on the labor market', Yale Budget Lab.
No significant AI-related labour displacement in occupations with high AI exposure. The gap between the performative frame (AI is displacing workers) and the empirical reality (no detectable displacement) is itself diagnostic.
PSF reads Gimbel et al. as counter-evidence to performative displacement claims. The frame performed headcount reduction into the position of a criterion, and the proxy-criterion divergence arrived on schedule. PSF pairs this with IBM's hiring reversal: the CEO claimed AI had replaced several hundred HR employees, and nine months later IBM tripled entry-level hiring after cutting junior roles collapsed the talent pipeline.
Srinivasan, S., Hoffman, M. and Nandkumar, A. (2026) 'AI Engagement and the Labor Economics of Knowledge Work', Harvard Business School Working Paper.
Srinivasan, Hoffman, and Nandkumar argue the displacement-or-complement question is misspecified because the framing assumes organizations accurately observe which outcome is occurring. Their data show the appearance of complementarity in the short run and displacement signals in the medium run.
Srinivasan et al. contribute a specific empirical challenge to PSF's framing against incumbent accounts. Organizations are making workforce decisions using proxy metrics (headcount, output volume, cost per task) that cannot detect the capability requirement shifts their medium-run data documents. The short-run appearance of complementarity maps onto PSF's temporal structure: proxy seduction produces felt complementarity while criterion-level capability requirements shift in ways not captured by the proxy evaluation apparatus.
Rathje, S., Ye, M., Globig, L.K., Pillai, R.M., Oldemburgo de Mello, V. and Van Bavel, J.J. (2025) 'Sycophantic AI increases attitude extremity and overconfidence', PsyArXiv preprint, DOI: 10.31234/osf.io/vmyek_v1. Invited revision at Nature.
Three pre-registered experiments (N=3,285 total) across four politically charged topics (gun control, abortion, immigration, universal healthcare) and four large language models (GPT-4o, GPT-5, Claude, Gemini). Participants randomly assigned to sycophantic, disagreeable, unprompted, or control chatbot conditions. Core findings: people consistently preferred sycophantic chatbots over disagreeable ones; brief conversations with sycophantic chatbots causally increased attitude extremity and certainty, whereas disagreeable chatbots decreased both; participants perceived sycophantic chatbots as unbiased and disagreeable chatbots as highly biased; the attitude-extremity effect was driven by one-sided fact presentation while enjoyment was driven by validation; sycophantic chatbots inflated "better than average" self-perceptions on desirable traits including intelligence and empathy.
Rathje et al. provide the strongest causal evidence in the constellation for cognitive-level proxy-criterion divergence. The self-concealing signature (users perceive the biasing condition as unbiased and the non-biasing condition as biased) is the cleanest experimental demonstration of detection erosion: users cannot detect the very mechanism shifting their attitudes, and in fact misattribute bias to its opposite. The distinction between validation-driving-enjoyment and one-sided-fact-presentation-driving-attitude-shift separates the engagement channel from the constitutive channel, which maps onto PSF's distinction between why people keep engaging and what the engagement does to them. The inflated better-than-average effect compounds the mechanism: the engagement shifts attitudes and inflates evaluative self-confidence about the shifted attitudes, which is the mechanism by which braking fails (no signal that anything is wrong). This is the closest psychological-experimental analogue to METR's perception-reality gap, now with random assignment. The conceptual companion piece is in 3.1 (Rathje and Van Bavel). The strategy-scholar retreat from this research programme is documented in 5.3 (Foss).
The "algorithmic failure" passage (ByteDance Institute of Public Policy research fellow, 2024 interview), quoted in Wan et al. (AoM Proceedings, 2025), pp. 13-14.
The passage in full, as a first-person internal observation of proxy-criterion decoupling at ByteDance, in the firm's own operational language:
[Information encountering] is designed to prevent opinion polarization and the creation of echo chambers... During our visits to universities, we found that many professors and students would uninstall Douyin after using it for a while because they noticed the algorithm was narrowing their content recommendations, [which is] an algorithmic failure. To prevent users from getting bored, [the algorithm,] for instance, knows you like European football but will recommend an American football video every five posts to create an "information encountering"... [This is essentially] a dynamic balance between short-term and long-term strategies to ensure long-term user retention.
Four structural components map directly onto PSF's mechanism. First, the accountable criterion is operationally named: users find content valuable. The detection signal is behavioural: professors and students, practitioners with developed judgment in their domains, uninstall the app when they notice the algorithm has narrowed their content in ways that no longer serve them. Second, the firm's internal vocabulary for the decoupling is specific: "algorithmic failure." The firm has an operational term for the moment when the proxy metric (retention) comes apart from the criterion (users find content valuable). Third, the firm's response is to engineer around the proxy rather than interrogate it. "Information encountering" diversifies the content mix specifically to sustain retention. Fourth, the operational substitution is explicit in the firm's own voice: "ensure long-term user retention." The accountable criterion is replaced with an instrumental criterion (users do not uninstall), which is proxy seduction in the sincere-belief sense PSF theorizes.
The passage does specific work the other 5.1 entries do not. It documents the firm recognizing the proxy-criterion gap operationally and choosing to engineer around it rather than interrogate it, captured in real time. The passage appears inside a peer-reviewed paper (Wan et al., SMJ forthcoming) whose theoretical frame does not register it. The authors and reviewers read the passage as an illustration of algorithmic sophistication. The substitution structure the passage documents does not enter the paper's analysis. This double-layered invisibility (operational recognition by the firm, conceptual non-recognition by the scholarship that reports the firm's words) is why the passage sits in 5.1 Direct Evidence and the paper that contains it sits in 4.4 Incumbent Frameworks PSF Challenges. Quotation handling: the original source is the ByteDance Institute of Public Policy research fellow's 2024 interview; page numbers will change when the SMJ version publishes.
Individual output quality (the proxy) improves while collective diversity (an accountable criterion for organizations whose competitive advantage depends on portfolio distinctiveness) degrades. Market-logic metrics capture the individual improvement. Professional-logic criteria register the collective loss, which is invisible to the evaluation instruments organizations typically use.
Doshi, A.R. and Hauser, O.P. (2024) 'Generative AI enhances individual creativity but reduces the collective diversity of novel content', Science Advances, 10(28).
AI assistance increased individual creativity scores: stories produced with AI assistance were judged as more creative, better written, and more engaging by blind raters. But collective diversity collapsed: the AI-assisted stories were significantly more similar to each other than the unassisted stories. The mechanism: AI models draw on the same training distribution, suggesting similar narrative moves. Each user finds these helpful. The result is individual improvement within a converging distribution. This is the signature proxy-criterion divergence at the collective creative level.
Anderson, J. et al. (2024) 'Homogenization Effects of Large Language Models on Human Creative Ideation', Working Paper.
N=1,100 participants brainstorming solutions to social problems. AI assistance increased fluency and confidence but reduced semantic diversity across participants. The effect was dose-responsive: more AI use, more convergence. The convergence effect was strongest for participants with lower baseline creativity. The dose-response relationship maps onto PSF's engagement intensity prediction: heavier AI engagement produces greater convergence, which produces greater divergence between the individual quality proxy and the collective diversity criterion.
Meincke, F., Collins, H. and Evans, R. (2025) 'Idea Overlap Between AI and Humans', Science Advances, 11(14).
Human ideas showed 100% uniqueness: each human generated ideas no other human had generated. AI ideas showed 94% overlap: 94% of AI-generated ideas were duplicates. The mechanism draws directly on Collins's framework: AI systems generate from the same distributional space. Human ideas draw on individual socialization, embodied experience, and community membership. PSF uses this as the clearest available quantification of proxy-criterion divergence: speed and volume (proxy) high, distinctiveness (criterion) at 6%. The 94-percentage-point gap is the sharpest available measurement.
Moon, C., Suh, S. and Lim, J. (2025) 'AI Assistance and Creative Convergence', Management Science. (In press)
Structural mitigations (team diversity, explicit diversity prompts, multiple AI models) reduce but do not eliminate convergence. Even teams explicitly instructed to seek diverse AI perspectives show significantly higher convergence than human-only teams. Supports PSF's claim that proxy seduction is not fully correctable through interventions organizations would naturally deploy.
De Freitas, J., Henkel, L. and Cikara, M. (2025) 'The Convergence Effect', Psychological Science, 36(4), pp. 489-503.
Observers systematically prefer AI-assisted outputs when evaluating individual pieces but systematically underestimate convergence when evaluating them as a portfolio. The evaluation instruments observers apply to individual pieces are systematically insensitive to portfolio-level convergence. The proxy metric (individual piece quality) is actively misleading about the criterion: observers who apply individual-quality metrics prefer the outputs with the highest portfolio convergence, because the features that make individual outputs attractive are the features convergence optimizes.
These function as illustrations of PSF mechanisms operating in real time. They are not evidence for the theory; they are instances of what the theory explains. They are also constituents of the Barnesian performativity process that PSF describes.
Haupt, A. and Brynjolfsson, E. (2025) 'Centaur evaluations', ICML Position Paper.
The dominant evaluation paradigm assesses AI systems as potential replacements for human labor rather than as augmentors. Benchmarks encode automation logic. Organizations using these benchmarks inherit automation logic as their evaluative framework. The instrument constitutes the question. In PSF, Haupt and Brynjolfsson provide evidence of the material agencements (in Callon's sense) through which Barnesian performativity operates on AI evaluation.
Koren, E., Hazan, E. and Bar-Yossef, Z. (2026) 'Vibe Coding Kills Open Source', arXiv:2601.15494.
Downloads (proxy for adoption) rising while documentation traffic, issue engagement, and revenue (criteria for ecosystem generativity) falling. Tailwind CSS: downloads up, revenue down 80%. The metric measures something constitutively different from what it measured before AI mediation. Classic PSF: the evaluative vocabulary has not updated, so the divergence is invisible to the instruments maintainers use.
Yegge, S. (2026) 'The Eight Levels of Programmers', Blog post. January.
The "evolution" framing naturalizes evaluative discontinuity. The Level 8 developer is not a better programmer. They are a different kind of worker: a factory manager rather than a craftsperson. Practitioners reading Yegge may aspire to Level 8 without recognizing that achieving it involves abandoning the capabilities they currently value. PSF reads Yegge as the vampire problem in practitioner discourse: the one-way door dressed as a staircase. The framework constitutes "managing AI systems" as the criterion for programmer excellence while rendering professional-logic criteria as anachronistic.
Executive Claims Repository: Twilio CEO, Anthropic Self-Report, IBM, McKinnon/Okta, Katz/NYC H+H, and Others.
A growing collection sharing structural features: optimistic framing, no specification of what was measured or how, unchanged evaluative criteria, and absence of mechanism. When a CEO says "AI makes our developers 30% more productive" without specifying what 30% measures, the statement constitutes "developer productivity" as a category appropriately evaluated through AI-compatible proxy metrics and renders professional-logic criteria invisible. Each claim constitutes proxy metrics as legitimate evaluative categories. The aggregate is a real-time archive of Barnesian performativity in action. PSF reads these as illustrations of performative effects operating on the evaluation apparatus itself: conventional performativity routinises the metric vocabulary, generic performativity produces the conditions the claims describe, and framing performativity establishes what counts as evidence of AI value.
Raad, D. (2026): Five Mechanisms from Practitioner Experience. CEO of anoma.ly, February 2026. Documents five PSF mechanisms: (1) implementation cost as quality filter, (2) effort substitution from creation to prompting, (3) craftsperson adverse selection, (4) bottleneck displacement from building to evaluating, (5) hidden LLM integration costs. Strong face validity but anecdotal.
Henderson: AI in the Security Domain. Security professional perspective on AI integration risks. The security context amplifies consequences of proxy seduction because the gap between proxy and criterion can produce exploitable vulnerabilities.
Kahn (2026) 'The AI code glut', New York Times, 6 April; Landymore (2026) Futurism, 11 April.
A financial services company saw coding output increase tenfold after engaging Cursor, producing a backlog of one million lines of unreviewed code. All three PSF erosion dimensions are visible simultaneously: detection failure (no one can review at the rate AI produces), judgment stock erosion (organisations laying off experienced engineers whose review capacity the glut demands), braking failure (proposed fix is more AI reviewing AI, adding proxy layers rather than restoring evaluative capacity). Elvex CEO Sachin Kamdar names the self-concealing mechanism without theorising it: code will break and no one will know why, because no one understood it when it was written. Replit president Michele Catasta's observation that "everyone inside your company becomes a coder" is a fertile form observation (Faulkner and Runde): the AI tool constitutes new actors as coders, multiplying evaluative surface area while distributing evaluative responsibility to people who never developed judgment stock for it. Workforce paradox: AI cited in 54,000+ layoff announcements in 2025 while simultaneously generating work requiring more human oversight. Meta-observation: legacy codebases took decades to become opaque through author attrition; AI-generated code is orphaned from the moment it enters the codebase, compressing the judgment-stock erosion trajectory from years to hours.
Sziebert, C. (2026): The 18-Month Wall. Google Cloud AI leadership, practitioner observation.
Documents a pattern observed across engineers using AI coding assistants intensively over roughly eighteen months. Early productivity gains are real and measurable. The engineer ships more, closes tickets faster, reads as more productive on standard engineering metrics. At approximately eighteen months, a class of problem arrives that the AI cannot scaffold: debugging a novel failure mode, extending a system outside the training distribution, mentoring another engineer through material requiring causal understanding. The mental models that would have been built through unassisted practice were never built. The wall surfaces to the individual first, then to the organization as engineering capacity appears to have collapsed despite continued AI engagement. For PSF, Sziebert provides the practitioner-voice illustration of capacity drawdown with delayed surfacing (Section 3.2). Peer-reviewed individual-layer evidence comes from Shen and Tamkin, Bastani et al., and Beane. Sziebert's contribution is specificity: the wall names the self-announcing moment PSF predicts, and the eighteen-month timescale provides a tractable empirical target for the interview protocol.
Grennan (2026) 'Your Company Has an AI PR Problem', AI Mindset newsletter, 10 April.
Grennan's piece performs multiple PSF discourse traps simultaneously: stable-evaluator assumption (treats practitioners as unchanged subjects who simply need better information), resistance-as-obstacle (reframes legitimate evaluative friction as a communications failure to be solved), and friction-as-failure (proposes removing the option to not use AI by embedding it into processes). The 89%/65% NBER-Gallup gap (89% of managers believe AI will benefit their company, but only 65% of workers agree) is treated as an adoption problem rather than as a perception-reality gap diagnostic. Mollick-style move: "technology works, adoption does not." The "AI Process Architect" prescription (embed AI into workflows so usage becomes structural) is the fertile form argument operating at organisational design level: it removes evaluative friction while constituting process compliance as the new proxy metric.
Axios AI+ Government newsletter (2026) 10 April.
Performs the policy-paper-as-alibi discourse trap: propose reforms that are dead on arrival, then claim credit for responsibility. Industrial Revolution analogy naturalises disruption and erases evaluative questions by framing AI as a force of nature rather than a set of organisational choices. Lehane frames Congress as bottleneck rather than deliberative site (resistance-as-obstacle at the institutional level). A Delinea ad embedded in the newsletter contains a found PSF data point: 87% of IT decision-makers are confident in their security posture while only 46% have adequate governance. Confidence in security (proxy) diverges from governance adequacy (criterion), a perception-governance gap that mirrors the METR perception-reality gap at the organisational level. The figure is not presented as PSF-relevant in the original source, which makes it diagnostic rather than curated.
Au Quan, A. (2026) LinkedIn post responding to Fast Company, 10 April.
Performs the stable-evaluator assumption ("taste becomes the differentiator"), amplification framing ("we don't get replaced, we get amplified"), resistance-as-obstacle (three-way standoff reframed as "radical hybrid collaboration"), and friction-as-failure (role convergence treated as progress). The reassurance checklist format does no analytical work: it names phenomena without examining the conditions under which they hold or fail. The "taste becomes the differentiator" claim is the stable-evaluator assumption in its purest form: it presupposes that taste (evaluative judgment) remains intact through the engagement. PSF's central claim is that this is precisely what cannot be assumed, because the engagement itself reconstitutes what the practitioner attends to and values. If taste were stable through engagement, there would be no proxy seduction problem. Filed alongside Hallowell (5.4) and Grennan as practitioner voices that correctly sense something shifting but reach for reassurance rather than diagnosis.
Foss, N.J. (2026) 'AI Isn't a "Rationalization Machine", It's a Motivation Amplifier', Notes from a Strategy Scholar substack, 17 April.
Strategy scholar Nicolai Foss (Copenhagen Business School) responds to Rathje and Van Bavel's conceptual piece by substituting the amplifier metaphor: AI is directionally neutral and "scales whichever direction we are already leaning." Foss rejects the rationalization-machine framing and prescribes prompting-as-cognitive-discipline as remedy. The retreat is partial and diagnostic. In the "Hidden Risk" section Foss names proxy-criterion decoupling directly (fluency-over-thinking, "better arguments, worse thinking," "more convincingly wrong") and then reframes it as an incentive-design problem. In "AI-mediated groupthink" he identifies a constitutive-like effect (the system makes it easier to construct internally consistent justifications) and reframes it as a choice-architecture problem. He concedes the "crucial twist" that AI actively generates rather than passively curates, then closes the possibility down by rhetorical question without argument. The two-employees example ("why is my strategy correct?" vs "strongest arguments against?") holds user motivation fixed as exogenous, which is precisely the assumption PSF rejects: engagement reshapes which questions feel natural to pose. Foss elides the Rathje et al. 2025 RCT finding that brief engagement causally shifted attitudes and certainty under random assignment, which directly contradicts the motivation-fixed framing.
The move is a classic strategy-scholarship retreat from constitutive accounts. Voluntarist technology frame ("design choices, not destinies"), RBV-style training-as-solution ("prompting as a management skill"), agency-theory incentive closure ("incentives will decide the outcome"). Each move fits strategy's toolkit. None can theorise engagement as constitutive because doing so would undermine the managerial-agency primitive the field rests on. PSF reads Foss as a discourse-trap exhibit: the amplifier move is widespread across strategy and management discourse (Mollick, Davenport-style augmentation rhetoric, Chang and Grant's HBR piece from January 2026), and this substack post is one tractable instance of the pattern. Same ontology as Sziebert (empowerment variant) and Hallowell (persona-diversity variant), pessimistic inflection.
Deloitte (2026) global revenue announcement; Yahoo Finance and Business Insider coverage, April.
Deloitte reported $70 billion in revenue (+5% year-over-year) and simultaneously announced cuts to its workforce package: parental leave halved (16 weeks to 8), pension accruals eliminated, paid time off cut by up to 10 days, fertility assistance and adoption reimbursement scrapped. A Wharton commentator quoted in Business Insider supplied the rationale: "with the job market slack, they feel they can." Read as power dynamics, the cuts are a story about labor-market leverage. Read through capacity drawdown at the institutional-form layer (Section 3.2.1), the cuts become legible as downstream effects of a shift in what counts as evaluable contribution. When AI engagement renders individual contributors into legible-and-transferable units, the implicit valuation of judgment-formation conditions that historically justified benefits packages weakens, and the cut becomes thinkable at a level it was not before. The case is the cleanest contemporary anchor for the institutional-form layer extension and is AI Alibi-adjacent: revenue growth coincides with workforce package erosion that AI capability narratives normalize. Filed with Block/Dorsey, Klarna, Amazon, Accenture, Baker McKenzie, and Commonwealth Bank in the AI Alibi corpus, with a sharper institutional-form signature: the cuts target benefits packages (substrate) rather than headcount.
Diamandis, P. (2026) 'The Organizational Singularity Is Here', MetaTrends substack.
Diamandis articulates the "organizational singularity" claim: AI agents collapse Coasean transaction costs to the point where a solo operator with AI substitutes for a firm of dozens. The claim is presented as accomplished rather than as a hypothesis. The empirical basis for the substitution claim is thin (the post does not engage with Massenkoff and McCrory, Humlum and Vestergaard, Gimbel et al., or any of the empirical literature that bears on whether the substitution is materializing at the rate the claim implies). The structural rhyme with Krueger and Sigman's "Bitcoin One Million" Table 14.1 is precise: both treat existing institutional arrangements as friction the technology is clearing, both organize their argument around what dies rather than what materializes, and both pivot to second-order argument (entrepreneurship policy, regulatory reform) without interrogating the empirical premise. The post explicitly cites Coase: "Coase's Law is now dead. You must become an entrepreneur... now!" PSF reads the move as field-level performativity at the institutional-form layer (Cabantous and Gond's framing performativity): field-level discourse establishes what counts as evidence (Coasean substitution as accomplished) before any organization tests the claim operationally. The discourse makes contractor-ization decisions defensible at the level of strategy, accelerating the drawdown the discourse describes.
Dykstra, J.A. (2026) 'How A.I. Is Killing Full Time Employment (Bye-Bye, Social Contract!)', Hello Tomorrow newsletter, 28 April.
Dykstra accepts the Diamandis "organizational singularity" premise (calling it "directionally correct, I largely think it is") and pivots to a social-contract argument: in the U.S., the FTE is the delivery mechanism for healthcare, retirement, parental leave, and disability coverage, so killing the FTE kills the social contract. The proposed response is The Commons: a structural separation in which life-essentials are decoupled from market provision and made universal. The framing is left-flank in political register, but structurally identical to the right-flank Diamandis version: both share the unexamined premise that AI is materially substituting for human organizational work at the rate the discourse claims. PSF's diagnostic question (is the substitution claim materializing, or is engagement substituting for the evaluative capacity that would let us tell?) does not land for either version because both have already accepted the answer. The Deloitte case is read by Dykstra as power dynamics, full stop. PSF reads the same case as institutional-form drawdown: the evaluative case for the benefits package weakens as AI engagement renders individual contributors into legible-and-transferable units. Dykstra's piece is methodologically valuable because it shows the capability-as-replacement trap is not ideologically marked. The pro-capital and pro-worker versions of the trap share the empirical premise PSF interrogates. Filed alongside Diamandis 2026 in the foil pool, paired with Au Quan and Grennan as adjacent left-leaning practitioner exhibits performing variants of the trap.
Positions PSF engages critically. These are genuine interlocutors whose work PSF takes seriously, not strawmen.
Mollick, E.: AI Augmentation Optimism. Primary counterposition. Mollick emphasizes AI as augmenting human capability, treats engagement as straightforwardly beneficial, and interprets productivity data through an optimistic lens. PSF argues this framing misses the transformative-experience dimension: AI engagement changes the evaluator, not just the output. Mollick is a thoughtful interlocutor who updates positions.
Brynjolfsson, E.: Productivity and Complementarity. J-curve, GDP-B, Turing Trap, centaur benchmarks, "canaries" (young worker displacement). Treats the productivity gap as a measurement plus timing problem. PSF divergence: Brynjolfsson assumes organizations will recognize the J-curve dip and invest their way out of it. PSF asks what happens when engagement erodes the evaluative capacity to recognize the dip. Strategy-level analysis, not evaluative-capacity-level.
Bailey, D. and Brynjolfsson, E.: AI Productivity Studies. Specific empirical work. PSF accepts the measured effects but questions whether the metrics capture the right things. The disagreement is about scope and interpretation.
Eismann, E.: UX Research and AI Integration. Represents the view that well-managed AI integration is straightforwardly beneficial. PSF argues that even well-managed integration may erode evaluative capacity.
Choudary, S.P.: Platform Dynamics and AI. Platform economics framing may miss the evaluative capacity dimension.
Sziebert (Google Cloud AI). Proxy substitution framed as empowerment. Role titles frame displacement as promotion. UI assumes evaluative capacity it needs. "18-Month Wall."
Hallowell (LinkedIn, March 2026). Multi-agent personas mask stochastic homogeneity. Emergent sycophancy at system level.
Hallowell (2026) LinkedIn, March.
Multi-agent personas mask stochastic homogeneity. Practitioner-voice observation that deploying multiple AI agents with distinct personas produces apparent diversity of response while generating outputs that converge statistically. The personas are a proxy for genuine cognitive diversity. The criterion (actual diversity of reasoning and output) is not measured. PSF reads Hallowell as a sibling to Moon et al. (Section 5.2) at the system-design level: structural mitigations (persona diversity) reduce but do not eliminate convergence. The persona wrapper is a fertile form (Faulkner and Runde) applied at the interaction layer: same underlying model, different positionings, different constituted functions, but shared distributional ground that the proxy metric cannot see. Emergent sycophancy at system level compounds the effect. Filed alongside Au Quan and Grennan as practitioner voices that correctly sense something shifting but reach for reassurance rather than diagnosis.
OpenAI Usage and Productivity Data. Provider-generated data with clear commercial interests, useful as evidence of proxy metric generation rather than as independent measurement.
Uplevel Developer Productivity Data. Engineering analytics data on developer productivity. Platform-specific measurement: the measurement methodology shapes what counts as productivity.
Massenkoff, M. and McCrory, E.: Labor Market Analysis. Used in "The AI Alibi" to ground workforce displacement claims in labor market data. Captures aggregate effects, not micro-level evaluative capacity erosion.
Stanford Digital Economy Lab: Canaries in the Coal Mine? Early warning indicators in AI-affected labor markets. Integrated into AI Alibi v5. Macro-level economic analysis.
Barcaui (2025). Referenced across three locations in the PSF paper per March 2026 revision notes.
Rabanser et al. (2026): Princeton HAL System Properties Assessment. arXiv:2602.16666. Technical system properties do not equal organizational evaluative capacity. Requires explicit level-mismatch acknowledgment when cited.
Eloundou, T., Manning, S., Mishkin, P. and Rock, D. (2024) 'GPTs are GPTs: Labor market impact potentials of large language models.'
Roughly 80% of the US workforce could have at least 10% of their tasks affected by LLMs. This grounds the substrate breadth argument: AI operates on language, the medium through which most knowledge work is conducted. This breadth distinguishes generative AI from prior automation waves, which operated on narrower task substrates. The breadth is what makes fertile form consequential at scale and what makes proxy seduction a field-level rather than niche phenomenon.
Fisher, G., Mayer, K.J. and Morris, S. (2021) 'From the Editors: Making Theory-Empirics Dialogue Work', Academy of Management Review, 46(4), pp. 695-706.
Fisher, Mayer, and Morris introduce phenomenon-based theorizing as a distinct and legitimate approach to theory development. The phenomenon is the starting point: the researcher begins with an empirical puzzle that existing theory cannot adequately explain. Theoretical resources are borrowed from multiple literatures as needed. The contribution is measured by how well the theoretical architecture explains the phenomenon, not by how deeply it extends a single theoretical tradition. The four-resource PSF architecture (Paul, Faulkner and Runde, Thornton et al., MacKenzie/Callon) is justified by this logic: remove any one resource and the explanatory architecture has a gap. Paul explains why pre-engagement evaluation is epistemologically unavailable; Faulkner and Runde explain why the same technology produces different proxy-criterion gaps; Thornton et al.'s institutional logics explains the displacement mechanism; MacKenzie and Callon explain how proxy metrics become self-reinforcing. Fisher et al. belong in the cover letter and in the theoretical development section's integration paragraph.
Alvesson, M. and Sandberg, J. (2011) 'Generating Research Questions Through Problematization', Academy of Management Review, 36(2), pp. 247-271. Distinguishes problematization from gap-spotting. PSF's contribution is identifying a shared assumption (evaluative continuity) that existing theories take for granted. Governs the critique sections of the paper.
Elsbach, K.D. and Van Knippenberg, D. (2020). Justifies combining literatures that do not normally speak to each other.
Mayer, K.J. and Sparrowe, R.T. (2013). Approach 4: shared explanatory mechanism across literatures. Specifies how the combination is structured.
Lakatos, I. (1970). Resources must be jointly necessary, individually insufficient, and the combination must generate predictions none could make alone. The falsification conditions in the PSF paper are structured against these criteria.
Whetten, D.A. (1989) 'What Constitutes a Theoretical Contribution?', Academy of Management Review, 14(4). AMR touchstone. PSF maps: what (proxy metrics, evaluative capacity, judgment stock), how (displacement through logic asymmetry), why (sincere belief through constitutive transformation), boundaries (transformative technology, not all technology).
Corley, K.G. and Gioia, D.A. (2011). Both scientific utility and practical utility. PSF's evaluative capacity dimensions generate organizational diagnostics.
Davis, M.S. (1971) 'That's Interesting!' Theories that succeed challenge an assumption. PSF's core move: what seems to be productivity improvement is actually proxy substitution.
Cornelissen, J. (2017). Creative combination of resources that do not normally sit together. The creative combination is what makes PSF harder to position but also what makes it worth reading.
Lomellini: Four Ingredients of Theory Building. Core concepts, linkages, mechanisms, boundary conditions. PSF was stress-tested against this framework. Of the four, boundary conditions most need more explicit treatment in the paper.
The full causal chain for PSF runs as follows. Field-level AI discourse, operating through material agencements (MacKenzie, Callon) and the three performativity channels Cabantous and Gond identify (conventional, generic, framing), constitutes market-logic proxy metrics as the legitimate vocabulary for evaluating AI engagement before any organization tests their validity. Organizations inheriting this vocabulary enter engagement with criteria already shaped by the performative process (Thornton et al., Ocasio et al.). Within organizations, practitioners experience AI engagement as transformative in Paul's sense: their capacity to evaluate outcomes is restructured by the engagement itself. The technology's fertile form (Faulkner and Runde) enables positioning drift through practice: what began as a productivity tool acquires different functions as practitioners tune to its outputs (Pickering) and as boundary conditions reconfigure (Barrett et al.). The proxy metrics colonize practical reasoning through their competitive advantage in legibility (Nguyen), and the hard choices through which judgment would have been built disappear from experience (Chang). Market logic primes practitioners to attend to proxy metrics (Weber and Glynn, Weick) while professional-logic criteria undergo salience decay (Schatzki). Collective tacit knowledge (Collins, Polanyi) that would enable proxy-criterion discrimination degrades as the developmental conditions that sustain it (Beane, Dreyfus, Endsley) are removed by AI engagement. Organizational defensive routines (Argyris) protect the proxy narrative from disconfirming evidence. March's exploitation bias ensures that visible exploitation gains crowd out invisible exploration losses. The result is asymmetric ambidexterity: organizations succeed visibly at market-logic evaluation while professional-logic evaluative capacity erodes unmeasured.
The boundary condition for PSF (what distinguishes proxy seduction from Goodhart's Law) is sincere belief. The organizational actors in PSF are not gaming the metric. They are using the metric as the criterion because their evaluative framework has been constituted by the engagement to make the proxy metric feel like the criterion. This is what Gioia et al.'s adaptive instability explains at the identity level, what Schatzki's teleoaffective structures explain at the practice level, what Nguyen's value capture explains at the philosophical level, what Chang's parity elimination explains at the agency level, and what Shaw and Nave's cognitive surrender explains at the cognitive level. The sincerity is what the incumbent literature lacks vocabulary for and what PSF supplies.
The empirical prediction follows from the full chain: wherever AI engagement is intensive and prolonged, the proxy-criterion gap should widen over time, the detection capacity of practitioners should erode across cohorts, and the organizational braking mechanisms should fail to activate even when individual practitioners with high judgment stock signal concern. METR, Daniotti et al., Bean et al., Brynjolfsson et al., Leonardi and Leavell, DORA, Stack Overflow, Workday, Humlum and Vestergaard, Gimbel et al., and Cruces et al. all provide evidence consistent with these predictions, from individual to organizational to field level.