Proxy Seduction Framework

The Gap Has Structure: Empirical Evidence + Paper

The first part presents an empirical analysis of the Anthropic Economic Index data. The second part contains the full paper text. Citations in the evidence sections point to the corresponding paper sections.

1. The question

Massenkoff and McCrory (2026) report that Computer & Mathematical occupations have 96% theoretical AI feasibility but only 32% observed usage. The standard interpretation treats this 64-point gap as headroom: tasks that AI could do but hasn't yet, due to diffusion lag, tooling gaps, or awareness barriers. If this were true, the gap should distribute roughly randomly across task types.

The Proxy Seduction Framework predicts something different: the gap should have structure. Specifically, tasks involving accountability, oversight, evaluative judgment, and trust relationships should systematically resist conversion, not because they are technically infeasible, but because they carry constitutive properties (consequences, verification requirements, accountability chains) that cannot be proxied by the metrics AI engagement makes attractive (speed, volume, certainty). The resistance should track the dimensions PSF identifies, not the dimensions a diffusion story would predict.

2. Method

We use Anthropic's published task-level data: observed Claude usage percentages for each of 19,530 O*NET task statements (from task_pct_v2.csv in the Anthropic Economic Index, HuggingFace CC-BY). We restrict analysis to the 10 occupational categories with theoretical AI feasibility of 68% or higher (from Eloundou et al. 2023, as reported in Massenkoff & McCrory 2026), yielding 9,694 tasks. We classify tasks into four types using keyword matching on O*NET task statements (a code sketch of this step follows the category definitions):

PSF-Resistant (1,450 tasks)
Tasks containing: authorize, certify, approve, license, compliance, regulatory, oversee, supervise, audit, inspect, monitor compliance/safety/quality, enforce, investigate, counsel, advise client/patient, mediate, negotiate, diagnose, prescribe, treatment plan. These map to consequence exposure, braking, and detection dimensions.
Information Processing (1,238 tasks)
Tasks containing: write, draft, prepare report/document/presentation, analyze data, compile, code, program, develop software, create content/document, translate, summarize, transcribe, edit, proofread, format, calculate. These are tasks where output mimicry makes proxy-criterion divergence hardest to detect.
Physical (904 tasks)
Tasks containing: operate equipment/machine, install, repair, construct, assemble, weld, transport, deliver, prepare/serve food, administer medication/injection. Control category: material braking enforces consequences the proxy cannot substitute for.
Other / Unclassified (6,102 tasks)
Tasks not matching any of the above keyword sets. Serves as baseline comparison.
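A minimal sketch of the classification step, assuming the CSV exposes columns named task, soc_major_group, and pct for the task statement, occupational category, and observed usage share (the actual column names in the AEI release may differ), and with the keyword lists abridged from the definitions above:

```python
# Sketch of the keyword classification. Column names ("task",
# "soc_major_group", "pct") are assumptions; the AEI release may differ.
# Crude substring matching mirrors the stated limitation of the method.
import pandas as pd

KEYWORDS = {
    # First matching category wins, so PSF-resistant is checked first.
    "psf_resistant": [
        "authorize", "certify", "approve", "license", "compliance",
        "oversee", "supervise", "audit", "inspect", "enforce",
        "investigate", "mediate", "negotiate", "diagnose", "prescribe",
    ],
    "info_processing": [
        "write", "draft", "analyze data", "compile", "program",
        "develop software", "translate", "summarize", "transcribe",
        "edit", "proofread", "calculate",
    ],
    "physical": [
        "operate equipment", "install", "repair", "construct",
        "assemble", "weld", "transport", "deliver",
    ],
}

def classify(statement: str) -> str:
    """Return the first matching category; unmatched tasks fall to 'other'."""
    s = statement.lower()
    for label, words in KEYWORDS.items():
        if any(w in s for w in words):
            return label
    return "other"

tasks = pd.read_csv("task_pct_v2.csv")
# Restrict to the 10 occupational categories at or above 68% theoretical
# feasibility (the set reported in Massenkoff & McCrory 2026; listed
# partially here for illustration).
HIGH_THEORETICAL = {
    "Computer and Mathematical", "Legal", "Management",
    # ... remaining high-theoretical categories
}
tasks = tasks[tasks["soc_major_group"].isin(HIGH_THEORETICAL)]
tasks["task_type"] = tasks["task"].map(classify)
```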

Limitation: keyword classification is transparent and reproducible but imprecise. A validation sample with expert coding would strengthen the finding. Even so, the systematic pattern across all categories is difficult to attribute to classification noise alone.
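One way such a validation could look: hand-code a random sample of task statements and measure agreement with the keyword labels. A sketch only; the expert_label column is hypothetical and would come from independent human coders applying the same four-category scheme:

```python
# Hypothetical validation of the keyword labels against expert hand-codes.
from sklearn.metrics import cohen_kappa_score

sample = tasks.sample(n=300, random_state=0)
# "expert_label" does not exist in the dataset; it would be produced by
# independent coders blind to the keyword-derived labels.
kappa = cohen_kappa_score(sample["task_type"], sample["expert_label"])
print(f"Cohen's kappa (keyword vs expert): {kappa:.2f}")
```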

3. Headline finding

The conversion rate (share of tasks showing any observed Claude usage) differs dramatically by task type. Information-processing tasks convert at 46.4%. PSF-resistant tasks convert at 19.4%. Physical tasks convert at 19.9%. PSF-resistant tasks perform statistically identically to physical tasks, despite having no physical barrier. This is consistent with the framework's claim that the barrier is constitutive, not temporal.

Task type               Observed usage rate
Info Processing         46.4%
Other / Unclassified    36.4%
PSF-Resistant           19.4%
Physical                19.9%

Observed AI usage rate (% of tasks with any Claude usage) by task type. High-theoretical categories only (≥68%).
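Under the same column-name assumptions as the classification sketch, the headline rates and the PSF-resistant versus physical comparison reduce to a few lines. The z-test counts below are back-computed from the reported rates and category sizes, so this is an illustration rather than a reanalysis:

```python
# Conversion rate: share of tasks in each type with any observed usage.
conv = tasks.assign(used=tasks["pct"] > 0).groupby("task_type")["used"].mean()
print(conv.round(3))  # expected: info_processing ~0.464, psf_resistant ~0.194

# Two-proportion z-test, PSF-resistant vs physical. Counts are derived from
# the reported figures: 19.4% of 1,450 tasks and 19.9% of 904 tasks.
from statsmodels.stats.proportion import proportions_ztest

z, p = proportions_ztest(count=[281, 180], nobs=[1450, 904])
print(f"z = {z:.2f}, p = {p:.2f}")  # p is large; the two rates are indistinguishable
```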

4. The pattern holds across every category

The gap is not an artifact of composition. In every single high-theoretical category, information-processing tasks convert at a higher rate than PSF-resistant tasks. The consistency mirrors the framework's claim about fertile form: AI's domain generality means proxy elevation operates across organizational functions simultaneously, but the constitutive barriers hold everywhere too.

Category                Gap       PSF-resistant   Information-processing
Education & Library     42.0pp    22.1%           64.1%
Legal                   41.1pp    21.4%           62.5%
Office & Admin          32.7pp    17.3%           50.0%
Computer & Math         26.9pp    37.9%           64.8%
Management              21.0pp    13.1%           34.1%
Arts & Media            20.8pp    27.5%           48.3%
Life & Social Sci.      19.8pp    21.2%           41.0%
Business & Finance      19.5pp    21.0%           40.5%
Sales                   14.2pp    23.7%           37.9%
Arch. & Engineering     14.0pp    15.5%           29.5%

Conversion rates by occupational category, sorted by gap (descending).
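The same classified frame reproduces this per-category breakdown. A sketch, again under the column-name assumptions stated in the method section:

```python
# Conversion rate by occupational category and task type, with the
# info-processing minus PSF-resistant gap in percentage points.
pivot = (tasks.assign(used=tasks["pct"] > 0)
              .pivot_table(index="soc_major_group", columns="task_type",
                           values="used", aggfunc="mean"))
pivot["gap_pp"] = 100 * (pivot["info_processing"] - pivot["psf_resistant"])
print(pivot.sort_values("gap_pp", ascending=False).round(3))
```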

5. What the gap looks like at the task level

The unused PSF-resistant tasks below are not technically harder for an LLM. Claude could draft an audit plan or generate a compliance review. The barrier is that these tasks carry accountability weight that cannot be proxied: consequences flow from authorization, verification structures require human judgment to remain legible, and trust relationships depend on the evaluator remaining a recognizable accountable agent. This is the divergence in kind the paper identifies: no amount of refining the proxy brings it closer to the criterion, because the two are not on the same dimension.

ZERO USAGE  PSF-resistant tasks in high-theoretical categories

Life, Physical, and Social Science / Epidemiologists
Plan and direct studies to investigate human or animal disease, preventive methods, and treatments for disease.
Office and Administrative Support / Interviewers, Except Eligibility and Loan
Supervise or train other staff members.
Management / Biomass Power Plant Managers
Supervise biomass plant or substation operations, maintenance, repair, or testing activities.
Management / Hydroelectric Production Managers
Supervise or monitor hydroelectric facility operations to ensure that generation or mechanical equipment conform to applicable regulations or standards.
Sales and Related / First-Line Supervisors of Retail Sales Workers
Direct and supervise employees engaged in sales, inventory-taking, reconciling cash receipts, or in performing services for customers.
Business and Financial Operations / Regulatory Affairs Specialists
Direct the collection and preparation of laboratory samples as requested by regulatory agencies.
Architecture and Engineering / Industrial Engineering Technologists
Oversee or inspect production processes.
Management / Investment Fund Managers
Prepare for and respond to regulatory inquiries.
Business and Financial Operations / Claims Examiners, Property and Casualty Insurance
Supervise claims adjusters to ensure that adjusters have followed proper methods.
Business and Financial Operations / Government Property Inspectors and Investigators
Coordinate with or assist law enforcement agencies in matters of mutual concern.
Management / Public Relations and Fundraising Managers
Assign, supervise, and review the activities of public relations staff.
Management / Property, Real Estate, and Community Association Managers
Negotiate short- and long-term loans to finance construction and ownership of structures.
Educational Instruction and Library / History Teachers, Postsecondary
Supervise undergraduate or graduate teaching, internship, and research work.
Management / Compliance Managers
Monitor compliance systems to ensure their effectiveness.
Computer and Mathematical / Database Administrators
Select and enter codes to monitor database performance and to create production database.

HIGH USAGE  Information-processing tasks in the same categories

Computer and Mathematical / Computer Programmers
Correct errors by making appropriate changes and rechecking the program to ensure that the desired results are produced.
2.4551%
Computer and Mathematical / Computer Programmers
Write, update, and maintain computer programs or software packages to handle specific jobs such as tracking inventory, storing or retrieving data, or controlling other equipment.
2.3979%
Computer and Mathematical / Computer Programmers
Write, analyze, review, and rewrite programs, using workflow chart and diagram, and applying knowledge of computer capabilities, subject matter, and symbolic logic.
2.1901%
Arts, Design, Entertainment, Sports, and Media / Actors
Write original or adapted material for dramas, comedies, puppet shows, narration, or other performances.
0.8925%
Educational Instruction and Library / Archivists
Select and edit documents for publication and display, applying knowledge of subject, literary expression, and presentation techniques.
0.7806%
Arts, Design, Entertainment, Sports, and Media / Editors
Prepare, rewrite and edit copy to improve readability, or supervise others who do this work.
0.7392%
Computer and Mathematical / Computer Systems Analysts
Review and analyze computer printouts and performance indicators to locate code problems, and correct errors by correcting codes.
0.6461%
Computer and Mathematical / Software Quality Assurance Engineers and Testers
Perform initial debugging procedures by reviewing configuration files, logs, or code pieces to determine breakdown source.
0.646%
Computer and Mathematical / Web Developers
Write, design, or edit Web page content, or direct others producing content.
0.5628%
Computer and Mathematical / Web Developers
Maintain understanding of current Web technologies or programming practices through continuing education, reading, or participation in professional conferences, workshops, or groups.
0.4349%

6. What this means for PSF

The central finding: Information-processing tasks convert theoretical feasibility into observed usage at 2.4x the rate of accountability/judgment/trust tasks (46.4% vs 19.4%). PSF-resistant tasks convert at virtually the same rate as physical tasks (19.4% vs 19.9%), despite having no physical barrier. This pattern holds across every high-theoretical category.

The diffusion interpretation ("the red will grow to cover the blue") treats the gap as temporal. This analysis suggests it is constitutive. The tasks that remain unused are not waiting for better models or more tooling. They are tasks where the very properties that make AI attractive (speed, scale, consistency) are orthogonal to the evaluative criteria that matter (accountability, professional judgment, trust, verification). Scaling does not close this gap. It widens it, by making the proxy metrics more attractive while leaving the constitutive barriers intact.

The pattern also has a methodological implication for how we read the Anthropic chart. Overlaying theoretical and observed as if one is a subset of the other encodes the temporal interpretation visually. Presenting them as independent variables, or breaking them down by task type as this analysis does, reveals the structural interpretation the overlay conceals. The evaluative continuity assumption is embedded in the chart format itself.

ABSTRACT

Why do experienced practitioners misjudge the effects of AI on their work? Across independent studies, developers perceive speedups where objective measurement shows the opposite, experts increase adherence to AI recommendations as output quality declines, and individual creative outputs improve while collectively homogenizing in ways no individual creator can detect. Existing theory cannot explain why the gap is perceptual rather than strategic, why expertise protects selectively rather than generally, and why field-level discourse amplifies rather than corrects the pattern. Conventional metric governance assumes practitioners can identify a proxy as a proxy, but proxy seduction operates through a sincere belief that no governance tool is designed to catch. This paper introduces the Proxy Seduction Framework, a mechanism-based account of how AI engagement constitutes attractive proxy metrics (speed, volume, certainty) that systematically displace the criteria organizations are accountable to, while eroding the judgment that would detect the substitution. Integrating institutional logics, transformative experience, the form/function distinction, and performativity, the framework specifies three dimensions of evaluative capacity (detection, judgment stock, braking) and traces their erosion across practitioner, organizational, and field levels. The framework is diagnostic: It identifies what to check, not what will happen.

INTRODUCTION

Every major framework designed to understand how organizations engage with transformative technologies (technologies where the engagement itself changes the criteria by which outcomes are judged) assumes the organization remains a competent judge of what the engagement produces. The institutional logics perspective explains how competing logics restructure what counts as success (Thornton, Ocasio, & Lounsbury, 2012) but does not ask whether the technology itself produces the confirming evidence that tips organizations toward one logic over another. This paper’s primary contribution is to specify the mechanism whereby AI engagement produces the confirming evidence that shifts what organizations treat as success, while eroding the evaluative resources that would catch the shift. Two adjacent literatures share the same assumption, and this paper’s claims bear on both. Sociomateriality recognizes that tools, and the organizations using them, shape one another through ordinary work practice (Orlikowski, 2007), but it assumes practitioners can evaluate how the technology has changed their practice using criteria the technology has not already changed. The dynamic capabilities framework makes the dependence on evaluative capacity explicit, because sensing, seizing, and reconfiguring each require a specific evaluative function (Teece, 2007), but assumes the sensing apparatus survives reconfiguration.

The failure is empirically visible at every level of analysis under AI engagement. At the individual level, a randomized controlled trial of 16 experienced open-source developers found that participants predicted a 24% speedup from AI assistance, perceived a 20% speedup after working, and were objectively measured at 19% slower, while code quality held equal across conditions (METR, 2025b). At the organizational level, a study of 5,172 customer service agents found their average productivity rose by 15% with AI assistance, but expert agents showed negligible improvement and measurable quality decline while adhering more to AI-generated recommendations over time (Brynjolfsson, Li, & Raymond, 2025). At the population level, Danish administrative data across 11 AI-exposed occupations found that self-reported time savings averaged 2.8% of work hours, yet administrative records showed null effects on earnings and hours, ruling out changes above 2%, even among daily users and early adopters reporting substantial benefits (Humlum & Vestergaard, 2025). The gap between perceived and measured outcomes is not random noise. It is patterned (perception consistently exceeds measurement), persistent (it holds across skill levels and domains), and resistant to expertise (experienced practitioners show it on dimensions the engagement newly constitutes, even where their domain judgment remains intact).

The evaluative continuity assumption survives not because scholars have overlooked the evidence, but because each framework explains something without reaching the mechanism that produces the pattern. Within the institutional logics perspective itself, the gap is specific: The framework explains how competing logics restructure evaluative criteria, but not how a technology can constitute the confirming evidence that selects between logics while remaining invisible to the organization undergoing the selection. Goodhart’s Law (Chrystal & Mizen, 2003) explains metric corruption through strategic gaming, but the METR developers had no incentive to overstate gains, and the 39-point gap persisted through sincere belief, not optimization. Measurement-lag explanations hold that general-purpose technologies require organizational reorganization before gains materialize (Brynjolfsson, Rock, & Syverson, 2017), but these accounts presume the evaluating organization remains stable enough to recognize when gains have materialized, which the evidence above contradicts. Transformative experience theory (Paul, 2014) explains why individual epistemic capacities shift through certain kinds of experience, but not how those shifts aggregate into organizational and field-level dynamics. Each framework illuminates a dimension of the problem. None specifies the mechanism through which the technology constitutes new observables, the evaluator is transformed through the engagement, institutional feedback loops reinforce proxy optimization, and field-level discourse establishes the evaluative frame before any organization has tested it operationally. The evaluative continuity assumption persists because no account has specified why, mechanistically, it should fail.

Through phenomenon-based theorizing (Fisher, Mayer, & Morris, 2021), this paper introduces the Proxy Seduction Framework (PSF), a mechanism-based account of how AI engagement constitutes attractive proxy metrics (speed, volume, apparent certainty) that systematically displace the criteria that organizations are accountable to, while eroding the judgment that would detect the substitution. The mechanism operates through sincere belief, not strategic gaming, which distinguishes it from Goodhart’s Law and makes it resistant to conventional metric governance. The framework is diagnostic rather than predictive: It identifies what to check, not what will happen in a given case, so that organizations can catch proxy seduction while they still have the evaluative capacity to act on what they find.

Throughout this paper, AI refers to LLM-based AI agents: systems that generate text, code, analysis, and other symbolic outputs in response to natural language prompts, and increasingly agentic systems that take actions in organizational environments. Three properties of the technology make proxy seduction especially acute. AI produces outputs that look like competent human work, making it difficult for practitioners to distinguish proxy-satisfying outputs from criterion-satisfying ones. AI operates on language, the medium through which organizations evaluate and communicate about work. And AI capabilities change faster than practitioners can form stable judgments about what the technology does.

The framework integrates four established theoretical resources (transformative experience, form/function distinction, institutional logics, and performativity) as sequentially linked theoretical injections (Fisher, Mayer, & Morris, 2021). Performativity (MacKenzie, 2006) constitutes which criteria are legitimate before engagement begins, form/function distinction explains why the technology can sustain those criteria, transformative experience explains how the evaluator is changed through engagement, and the institutional logics perspective explains how the shift is locked in. The contribution is the integration architecture, not the individual resources. The evidence base is a curated empirical constellation of studies that collectively test the framework’s components. No single study tests the full mechanism. Each case maps against specific claims, and the constellation establishes that the pattern is robust across domains, that modulating variables operate in the predicted direction, and that the framework’s falsification conditions remain open to empirical testing.

The paper proceeds as follows. The next section develops the evaluative continuity assumption and shows why it fails under AI engagement. A brief section establishes why AI is the revealing case. The core mechanism is then specified, with each theoretical resource introduced at the point where it does its analytical work. The empirical constellation tests the mechanism’s components. Falsification conditions identify where the framework could be wrong. The discussion addresses what the framework explains, what it does not, and what follows for practice, research, and policy.

THE EVALUATIVE CONTINUITY ASSUMPTION

Practitioner and academic frameworks share a foundational assumption: The organization engaging with AI persists as the same evaluating subject post-engagement, so that pre-engagement evaluation criteria remain valid for post-engagement assessment. Leading consulting firms make this assumption explicit. BCG allocates 70% of effort to “business transformation” but measures it against metrics defined before engagement begins (BCG, 2024). Deloitte tracks “potential to performance” through milestones defined at the outset (Deloitte AI Institute, 2024). McKinsey identifies “AI high performers” through maturity assessments and self-reported outcomes, treating perception as evidence (McKinsey, 2025). Each framework acknowledges transformation while measuring it against pre-engagement criteria that capture what existing infrastructure makes legible.

Academic literatures reproduce the assumption differently but with the same structural consequence. Sociomateriality (Orlikowski, 2007) recognizes that social and material dimensions are constituted together in everyday organizing practice, and Orlikowski’s constitutive entanglement is foundational to this paper’s argument: If technology were a fixed tool applied to a stable organization, proxy–criterion divergence would be a calibration problem. But sociomateriality assumes practitioners can evaluate their own reconstitution using criteria unaffected by that reconstitution. This paper builds on that foundation and adds a specific claim: The entanglement under AI engagement produces systematic evaluative error because the technology constitutes new observables more legible than the criteria the organization is accountable to, and that legibility asymmetry drives evaluation toward the proxy.

The institutional logics perspective (Thornton, Ocasio, & Lounsbury, 2012) explains how competing logics shape the criteria by which organizations recognize value, and how shifts between logics restructure what counts as success. The framework does not ask whether the technology itself might constitute the confirming evidence that tips the organization toward one logic over another, making the shift invisible to the organization undergoing it. Teece’s (2007) dynamic capabilities framework makes the dependence on evaluative capacity explicit, because each phase of the sensing–seizing–reconfiguring sequence requires a specific evaluative function. Under AI engagement, proxy seduction enters at every phase: Sensing registers proxy-level signals as confirmation, seizing selects commitments using criteria the engagement itself constituted, and reconfiguring assesses on the same proxies. The framework assumes the sensing apparatus survives reconfiguration. Proxy seduction identifies why this is not the case.

Table 1: Academic literatures and their limitations

Sociomateriality (Orlikowski, 2007; Leonardi, 2011)
What it explains: How social and material dimensions are constitutively entangled. Technology and organizational practice co-constitute each other. Technology is not a fixed object applied to a passive organization but participates in constituting organizational reality, including the evaluation criteria, evaluators, and practices through which evaluation occurs.
Core assumption: Entanglement is observable and theorizable from an analytical position. The constitutive relationship can be traced by the researcher, and by implication recognized by the organization undergoing it.
What it cannot explain: Why constitutive entanglement produces systematic evaluative error rather than neutral co-constitution. Under AI engagement, the technology constitutes new observables (speed, volume, apparent certainty) that are more legible than the criteria the organization is accountable to. This legibility asymmetry drives evaluation toward the proxy. Sociomateriality describes the entanglement but does not specify why it produces patterned proxy–criterion divergence invisible from inside the engagement.

Institutional logics (Thornton et al., 2012)
What it explains: How competing logics (market, professional, bureaucratic) shape which aspects of organizational performance become legible and valuable. Explains how shifts between logics restructure what counts as success, and how field-level dynamics shape logic competition.
Core assumption: Organizations can recognize when a dominant logic obscures its own limitations. The technology’s role is to trigger or intensify competition between pre-existing logics rather than to constitute the evidence that drives logic selection.
What it cannot explain: The constitutive role of the technology in producing the metrics that drive logic selection. AI engagement constitutes the confirming experiences (speed gains, throughput improvements, stakeholder satisfaction) that reinforce market logic and starve professional logic of institutional resources. The technology does not merely trigger logic competition. It manufactures the operational evidence that resolves the competition in favor of the proxy, making the shift invisible to the organization undergoing it.

Dynamic capabilities (Teece, 2007)
What it explains: How organizations strategically reconfigure to capture value from new technologies through a sensing–seizing–reconfiguring sequence: Organizations detect opportunities, commit resources, and restructure. Each phase depends on evaluative capacity.
Core assumption: Evaluative capacity remains intact across the engagement process. Sensing assumes the organization can distinguish genuine signals from noise. Seizing assumes the criteria used to evaluate commitments track accountable outcomes. Reconfiguring assumes the organization can assess whether restructuring produced intended results.
What it cannot explain: Evaluator transformation through engagement. Under AI engagement, proxy seduction enters at every phase. Sensing is compromised when proxy-level signals displace criterion-level signals. Seizing is compromised when the evaluation criteria used to select commitments are themselves constituted by the engagement. Reconfiguring is compromised when assessment operates on engagement-produced proxies. The sensing capacity that dynamic capabilities assumes is precisely what proxy seduction erodes.

Note: Each literature contributes genuine explanatory power within its domain. The limitation is not that these literatures are wrong but that each makes an assumption that prevents it from explaining the specific pattern of sincere, expertise-resistant evaluative failure observed under AI engagement. The Proxy Seduction Framework integrates four theoretical resources (transformative experience, form/function distinction, institutional logics as extended, and performativity) to address these gaps.

The assumption of evaluative continuity also structures explanations from economics and policy. Measurement-lag explanations hold that general-purpose technologies require organizational reorganization before gains materialize (Brynjolfsson, Rock, & Syverson, 2017). Policy-oriented analyses locate the barrier in worker capacity (Manning & Aguirre, 2026). Both share an implicit promise that addressing the identified barrier will cause gains to materialize, and both assume the evaluating organization remains stable enough to recognize when they have.

The alternative is that AI engagement transforms the evaluating organization such that pre-engagement criteria become structurally unreliable for post-engagement assessment. The claim is not that organizations transform (organizations routinely transform) but that the engagement constitutes new observables that become the de facto basis of evaluation, while the judgment that would distinguish these observables from the criteria to which the organization is accountable erodes through the engagement itself. The evaluation increasingly operates on proxies whose divergence from accountable criteria is invisible from inside the engagement.

WHY AI IS THE REVEALING CASE

The evaluative continuity assumption could fail under any transformative technology, but AI is where the failure is sharpest. Three properties of the technology’s form explain why. A contrast with an earlier general-purpose technology clarifies the distinction.

Output Mimicry and Artificial Certainty

AI produces outputs that are (almost) indistinguishable from competent human work: code that compiles and follows style conventions, prose that uses appropriate register, statistical analyses that deploy the right tests with proper formatting. This output mimicry generates what Leonardi and Leavell (2026) term “artificial certainty” in a two-step sequence. The output carries the formal markers of earned competence (confidence, completeness, procedural correctness) without the epistemic process that would produce that competence. Cruces et al. (2026) demonstrate this experimentally: AI closed 75% of the education-based productivity gap during task execution, but the gap reappeared in full when AI assistance was removed, confirming that the competence was scaffolded rather than internalized. The proxy (output quality) rises. The criterion (underlying competence) does not. And the formal similarity between the two makes the divergence difficult to detect.

Substrate Breadth

Most information technologies constitute proxies within a bounded domain: sales metrics through a CRM, operational metrics through an ERP. AI operates on language, the medium through which most knowledge work is conducted, evaluated, and communicated, making its fertile form compatible with nearly any organizational function. Eloundou et al. (2024) estimate that 80% of the US workforce is exposed to LLMs across at least 10% of their tasks. The proxies AI elevates therefore appear across most organizational functions, and their pervasiveness makes them look like the right metrics rather than convenient substitutes.

Pace of Capability Change

Earlier transformative technologies stabilized long enough for practitioners to develop feedback loops connecting evaluation to consequences. According to recent estimates (METR, 2025a), AI capabilities double roughly every 7 months, meaning the technology changes what engagement involves before practitioners have formed stable judgment about what they were using: Model capabilities shift between versions, new tool integrations alter workflows, and the evaluation target moves while the evaluator is still forming judgment about where it was.

Technologies lacking these properties (high output mimicry, broad fertile form, rapid capability change) represent boundary conditions where the framework predicts attenuated or absent proxy seduction, a prediction addressed in the falsification section.

The Electricity Contrast

Brynjolfsson, Rock, and Syverson’s (2021) productivity J-curve builds on David’s (1990) account of electricity adoption to demonstrate that general-purpose technologies require organizational reorganization before gains materialize, assuming that the organization’s evaluative capacity survives the reorganization. And for electricity, it did: Electricity changed how goods were produced, not how financial performance was assessed. But AI generates a layer of intermediate metrics (tasks completed, code merged, content produced) that arrive immediately while financial outcomes lag. Organizations optimize against those proxies before the lagging indicators reveal whether they track what they are supposed to track.

Table 2: Theoretical resources and their contributions to the Proxy Seduction Framework

Paul (2014): Transformative experience
Level of analysis: Individual practitioner
What it explains: Engagement is epistemically and personally transformative. Rational anticipation of post-engagement experience is structurally unreliable.
What PSF adds: Transformation does not change accountable criteria but changes what practitioners attend to when assessing whether criteria are met. The gap between proxy and criterion is reliably non-obvious from inside the engagement.

Faulkner & Runde (2019): Digital object ontology
Level of analysis: Artifact and organizational context
What it explains: Form underdetermines function. The same digital object realizes different functions depending on social positioning.
What PSF adds: AI’s form is unusually fertile. Organizational positioning determines which metrics become salient and therefore which proxy–criterion gaps emerge. Different organizations engaging with the same technology produce different patterns of divergence.

Thornton et al. (2012); Weber & Glynn (2006); Gioia, Schultz, & Corley (2000): Institutional logics, typification, adaptive instability
Level of analysis: Organization
What it explains: Institutional logics constitute evaluation vocabulary. Typifications shape practitioner sensemaking. Organizations maintain stable identity labels while meaning shifts.
What PSF adds: Market-logic learning propagates through the typification loop. Professional logic is occluded from its institutional feedback pathway. Adaptive instability masks the migration as continuity.

MacKenzie (2006); Callon (1998): Barnesian performativity
Level of analysis: Field
What it explains: Economic models constitute the markets they claim to describe through active constitution of evaluative categories.
What PSF adds: Field-level discourse performs specific proxies (productivity, speed, headcount reduction) into legitimacy before organizations test whether those proxies track their criteria. The gap is difficult to discern and, where discerned, difficult to act on.

THE MECHANISM OF PROXY SEDUCTION

Output mimicry, substrate breadth, and capability pace combine to activate the mechanism. AI engagement transforms how organizations produce work and how they evaluate it, but the criteria used to assess AI-generated outputs are the same criteria developed before the engagement began. Pre-engagement, organizations evaluated the quality and relevance of their outputs against accountable criteria: the evaluative standards to which an organization bears responsibility, where failure produces consequences (legal liability, client loss, reputational damage, product failure, regulatory sanction) that cannot be absorbed by demonstrating performance on other dimensions. In practice, accountable criteria include robustness, defensibility, validity, and substantive quality of outputs, and business outcomes such as revenue growth and user growth. Properties like speed and volume of outputs, range of tasks completed, and artificial certainty were observable but functioned as subordinate proxies, because they were not reliably informative as indicators of business performance, or because production friction (the overhead of generating and capturing these metrics at scale) made them difficult to measure meaningfully, or both. AI engagement changes the status of these proxies by making them visibly responsive: They appear to move, they show progress, and they are legible. These proxies become de facto evaluation criteria because they are the most legible post-engagement indicators.

These engagement-elevated proxy metrics measure real properties, and organizations evaluating them are under no illusion about what they observe. But real is not the same as valuable. The proxies seduce because they are legible, responsive, and attractive: They show progress where the accountable criteria remain silent. The accountable criteria, in turn, get occluded: not the ultimate financial measures (which remain in place) but the intermediate evaluative connection between operational work and those measures, because the evidence organizations use to assess whether accountable criteria are still being met is produced by the engagement. That is proxy seduction: the process by which engagement-elevated proxies displace accountable criteria as the de facto basis of organizational evaluation, while the institutional infrastructure needed to sustain criterion-level assessment is occluded, driven not by cynical gaming but by the sincere and experience-grounded belief that constraining tradeoffs have dissolved.

The distinction from Goodhart’s Law is constitutive, not cosmetic. Goodhart’s Law (Chrystal & Mizen, 2003) describes what happens when people game a performance indicator: The indicator stops being informative because people optimize against it strategically, knowing it no longer tracks what it was designed to track. A Goodhart reading of declining quality alongside rising throughput would say that people found ways to hit the throughput target without doing the underlying work. Proxy seduction says people and organizations are not trying to game anything. The engagement made throughput the most legible indicator available, people followed AI recommendations because the output looked competent, and quality declined without anyone choosing to trade it away. If the problem is gaming, it is a human resources (HR) problem. If the problem is seduction, it is a systemic epistemic trap. Goodhart is also silent on where substitution will be most severe. Proxy seduction identifies knowledge and symbolic work (where outputs are representations and quality is tacit) as more vulnerable than domains where the physical world enforces feedback, as the following material-braking analysis suggests.

Figure 1: How AI engagement constitutes attractive proxy metrics that displace accountable criteria while eroding the judgment that would detect the substitution

[Figure 1: PSF mechanism diagram]

The Precondition: Divergence in Kind

Proxy seduction activates when engagement-elevated proxy metrics diverge from accountable criteria. The divergence is not one of degree but of kind: a different observable that the engagement makes more legible than the criterion it displaces. Speed of code generation is not a measure of code robustness. Resolution speed is not a measure of resolution quality. Individual output quality is not a measure of collective diversity. The certainty a simulation projects does not measure what the analysis leaves open. If the divergence were in degree, the proxy would be a noisier version of the criterion, and the fix would be straightforward: Refine the proxy, reduce the noise, and the gap closes. Calibration problems like these are well understood, and the existing metrics-governance literature addresses them. The framework makes a different claim. No amount of refining the proxy brings it closer to the criterion, because the two are not on the same dimension. The divergence is categorical: Speed and robustness, resolution time and resolution quality, individual output quality and collective diversity are different things, not different precisions of the same thing.

The “in kind” precondition has a further consequence: Better measurement makes the problem worse rather than better. Improving a proxy metric’s precision produces a more authoritative-looking measure of the wrong thing. A more precise speed metric is still not measuring robustness, but its precision confers authority.

What AI Engagement Constitutes: The Dual Transformation

Given that the precondition holds, AI engagement constitutes two things simultaneously. The first is proxy elevation. Speed, volume, and certainty become highly visible and legible. These proxy metrics show progress and respond to effort. Meanwhile, the criteria the organization is actually accountable to (whether the code is robust under stress, the contract holds under challenge, the analysis is valid under scrutiny) become harder to see. The accountable criteria have not disappeared. The proxy metrics are so much more legible that the accountable criteria recede into the background.

Different organizations experience different patterns of proxy elevation from the same technology, and the framework must account for this variation. Faulkner and Runde’s (2019) ontology of digital objects provides the theoretical grounding. They establish that the “identity and system functions” of technologies flow from their “social positioning” in the communities that use them, not from intrinsic properties of the artifact. The same digital object can be positioned for different purposes in different organizational contexts. Form underdetermines function: The technical artifact constrains but does not determine which functions are realized. Leonardi (2011) specifies the temporal dimension of this underdetermination: Technologies simultaneously afford and constrain, and which affordances are realized depends on the routines practitioners bring to the engagement. As routines shift through repeated use, different affordances become salient, and the technology’s organizational function changes without any deliberate decision to reposition it. AI’s technical form is what this paper characterizes as unusually fertile: It can be positioned for a wider range of system functions than most prior technologies. For example, two urban planning organizations using the same AI simulation tool positioned it differently, constituting different functions (Leonardi & Leavell, 2026). One maintained provisionality, treating AI simulation outputs as provisional inputs to planning decisions that still required professional judgment and stakeholder deliberation. The other produced what Leonardi and Leavell term “artificial certainty,” presenting simulations as authoritative predictions. The same technical form, positioned differently, produced different patterns of divergence between what the organizations measured and what they were accountable for. This contextual variation has a direct consequence for evaluation: Organizations calibrate their assessment of a technology against the function they believe it performs, and when positioning shifts through accumulated use, the function shifts with it, while the organization’s evaluation criteria remain calibrated to the original function. The positioning can drift through accumulated use without any deliberate organizational decision. Practitioners habituate to new workflows, metrics consolidate around observable outputs, and early successes pull the organization toward deeper engagement.

The second thing AI engagement constitutes is evaluator transformation. The practitioner assessing work after deep AI engagement is not the same evaluator who set the original criteria. Paul’s (2014) account of transformative experience provides the cognitive foundation. Some experiences are epistemically transformative: They teach the person “something she could not have learned without having that kind of experience” (Paul, 2014, p. 11). Some are personally transformative: They reshape the person’s “preferences, desires, and even [their] own sense of who [they are]” (Paul, 2014). Transformative experiences are both. If AI engagement is transformative in Paul’s sense, the practitioner who set pre-engagement criteria is not the practitioner evaluating outcomes post-engagement. And the evaluative continuity assumption (that the pre-engagement and post-engagement evaluator can be treated as the same practitioner applying the same judgment) fails at its foundation.

Anthropic’s internal study of 132 engineers illustrates the structure (Anthropic, 2025). Before deep AI engagement, these engineers evaluated their work through craft-oriented criteria inferred from the study’s qualitative interviews: depth of technical understanding, quality of architectural reasoning, rigor of peer review. Over 12 months, self-reported Claude usage rose from 28% to 59% of daily work, and self-reported productivity gains rose from 20% to 50%. At the same time engineers reported concerns about skill atrophy and described what the study called a “paradox of supervision”: Using AI effectively requires the very skills that may erode through sustained AI use. The practitioners had not abandoned their original evaluative criteria. Engineers articulated the tension between their original criteria and the metrics (speed, volume, breadth) that increasingly structured their work, yet usage and perceived productivity continued to rise. The transformation does not change the criteria the organization is accountable to, but it changes what practitioners attend to when assessing whether those criteria are met.

Table 3: Properties of LLM-based AI agents that intensify proxy seduction

Output mimicry → artificial certainty
Artifactual dimension (technology form): AI produces outputs whose form is indistinguishable from competent human work: code that compiles and follows conventions, prose with appropriate register, analyses with proper statistical formatting.
Relational dimension (organizational consequence): Practitioners read form as a signal of quality (a learned heuristic reliable in pre-engagement practice). Output mimicry generates unearned confidence that the output satisfies the criterion. The inference was reliable before engagement and breaks after it.
Illustrative evidence: Cruces et al. (2026): AI closed 75% of the education-based productivity gap during task execution, but the gap reappeared in full when AI was removed. Leonardi & Leavell (2026): Ocean stakeholders inferred epistemic warrant from simulation output form.

Fertile form generality → proxy surface area
Artifactual dimension (technology form): LLM-based AI agents operate on language, the medium through which most knowledge work is conducted, evaluated, and communicated. Compatible with nearly any organizational function involving symbolic work.
Relational dimension (organizational consequence): Proxy–criterion divergence opens across multiple organizational functions simultaneously (code, contracts, analysis, strategy, documentation, creative work), overwhelming single-domain detection and judgment capacity.
Illustrative evidence: Humlum & Vestergaard (2025): Adoption widespread across 11 AI-exposed occupations. Workday (2026): 85% of employees across functions report time savings. Koren, Békés, Hinz, & Lohmann (2026): Ecosystem-level decoupling across open-source projects.

Rapid capability change → judgment non-consolidation
Artifactual dimension (technology form): Model capabilities shift between versions. New tool integrations alter workflows. The engagement surface changes faster than earlier transformative technologies.
Relational dimension (organizational consequence): Practitioners cannot form settled feedback loops connecting evaluation to consequences because the evaluation target moves before judgment consolidates.
Illustrative evidence: METR (2025b): 5 or more years of codebase experience insufficient to consolidate workflow-level speed judgment under AI engagement, even as code-level quality judgment held.

Substrate overlap
Artifactual dimension (technology form): For knowledge work, the production substrate (language, analysis, judgment) and the evaluation substrate are the same. The technology enters the medium through which organizations assess their own performance.
Relational dimension (organizational consequence): Evaluation cannot stand outside the engagement because evaluation is conducted in the same medium the engagement has entered. Contrast with electricity, where the production substrate changed while the evaluation substrate remained separable.
Illustrative evidence: Brynjolfsson, Li, & Raymond (2025): Expert agents could not reliably assess quality changes in their own AI-mediated interactions. METR (2025b): Developers could not accurately assess their own AI-mediated speed.

Three Dimensions of Evaluative Capacity

Proxy metric elevation and evaluator transformation converge into the central mechanism: proxy seduction. Whether that mechanism produces organizational drift or gets caught depends on the organization’s evaluative capacity, which has three components: detection, judgment stock, and braking.

Detection is whether anyone in the organization registers that the observables it is optimizing against are not the criteria it is accountable to. Detection is epistemic friction: Someone within the organization or connected to it through boundary activity (work that translates between the proxy-level metrics that the organization optimizes for and the criterion-level standards it is accountable to), audit, crisis, or external review registers that proxy metrics and accountable criteria have diverged. The signal need not be precise or well articulated, but if no one registers the divergence, there is nothing to act on.

Detection is not the same as noticing a discrepancy. Noticing is what the popular prescription assumes: better dashboards, clearer signals, more attentive management. Detection requires epistemic confidence, the capacity to trust one’s own judgment when the system’s output projects authority. And epistemic confidence is precisely what sustained AI engagement degrades. AI outputs carry the formal markers of competent work (the code compiles, the prose reads professionally, the analysis deploys the right tests), and those markers have been reliable quality signals throughout practitioners’ careers. Overriding that learned inference requires not just seeing a red flag but believing the red flag is real when the system’s output is saying everything is fine. Brynjolfsson, Li, and Raymond (2025) illustrate the difficulty: top-quintile agents likely possessed the domain knowledge to detect quality decline, yet they increased adherence to AI recommendations over time. Whether they detected the divergence and lacked institutional pathways to act, or whether the authority projected by the system’s formally competent outputs suppressed detection, is an empirical question the current evidence does not resolve. The distinction matters because the two explanations require different interventions.

The METR data reveal a subtler form of detection failure. Developers revised their expected speedup downward from 24% to 20% after using AI, a correction that carries the surface markers of epistemic responsibility: The estimate moved in a cautious direction. But the objective outcome was -19%, placing the actual perception–reality gap at 39 percentage points. The visible revision conceals a structural miscalibration. When practitioners revise their estimates in the direction of accuracy but stop far short of it, the act of revision provides social proof of responsible assessment, reducing the likelihood that the residual gap will be interrogated. Partial correction deepens concealment by lending credibility to a still-false estimate. At the organizational level, this pattern scales: An executive who revises projected AI gains from 30% to 20% sounds empirically grounded, and the revision becomes evidence of good organizational judgment, even if the outcome is negligible or negative.

Judgment stock operates only when detection has occurred. Judgment stock is whether the organization contains practitioners and processes that can discriminate proxy from criterion. The source of that discrimination is a feedback loop connecting evaluation to consequences. Practitioners who have lived through the consequences of their decisions (shipped code that broke under stress, contracts that failed under challenge, analyses that misled under scrutiny) develop tacit knowledge about what “good” means in their domain. That knowledge cannot be fully articulated as explicit rules. It is built through the repeated experience of evaluating work, encountering the downstream consequences of that evaluation, and updating judgment accordingly.

The inoculation that judgment provides is specific to the task and workflow contexts where practitioners have faced consequences. The METR developers illustrate this: Code quality (where they had deep consequence exposure) held, while workflow-level speed estimation (newly constituted by the engagement) showed a 39-point perception gap between projected time on task and objectively measured time on task (METR, 2025b). The same structure appears across the evidence base. Brynjolfsson, Li, and Raymond’s (2025) top-quintile customer service agents maintained resolution quality judgment but not adherence judgment. Doshi and Hauser’s (2024) individual writers maintained output-level creativity but could not detect population-level convergence. Practitioners are protected on dimensions their experience directly addresses and exposed on dimensions the engagement newly constitutes.

An organization can have both detection and judgment and still fail to act on them. Braking is the capacity to translate what the organization registers and knows into action that slows proxy seduction long enough for the distinction between proxy and accountable criteria to remain visible. Braking is not stopping progress. It is active, engineered governance: evaluation rituals that assess against accountable criteria, verification loops independent of the AI-mediated production process, and institutional pathways that connect what practitioners detect with how the organization makes decisions. DORA (2025) provides the clearest illustration: High-performing software teams using mature testing and monitoring infrastructure maintained throughput and stability, because pre-existing evaluation infrastructure forced AI output through verification processes calibrated to accountable criteria rather than proxy metrics alone.

Leonardi and Leavell (2026) show that in urban planning, where physical consequences are real and eventual, the difference between proxy capture and preserved judgment depended on three representational choices: the level of visual detail in the model, whether the technology was positioned as an independent authority or as a tool subordinate to professional judgment, and the degree of stakeholder access to model internals. Material consequences existed at both sites. What differed was the legibility of accountable criteria.

An organization can possess detection capacity, judgment stock, and braking potential and still fail to act. Brynjolfsson, Li, and Raymond (2025) provide the strongest evidence for this post-condition. Top-quintile experts increased adherence to AI recommendations over time, even as the researchers’ analysis showed that those recommendations marginally decreased conversation quality for these experts. No institutional pathway connected what those experts knew with how the organization evaluated the engagement. Capacity translates into action only when institutional conditions support it: when evaluation rituals assess against accountable criteria rather than proxy metrics alone, and when practitioners who detect divergence can escalate to decision-makers who act on evaluation.

Erosion of Evaluative Capacity

The three dimensions describe what evaluative capacity consists of. AI engagement erodes that capacity at three levels: practitioner, organizational, and field. The levels reinforce one another, making proxy seduction difficult to escape through any single-level intervention.

Practitioner-Level Erosion

AI engagement degrades the ability to recognize when proxy metrics diverge from accountable criteria through two dynamics. Experienced practitioners lose consequence exposure as the engagement reduces how often, and how deeply, they encounter the downstream effects of their decisions. Anthropic’s own engineers report skill atrophy and a shift to “code reviewer/reviser” roles (Anthropic, 2025), consistent with Beane’s (2019) account of how reduced practice erodes tacit knowledge. Practitioners entering the workforce after AI engagement has been established face a more fundamental problem: The evaluative capacity that proxy seduction erodes may never form in the first place. Shen and Tamkin (2026) demonstrate this in a randomized experiment: AI-assisted developers scored 17% lower on comprehension assessments than unassisted developers, and the interaction patterns that developers reported to be most productive were the ones that prevented learning. Bastani, Bastani, and Sungu (2025) confirm the metacognitive dimension: Students using standard ChatGPT scored 17% worse on unassisted exams while reporting confidence in learning that did not occur. Goldschmidt’s (1991) concept of premature arrest describes how the generative–evaluative dialectic can stop too early. Proxy seduction adds a prior failure mode: For practitioners who enter the workforce under AI engagement, the dialectic may never form. Even experienced practitioners are vulnerable on dimensions the engagement newly constitutes.

Practitioner erosion is compounded because the feedback practitioners receive when engaging AI is not absent but replaced. When a developer uses an AI coding assistant, the code compiles, tests pass, style guidelines are met, and specified cases are handled. The feedback is thorough, immediate, and entirely about explicit, codifiable properties. The silence on tacit properties (robustness under changing requirements, maintainability over time, architectural soundness under conditions the tests did not cover) does not register as missing information. It registers as confirmation that everything is fine. The thoroughness of proxy feedback masks the absence of criterion feedback. AI engagement accelerates the erosion of processes that surface tacit properties by producing outputs with formal completeness that makes the absence of those processes less visible.

Organizational-Level Erosion

Confirming operational experience compounds the erosion. When an organization evaluates AI engagement on speed and throughput, and the engagement produces confirming results on those dimensions (aggregate metrics improving, output volume increasing, stakeholders reporting satisfaction), the evaluation criteria that produced those results are reinforced. The practices that would sustain evaluation against accountable criteria (code review depth, editorial judgment, quality audit rigor) lose funding, staffing, and attention.

The institutional logics perspective (Thornton, Ocasio, & Lounsbury, 2012) explains the mechanism through which this reinforcement operates. Two logics are most directly in contention when AI engagement elevates speed and throughput metrics against craft and judgment-based assessment: market logic, which makes speed, throughput, and scale attractive; and professional logic, which privileges domain-specific quality, craft standards, and judgment-based assessment. Scaling from Paul’s individual-level transformation to organizational-level dynamics requires the organizational evaluator to be specified. Organizational identity (Whetten, 2006) and organizational knowledge (Nag, Corley, & Gioia, 2007) are both constructed from institutional logics: professional logic supplies identity templates and knowledge structures organized around craft, while market logic supplies those organized around throughput.

Weber and Glynn (2006), working within the institutional logics perspective, provide the mechanism whereby individual practitioner experience aggregates into institutional-level dynamics. Institutions shape individual sensemaking through “typifications”: shared schemas that provide practitioners with identities, frames, and action expectations from a “limited register” of available options. Individual practitioners make sense of their experience and typify patterns; those typifications feed back into institutional resources for future sensemaking. Organizations track their criteria through metrics calibrated under professional logic. AI engagement elevates a different set of metrics, organized around speed, volume, and throughput, that come to function as stand-ins for the originals. The sensemaking loop then operates on experience that confirms those stand-ins are adequate: aggregate metrics improve, output volume increases, and stakeholders report satisfaction. Learning that fits market logic travels through the typification loop, is institutionalized, and reinforces the logic templates that shape future practice. Learning that does not fit (the craft judgments, quality distinctions, and tacit standards sustained by professional logic) loses its feedback pathway. Professional logic is occluded from the institutional resources needed to sustain criterion-level evaluation.
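The direction of the loop can be made explicit with a toy model. The sketch below is an illustration of the present argument, not a model from Weber and Glynn (2006); the reinforcement and decay parameters are hypothetical, chosen only to show that one-sided feedback is sufficient to occlude professional logic from the register.

```python
# Toy dynamics (illustrative only): confirming proxy experience reinforces
# market-logic weight each period; professional-logic learning, lacking a
# feedback pathway, decays. Parameters are hypothetical.
market, professional = 0.5, 0.5
reinforce, decay = 0.10, 0.08
for period in range(10):
    market += reinforce * market          # confirming experience is institutionalized
    professional -= decay * professional  # criterion learning loses its pathway
    total = market + professional         # renormalize the "limited register"
    market, professional = market / total, professional / total
print(f"after 10 periods: market={market:.2f}, professional={professional:.2f}")
```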

This asymmetry would be correctable if organizations could detect it. Gioia, Schultz, and Corley (2000) explain why detection fails through what they term “adaptive instability”: organizations maintain stable identity labels while the meaning of those labels shifts. A development team that still calls itself “craftspeople” may no longer mean what it originally meant by “craft.” From inside the organization, the migration from professional to market logic feels like continuity rather than change.

The paper builds on Weber and Glynn’s typification loop and Gioia et al.’s adaptive instability, adding the claim that AI engagement produces the confirming experiences that drive both, and that the loop operates on organizational judgment itself. The occlusion is not a side effect of engagement. It is the mechanism by which engagement reshapes what organizations can evaluate.

Field-Level Erosion

Could field-level dynamics (professional associations, industry benchmarks, cross-organizational comparison) catch what individual organizations miss? MacKenzie’s (2006) and Callon’s (1998) work on Barnesian performativity explains why field-level correction also fails. MacKenzie and Callon argue that economic models and frameworks do not just describe markets. They constitute the markets they claim to describe, reshaping behavior to match the model’s predictions by actively constructing the categories through which markets are evaluated, not by passively reflecting a reality that already exists.

The field-level discourse around AI productivity is performative in this sense: Executive claims do not report findings but constitute the evaluative frame organizations bring to engagement. The discourse operates at two registers. Frontier lab leaders set the evaluative horizon. Anthropic’s CEO predicted that AI could eliminate half of all entry-level white-collar jobs (Amodei, 2025). OpenAI’s CEO declared that “It’ll be very hard to outwork a GPU” (Altman, 2026). These claims are performative in MacKenzie’s sense: They do not report findings but constitute the categories through which organizations evaluate AI, establishing speed, throughput, and labor substitution as the metrics that matter before any organization has tested whether those metrics track what it is accountable for delivering.

Deploying organizations then calibrate against that horizon. Through 2025, employers cited AI as the reason for approximately 55,000 US job cuts (Challenger, Gray & Christmas, 2025), and 40% of employers surveyed by the World Economic Forum expected to reduce headcount as a result of AI engagement (WEF, 2025). Whether any specific reduction improved efficiency is not the relevant question. Headcount reduction now circulates as the default demonstration of AI engagement success, performing a proxy (fewer people) into the position of a criterion (organizational efficiency) at the field level. Once that equation is established, any organization evaluating its own engagement inherits a frame in which headcount reduction counts as evidence, foreclosing the question of whether the proxy tracks what the organization is accountable for delivering.

The gap between the performative frame and empirical reality is itself diagnostic. A Yale Budget Lab analysis of Bureau of Labor Statistics data through November 2025 found no significant AI-related labor displacement in occupations with high AI exposure (Gimbel, Kinder, Kendall, & Lee, 2025). IBM’s CEO told The Wall Street Journal in May 2025 that AI had replaced the work of several hundred HR employees (Krishna, 2025). Nine months later, IBM’s CHRO announced the company would triple entry-level hiring after the cuts to junior roles collapsed the talent pipeline (LaMoreaux, 2026). The performative frame does not require empirical confirmation to operate, and the nine-month lag between the IBM CEO’s claim and the CHRO’s reversal offers the clearest evidence of that: The frame performed headcount reduction into the position of a criterion, and the proxy–criterion divergence arrived on schedule.

How the Levels Reinforce One Another

The three levels of erosion accelerate one another. Practitioner-level erosion removes the people who could register that proxy metrics and accountable criteria have diverged. Organizational-level erosion removes the evaluation rituals and resources through which such signals would reach decision-makers. Field-level performativity ensures that new organizations entering AI engagement, and new practitioners entering the workforce, begin with an evaluative frame oriented toward the proxy. Sustained navigation therefore requires active maintenance of evaluative capacity at all three levels simultaneously.

A second reinforcing loop connects organizational evaluation to the market for AI tools. Organizations operating under proxy seduction invest in capabilities that accelerate progress on the dimensions the engagement has made legible. Frontier labs face institutional pressure to optimize along those dimensions, not because they lack professional logic (Anthropic’s own engineers articulate the tension) but because the demand environment is shaped by the same proxy frame the framework describes. Those labs then reinforce the frame through field-level discourse, and the next wave of organizations enters engagement with that frame already in place.

The mechanism is now fully specified: precondition, dual transformation, evaluative capacity dimensions, erosion dynamics, and reinforcing loops across levels. The question is whether the evidence supports or complicates the architecture.

EMPIRICAL CONSTELLATION

The evidence base is a curated empirical constellation drawing on multiple pathways to phenomenon recognition (Fisher, Mayer, & Morris, 2021): data complications that challenge existing assumptions (METR, Humlum & Vestergaard), facilitating cases that prompt theoretical reframing (Leonardi & Leavell, Brynjolfsson et al.), and curious observations from practice (IBM hiring reversal, executive AI claims). No single study tests the full mechanism. Each case maps against specific claims, and the constellation establishes that the pattern is robust across domains and levels of analysis.

The studies in the constellation operate at different levels (individual practitioner, organization, field) and test different components of the mechanism (perception–reality gap, domain specificity of judgment, organizational positioning effects, braking conditions, non-formation of evaluative capacity). The constellation does not trace a single causal chain from individual to field. It establishes that each level operates as the framework specifies, that individual-level gaps compound into organizational drift through the institutional mechanisms described in the preceding section, and that population-level patterns are consistent with the reinforcing dynamics the mechanism describes: Humlum and Vestergaard’s null earnings effects across daily users reporting substantial benefits are consistent with field-level proxy seduction operating at population scale. Table 4 maps each case against the framework dimensions it tests.

Table 4: Empirical constellation: Cases mapped against framework dimensions

Direct Evidence: perception–reality gaps measurable at individual or organizational level.
Field-Level Patterns: effects structurally invisible at individual level, emerging only in aggregate.
Mechanism Precedent: theoretical lineage from adjacent fields.

Type | Study | Domain | Key Finding | Mechanism | Constellation Role
Direct Evidence | Daniotti et al. (2026) | Software engineering (panel, N=160,097) | Early-career developers 37% AI use, 0% gains; seniors 27% use, 6.2% gains. | Consequence-based judgment enables proxy–criterion discrimination; novices lack the feedback loop to evaluate. | Establishes experience-dependent evaluative capacity.
Direct Evidence | METR (2025b) | Software engineering (RCT, N=16 developers, 246 tasks) | Developers predicted 24% speedup and perceived 20% faster afterward but were 19% slower. Code quality held. 39-point perception–reality gap. | Domain specificity of judgment: Quality (within consequence exposure) held while speed (engagement-constituted dimension) showed full gap. | Gold-standard behavioral evidence of proxy–criterion divergence and domain specificity.
Direct Evidence | Brynjolfsson, Li, & Raymond (2025) | Customer service (field study, N=5,172) | Novices +34%, experts ~0% with quality decline. Experts increased AI adherence over time. Outage showed workers couldn’t revert. | Proxy drift operates even on practitioners with high judgment stock when braking conditions are absent. | Strongest evidence for post-condition (capacity without braking defaults to drift) and expert seduction.
Direct Evidence | Leonardi & Leavell (2026) | Urban planning (comparative case, two organizations) | Same AI simulation tool, different outcomes. Mountain maintained provisionality. Ocean amplified certainty (83% absolute framing). | Organizational positioning determines which proxies become salient. Fertile form underdetermines function. | Cleanest evidence for Faulkner–Runde claim. Extends framework to stakeholder engagement.
Direct Evidence | Humlum & Vestergaard (2025) | Labor market (admin data, full Danish working-age population) | Adoption widespread across 11 exposed occupations. Self-reported time savings 2.8%. Administrative data: null effects on earnings and hours, ruling out changes >2%. | Three-level proxy–criterion gap visible simultaneously: Practitioners report savings, organizations invest in proxy infrastructure, field-level discourse performs closure. | Most direct population-scale evidence of proxy–criterion divergence.
Direct Evidence | Cruces et al. (2026) | Education/productivity (experimental, NBER 34851) | AI closed 75% of education-based productivity gap during task execution. Gap reappeared in full when AI removed. Competence was scaffolded, not internalized. | Proxy (task performance with AI) diverges from criterion (underlying capability). | Demonstrates proxy–criterion divergence at individual competence level. Output mimicry mechanism.
Direct Evidence | Vendraminelli et al. (2025) | Marketing/Tech (field experiment, N=78) | “Wall Effect”: Tech specialists degraded marketing content while believing they improved it. | Domain expertise creates confidence without cross-domain evaluation capacity. Judgment protects within domain, not across. | Shows expertise can increase rather than reduce proxy–criterion miscalibration outside consequence domain.
Direct Evidence | Dell’Acqua et al. (2023) | Management consulting (RCT, N=758) | “Jagged technological frontier”: +40% inside, -19pp outside. Consultants couldn’t identify boundary. | Performance gains on proxy-aligned tasks mask inability to recognize where proxy diverges from criterion. | Establishes frontier invisibility as systematic phenomenon.
Direct Evidence | Workday (2026) | Large enterprises globally (survey, N=3,200) | 90%+ confidence; 14% positive outcomes. 37% time saved lost to rework. | Confidence decoupled from outcomes at organizational scale. | Documents proxy–criterion disconnect in enterprise self-assessment.
Direct Evidence | Anthropic (Dec 2025) | AI research (internal report, N=132 engineers) | Self-reported 50% productivity gains. Simultaneous skill atrophy concerns. Work shifted to “70%+ code reviewer/reviser.” | Practitioners articulate the proxy–criterion tension the framework formalizes, yet use and perceived productivity continue to rise. | Boundary activity in real time. Expert vulnerability even among AI researchers.
Direct Evidence | DORA (2025) | Software engineering (survey + qualitative, N≈5,000) | Throughput up, stability down. High-performing teams (mature testing, CI/CD) maintained both. Most confident teams showed lowest verification behavior. | Institutional friction (automated testing, monitoring) shortens proxy–criterion feedback loop. Trust and verification inversely related. | Supports braking claim: Pre-existing evaluation infrastructure differentiates navigation from drift.
Direct Evidence | Shen & Tamkin (2026) | Software engineering (RCT, N=52 professional developers) | AI-assisted developers scored 17% lower on comprehension; largest deficits in debugging. The three interaction patterns that prevented learning were reported as most productive. | Proxy seduction can prevent evaluative capacity from forming, not just erode existing capacity. | Establishes non-formation mechanism. Extends erosion claim from degradation to prevention.
Direct Evidence | Fernandes et al. (2026) | Logical reasoning (experimental, N=698) | +3 points actual, +7 believed improvement. Higher AI literacy = lower metacognitive accuracy. | Knowledge about AI increases confidence faster than it improves calibration. AI literacy paradoxically widens the gap. | Shows that information about AI does not function as braking without consequence exposure.
Field-Level | Doshi & Hauser (2024) | Creative writing (experimental, N=293 writers + 600 evaluators) | Stories rated more creative individually; 10.7% greater semantic similarity collectively. Prompt engineering did not close diversity gap. | Individual proxy (output quality) rises while field-level criterion (collective diversity) degrades. No individual can detect the pattern. | Establishes individual–collective paradox. Structural boundary on detection.
Field-Level | Anderson et al. (2024) | Creative ideation (experimental, N=33, 1,271 ideas) | “Produced less semantically distinct ideas” with ChatGPT; reduced sense of responsibility. | Homogenization at group level while individuals perceive greater creativity. | Replicates Doshi & Hauser’s pattern in ideation domain.
Field-Level | Moon et al. (2025) | College essays (text analysis, N=2,200 essays) | Human essays contributed more new ideas; diversity gap widened at scale. | Cumulative diversity deficit emerges only through population-level analysis. | Demonstrates scale-dependent visibility of proxy–criterion divergence.
Field-Level | Meincke et al. (2025) | Reanalysis of Lee & Chung data | 94% AI ideas overlapped; 6% unique vs 100% human-only. | Reanalysis reveals what original individual-level analysis obscured. | Methodological demonstration of hidden homogenization.
Field-Level | Lee & Chung (2024) | Everyday creativity (experimental, five studies) | ChatGPT increased individual creativity ratings; articulation mediated perceived creativity. | Form improvements (articulation) drive proxy perception independent of function (originality). | Source data showing form/function conflation in creativity assessment.
Field-Level | De Freitas, Nave, & Puntoni (2025) | Writing assessment (experimental) | LLMs enhance “form” (articulation, fluency, elaboration) independent of “function.” | Form and function decouple; evaluators respond to form. Proxy (how outputs look) diverges from criterion (what outputs accomplish). | Theoretical grounding for output mimicry and form/function distinction.
Field-Level | Zhao et al. (2025) | LLM creativity assessment (computational, 700 items, 6 models) | LLMs excel at elaboration, fall short on originality. | Systematic capability profile: strong on form (proxy) dimensions, weak on function (criterion) dimensions. | Maps LLM capabilities to form/function framework.
Field-Level | Koren et al. (2026) | Open-source ecosystem (ecosystem analysis) | Tailwind CSS: downloads +80%, revenue -80%. AI agents select components without developer engagement with documentation, bugs, or maintainers. | Usage metrics (proxy) decouple from community engagement (criterion) at ecosystem level. | Field-level performativity: Metrics used to assess ecosystem health are constituted by engagement that does not produce what those metrics were designed to track.
Field-Level | Bastani, Bastani, & Sungu (2025) | Education (field experiment, N~1,000 students) | Standard ChatGPT users scored 17% worse on unassisted exams than controls, while reporting confidence in learning that did not occur. | Metacognitive decoupling at scale: Confidence rises while competence falls, and the divergence is invisible to the learner. | Confirms Shen & Tamkin’s non-formation finding at population scale.
Mechanism Precedent | Bainbridge (1983) | Industrial automation (theoretical) | “Ironies of automation”: more reliable systems = less capable human intervention. | Automation removes practice opportunities needed to maintain judgment. Skills needed only when system fails are the skills disuse erodes. | Foundational irony underlying practitioner-level erosion of judgment stock.
Mechanism Precedent | Endsley (2023) | Human factors (review) | AI opacity and non-determinism intensify ironies; operators trust what they cannot evaluate. | Updates Bainbridge for AI-specific properties: Opacity compounds judgment erosion. | Extends ironies framework to conditions where proxy–criterion gap is structurally harder to detect.
Mechanism Precedent | Simkute et al. (2024) | Software development (qualitative) | “Production-to-evaluation shift”: Makers become judges. Task that develops capacity displaced by task requiring it. | Role transformation eliminates the developmental pathway through which consequence-based judgment forms. | Identifies the mechanism by which engagement erodes judgment stock at the practitioner level.
Mechanism Precedent | Shaw & Nave (2026) | Cognitive psychology (three preregistered experiments, N=1,372, 9,593 trials) | Participants adopted AI outputs on ~80% of faulty trials. Confidence rose ~12pp even with 50% faulty outputs. Per-item incentives and immediate feedback doubled rejection rate. | AI engagement suppresses metacognitive monitoring. Recalibration achievable individually when feedback is immediate and consequences personal. | Supplies cognitive mechanism underlying proxy seduction at practitioner level. Demonstrates individual recalibration is possible.

Industry sources referenced in the main text (BCG 2024, McKinsey 2025, Deloitte 2024) are excluded from this table. These sources measure perception without behavioral verification and do not directly test proxy–criterion divergence.

Two cases require elaboration beyond what the table shows. Brynjolfsson, Li, and Raymond (2025) appear in the evidence for three different claims: the judgment claim (the experts possessed discrimination capacity), the braking claim (the firm had criterion-level evaluation tied to compensation and drift happened anyway), and the post-condition (experts who may have recognized the divergence still defaulted toward AI adherence without institutional support for acting on that recognition). The firm is simultaneously the strongest evidence for the post-condition and the most challenging case for the braking claim. Leonardi and Leavell (2026) offer the cleanest evidence for the Faulkner–Runde claim that organizational positioning determines which proxies become salient, extending the framework’s scope to stakeholder engagement where subject-matter knowledge was high but tool-epistemics knowledge was low.

The perception–reality gap is robust across multiple studies, domains, and levels of analysis. Consequence-based judgment modulates outcomes in the predicted direction. Organizational positioning of the same technology produces different outcomes. Expert practitioners can be seduced despite having the judgment to resist. Institutional friction consistently differentiates navigation from drift.

No study tests the full sequential mechanism, from performativity, through fertile form, through evaluator transformation, through logic lock-in. No longitudinal study traces the Weber–Glynn loop operating over time. The judgment erosion claim rests on cross-sectional evidence. Clean distinction from simpler explanations (optimism bias, Dunning–Kruger, cognitive load) is not established in every case, though the cross-domain pattern and the domain specificity of judgment are difficult for these simpler explanations to accommodate.

The METR developers had maximal domain expertise and still showed the 39-point gap (a perceived 20% speedup against an actual 19% slowdown). The framework resolves this through domain specificity (the gap was on speed, not quality), but confirmation requires studies designed to separate the two dimensions. The Brynjolfsson experts had both detection and judgment and still defaulted to the proxy. The framework resolves this through the post-condition (capacity without braking defaults to drift), but a cleaner test would isolate judgment from braking conditions. The boundary between productive AI expansion and proxy seduction is an empirical question at the level of specific engagements, not a categorical distinction.

FALSIFICATION CONDITIONS

The core pattern is empirically robust: Organizations consistently perceive AI-driven gains that systematic measurement does not confirm, and the gap resists expertise. The pattern is established, including the precondition that proxy–criterion divergence is categorical rather than a matter of degree. Four conditions test whether the variables the mechanism specifies operate as it claims. For each, the condition states what should hold if the mechanism operates as specified, what would disconfirm it, what the evidence shows, and what the empirical protocol can test.

Experienced Practitioners Should Detect Proxy–Criterion Divergence Where They Have Experience

Experienced practitioners should show smaller perception–reality gaps on dimensions where they have consequence exposure, and larger gaps on dimensions that the engagement newly constitutes. If experienced practitioners show gaps of the same magnitude as inexperienced practitioners on tasks the experienced practitioners know well, the framework’s judgment claim is wrong. This condition tests judgment, not braking. If organizational mandates or institutional pressure prevent practitioners from acting on what they detect, the gap may persist for reasons the judgment claim does not address. A valid test requires practitioners to have the institutional conditions to exercise their judgment. Daniotti’s experience gradient and METR’s quality-holding result are both supportive. Brynjolfsson, Li, and Raymond’s (2025) top-quintile experts complicate the picture: They had the judgment but increased adherence to AI recommendations as quality declined. The experts may have detected the decline and lacked institutional pathways to act on it, or the authority projected by the system’s formally competent outputs may have suppressed detection itself; these are two different explanations with different implications for the framework. No study in the current evidence base distinguishes between them. A cleaner test would hold braking conditions constant and measure whether judgment alone predicts discrimination, as in the sketch below.
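A minimal sketch of that cleaner test, with simulated data and hypothetical variable names (nothing here comes from the cited studies): braking conditions are held constant by design, and the question is whether the perception–reality gap shrinks with experience only on consequence-exposed dimensions.

```python
# Minimal sketch, assuming each record is one practitioner-dimension
# observation with braking held constant by design. The judgment claim
# predicts a negative experience x exposure interaction and a near-zero
# main experience effect.
import numpy as np

rng = np.random.default_rng(0)
n = 500
experience = rng.uniform(0, 20, n)    # years of consequence exposure
exposed_dim = rng.integers(0, 2, n)   # 1 = dimension the practitioner has exposure to

# Simulated perception-reality gap (percentage points): shrinks with
# experience only where consequence exposure exists (the prediction).
gap = 30 - 1.2 * experience * exposed_dim + rng.normal(0, 5, n)

# OLS with an experience x exposure interaction, via least squares.
X = np.column_stack([np.ones(n), experience, exposed_dim, experience * exposed_dim])
beta, *_ = np.linalg.lstsq(X, gap, rcond=None)
print(dict(zip(["intercept", "experience", "exposed", "experience_x_exposed"],
               np.round(beta, 2))))
# A supportive result: experience_x_exposed clearly negative while the
# main experience coefficient sits near zero.
```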

Deliberate Criterion-Level Evaluation Should Slow Proxy Drift

Organizations maintaining practices that connect evaluation to accountable criteria should show less proxy–criterion divergence than organizations without such practices. If deliberate criterion-level evaluation produces no measurable difference in divergence, the braking claim is wrong. The test must also distinguish cause from correlation: If the effect appears only in organizations already disposed to brake, the evaluation practices are not what slowed the drift. The evidence is limited and partially disconfirming. The Brynjolfsson firm maintained criterion-level evaluation tied to compensation, and drift happened anyway. Leonardi and Leavell’s Mountain case shows that constraining representational choices kept stakeholders treating projections as uncertain rather than settled. But the study tests organizational positioning of the technology, not whether deliberate criterion-level evaluation practices slow drift. No study in the current evidence base tests whether deliberate criterion-level evaluation, introduced as an organizational practice, slows proxy drift.
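If such a study were run, a difference-in-differences comparison would be the natural analysis. A hedged sketch with invented numbers, assuming the practice could be introduced on a quasi-random basis to separate cause from the disposition-to-brake confound noted above:

```python
# Hedged sketch: criterion-level evaluation introduced as a practice in
# "treated" organizations, analyzed as a difference-in-differences on
# proxy-criterion divergence (percentage points). All data are invented.
import numpy as np

treated_pre, treated_post = np.array([8., 9., 7.]), np.array([6., 7., 5.])
control_pre, control_post = np.array([8., 7., 9.]), np.array([9., 8., 10.])

did = (treated_post.mean() - treated_pre.mean()) - (control_post.mean() - control_pre.mean())
print(f"difference-in-differences: {did:+.1f} pp")  # negative supports the braking claim
```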

How Organizations Talk About Good Work Should Narrow Over Time Under Sustained AI Engagement

Sustained AI engagement should produce observable narrowing of evaluative vocabulary and practice, as criteria migrate from professional logic (robustness, originality, resolution quality) toward market logic (velocity, throughput, handle time, output volume). The narrowing should be self-reinforcing: It should not reverse spontaneously without intervention or crisis. If organizations with sustained AI engagement show no such narrowing, the institutional feedback claim is wrong. Spontaneous reversal without intervention or crisis would disconfirm the self-reinforcing claim specifically. This is the weakest empirical area: No longitudinal study traces the loop. Cross-sectional evidence is consistent (Brynjolfsson’s experts shifting toward AI adherence, DORA’s low-performing teams losing stability while gaining throughput) but not designed to measure vocabulary or practice narrowing over time.
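The narrowing claim is nonetheless operationalizable now. One hypothetical way to measure it (the term lists and corpora below are invented for illustration) is the Shannon entropy of evaluative-term frequencies in an organization’s review documents, tracked across periods; the claim predicts falling entropy, with market-logic terms crowding out professional-logic terms.

```python
# Sketch of one operationalization of evaluative-vocabulary narrowing:
# Shannon entropy of evaluative-term frequencies per period. Term lists
# and toy corpora are hypothetical.
from collections import Counter
from math import log2

PROFESSIONAL = {"robustness", "originality", "maintainability", "soundness"}
MARKET = {"velocity", "throughput", "volume", "handle-time"}

def vocab_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits) of the evaluative-term distribution."""
    counts = Counter(t for t in tokens if t in PROFESSIONAL | MARKET)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy corpora standing in for review documents from two periods.
year_1 = "robustness originality velocity maintainability soundness throughput".split()
year_3 = "velocity throughput throughput volume velocity handle-time".split()
print(round(vocab_entropy(year_1), 2), round(vocab_entropy(year_3), 2))
# Narrowing appears as falling entropy (2.58 -> 1.92 bits in this toy case).
```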

Proxy Drift Should Persist Even When Practitioners Recognize the Divergence

Because proxy drift operates through organizational and field-level dynamics (not just through individual misperception), the drift should persist even when practitioners recognize the divergence. Staffing, funding, and evaluation infrastructure should gradually shift toward the proxy regardless of individual awareness. Three practitioner orientations bear on this test: gaming (Goodhart), sincere belief (individual-level proxy seduction), and recognition of the divergence alongside pragmatic navigation of organizational expectations. In all three cases the mechanism entails organizational-level drift. If the perception–reality gap disappears whenever practitioners recognize the divergence, then Goodhart’s Law or a simple organizational mandate explains the pattern, and proxy seduction is unnecessary as a distinct mechanism. METR provides the strongest evidence for sincerity: Developers had no incentive to overstate AI-driven gains, yet the 39-point gap persisted. No study directly tests whether organizational-level drift persists when the full practitioner population recognizes the divergence, which is what the framework’s explanatory scope, addressed next, would require.

DISCUSSION

Explanatory Scope

The PSF explains why perception–reality gaps in AI engagement are patterned rather than random, why expertise protects on some dimensions but not others, why the same technology produces different outcomes in different organizational contexts, why field-level discourse amplifies rather than corrects proxy substitution, why braking requires institutional support rather than individual judgment alone, and why organizational and field-level dynamics produce proxy drift even when individual practitioners recognize the divergence.

The framework does not explain the rate at which proxy substitution deepens under given conditions. Which specific proxies a given engagement will constitute also falls outside its scope, because fertile form underdetermines function and organizational positioning contextually resolves that underdetermination. Why some organizations brake and others do not given similar capacity remains an empirical question the interview protocol is designed to address.

The J-Curve and Substrate Separability

The framework’s most likely objection comes from the prevailing account in the productivity literature. David’s (1990) analysis of delayed returns to general-purpose technologies, the productivity J-curve that formalizes it (Brynjolfsson, Rock, & Syverson, 2021), and Brynjolfsson’s contemporary application to AI (Brynjolfsson, 2026) address when aggregate productivity gains from general-purpose technologies appear. The PSF addresses whether the evaluative frame through which organizations assess those gains is reliable. The two accounts are complementary, not competing.

The distinction turns on substrate separability: whether the technology’s production substrate and the organization’s evaluation substrate are the same or different. Electricity transformed the production substrate while leaving the evaluation substrate intact. Factory owners could evaluate electrified production using pre-electrification criteria because financial performance measures were unaffected by how goods were physically produced. For knowledge work mediated by LLM-based AI agents, the production substrate and the evaluation substrate overlap: Organizations produce work in language, and evaluate work in language, and the technology operates on language. The J-curve framework is silent on what happens to evaluative capacity when the technology and the evaluation share a substrate. Both frameworks can be true simultaneously: Organizations may need time to reorganize (J-curve), and the evaluation of that reorganization may be systematically distorted (PSF). The empirical question, testable with follow-up data from studies like Humlum and Vestergaard, is whether the earnings pass-through gap narrows as engagement matures (as the J-curve would expect) or persists and widens (as proxy seduction would expect when proxy optimization deepens).
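The follow-up test reduces to the sign of a trend. A hedged sketch with illustrative numbers (the gap series is invented; Humlum and Vestergaard report the underlying quantities, not this cohort series):

```python
# Sketch of the discriminating test: the proxy-criterion gap (self-reported
# gains minus measured earnings/hours pass-through, in percentage points)
# by years of engagement. J-curve predicts a negative slope (gap closes as
# reorganization completes); proxy seduction predicts zero or positive.
import numpy as np

tenure = np.array([0.5, 1, 2, 3, 4, 5])            # years since adoption (cohorts)
gap_pp = np.array([2.6, 2.7, 2.9, 3.1, 3.0, 3.3])  # illustrative only

slope, intercept = np.polyfit(tenure, gap_pp, 1)
verdict = "consistent with proxy seduction" if slope >= 0 else "consistent with the J-curve"
print(f"gap trend: {slope:+.2f} pp/year ({verdict})")
```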

The partial-correction pattern described in the detection analysis bears directly on this empirical question. All three competing explanations (Goodhart’s Law, the J-curve, workflow transformation) assume some intact feedback channel: Goodhart assumes organizations can catch gaming, the J-curve assumes organizations can detect the turn, and workflow transformation assumes organizations can validate the restructuring. The METR evidence shows a feedback channel that appears to be working (practitioners visibly revise estimates, organizations adjust projections) while delivering systematically misleading information. The channel’s apparent functionality is what makes its unreliability dangerous.

Implications for Practice

Whether proxy seduction better explains the trajectory is an empirical question that longitudinal data will resolve. But if the erosion claim holds, the evaluative capacity that organizations need to act on that resolution degrades while they wait. The three dimensions of evaluative capacity (detection, judgment stock, braking) can be assessed now for any specific AI engagement. Organizations can ask whether anyone within the organization is registering proxy–criterion divergence, whether practitioners with consequence-based judgment have institutional pathways to act on what they register, and whether evaluation rituals include criterion-level assessment rather than proxy-level assessment alone.
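Those three questions can be packaged as a first-pass diagnostic. A minimal sketch follows; the dimension names come from the framework, but the question wording and pass/fail scoring are hypothetical simplifications, not a validated instrument.

```python
# Sketch of the three-dimension assessment as a first-pass diagnostic.
# Dimension names follow the framework; scoring is a hypothetical
# simplification for illustration.
from dataclasses import dataclass

@dataclass
class EvaluativeCapacity:
    detection: bool  # is anyone registering proxy-criterion divergence?
    judgment: bool   # do consequence-experienced practitioners have pathways to act?
    braking: bool    # do evaluation rituals include criterion-level assessment?

    def diagnosis(self) -> str:
        if self.detection and self.judgment and self.braking:
            return "capacity present; monitor for drift"
        missing = [name for name, ok in [("detection", self.detection),
                                         ("judgment pathways", self.judgment),
                                         ("braking rituals", self.braking)] if not ok]
        return "at risk of proxy drift; missing: " + ", ".join(missing)

print(EvaluativeCapacity(detection=True, judgment=False, braking=False).diagnosis())
```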

The most important organizational investment the framework suggests is not in AI capability but in criterion-level evaluation infrastructure: the practices, roles, and institutional resources that maintain the distinction between what AI engagement makes legible and what the organization is accountable for producing. This means verification processes independent of the AI-mediated production loop: code review by practitioners who build the system, quality audits calibrated to accountable criteria, and evaluation rituals that assess against downstream consequences rather than upstream indicators.

Boundary activity emerges as a practice category that organizations can deliberately cultivate. The framework suggests that boundary activity is more effective when it operates against, rather than alongside, the dominant institutional logic. The practical corollary is that boundary activity requires institutional protection (funding, authority, organizational visibility) precisely because it generates friction that market logic consistently deprioritizes.

Implications for Research

Sequencing matters: Frontline practitioners should be interviewed first to surface criteria shift without researcher framing, followed by boundary actors to surface the translation work they perform. Three additional research directions follow from the framework. Longitudinal studies are needed to test the evaluative narrowing claim. Cross-domain comparison between materially constrained domains and knowledge work domains is needed to test the material-braking claim: Proxy seduction should appear on the symbolic dimensions of materially constrained work while the material dimensions resist, but the claim has not been tested with matched engagement intensity. Within-practitioner studies separating quality judgment from speed judgment (on the METR model) are needed to confirm domain specificity.

Implications for Policy

Regulatory frameworks that mandate criterion-level evaluation (not just proxy-level reporting) would function as institutional braking in the framework’s terms. Professional bodies that maintain consequence-based certification standards would function as judgment preservation infrastructure. Both should prove more effective than technology-specific regulation, because they address the evaluation mechanism rather than the technology itself.

CONCLUSION

Organizations engaging with AI face a systematic pattern: The technology constitutes attractive metrics (speed, volume, certainty) that diverge categorically from the criteria those organizations are accountable to (robust code, defensible contracts, valid diagnoses, substantive work). Practitioners optimize against these engagement-constituted proxies, sometimes through a sincere belief that constraining tradeoffs have dissolved, and sometimes while navigating the proxy as an organizational requirement alongside criterion-level work. In both cases the judgment that would catch the substitution erodes through the engagement itself, as feedback loops connecting evaluation to consequences are severed, shortened, or replaced by proxy feedback that provides thorough confirmation on the wrong dimensions.

The PSF makes this pattern theoretically legible by integrating four established resources into a causal architecture that explains what no individual resource can: why proxy–criterion divergence is constitutive rather than incidental, why the divergence is invisible from inside the engagement, why institutional dynamics reinforce rather than correct it, and why field-level discourse pre-constitutes the evaluative frame before any organization tests its operational validity. The framework is diagnostic: It identifies the precondition, the modulating variables (detection, judgment stock, braking), and the reinforcing dynamics across practitioner, organizational, and field levels.

The evidence base supports the framework’s claims across domains, skill levels, and levels of analysis. The perception–reality gap is robust and resists expertise on engagement-constituted dimensions. Organizational positioning determines which proxies become salient. The framework’s strongest empirical claim is that proxy drift operates through organizational and field-level dynamics that persist even when individual practitioners recognize the divergence, which distinguishes it from Goodhart’s Law and makes it harder to address through conventional metric-governance approaches.

The pattern the framework identifies will not resolve when AI capabilities mature or when organizations develop more sophisticated governance. It deepens wherever organizations engage with a technology that operates on the same substrate as their evaluation. What becomes possible with the framework in hand is not prediction but diagnosis: Organizations can now ask the right questions before evaluative capacity degrades to the point where they lack the judgment to act on the answers. For organizational theory, the framework opens a research program in which the evaluating subject is not treated as stable through transformative technological engagement. For practice, it points to a category of institutional investment, criterion-level evaluation infrastructure, that has no analogue in the AI capability literature. Whether that investment is made before or after the evaluative capacity needed to recognize its importance has eroded is the question the next decade of AI engagement will answer.

REFERENCES

Altman, S. 2026. Keynote address, India AI Impact Summit 2026, New Delhi, February 19.

Amodei, D. 2025. Interview with Axios, May 28. Available at https://www.axios.com/2025/05/28/ai-jobs-white-collar-unemployment-anthropic

Anderson, B. R., Shah, J. H., & Kreminski, M. 2024. Homogenization effects of large language models on human creative ideation. ACM Conference on Creativity & Cognition: 413-425.

Anthropic. 2025. How AI is transforming work at Anthropic. Available at https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic

Bainbridge, L. 1983. Ironies of automation. Automatica, 19(6): 775-779.

Bastani, H., Bastani, O., & Sungu, A. 2025. Generative AI without guardrails can harm learning: Evidence from high school mathematics. Proceedings of the National Academy of Sciences, 122(26): e2422633122.

BCG. 2024. The 10-20-70 framework for AI transformation. Boston, MA: Boston Consulting Group.

Beane, M. 2019. Shadow learning: Building robotic surgical skill when approved means fail. Administrative Science Quarterly, 64(1): 87-123.

Brynjolfsson, E. 2026. The AI productivity take-off is finally visible. Financial Times, February 15.

Brynjolfsson, E., Rock, D., & Syverson, C. 2017. Artificial intelligence and the modern productivity paradox: A clash of expectations and statistics. NBER Working Paper No. 24001, National Bureau of Economic Research, Cambridge, MA.

Brynjolfsson, E., Rock, D., & Syverson, C. 2021. The productivity J-curve: How intangibles complement general purpose technologies. American Economic Journal: Macroeconomics, 13(1): 333-372.

Brynjolfsson, E., Li, D., & Raymond, L. 2025. Generative AI at work. Quarterly Journal of Economics, 140(2): 889-942.

Callon, M. 1998. Introduction: The embeddedness of economic markets in economics. In M. Callon (Ed.), The laws of the markets: 1-57. Oxford: Blackwell.

Challenger, Gray & Christmas. 2025. 2025 year-end job cuts report. Chicago, IL: Challenger, Gray & Christmas.

Chrystal, K. A., & Mizen, P. D. 2003. Goodhart’s Law: Its origins, meaning and implications for monetary policy. In P. Mizen (Ed.), Central banking, monetary theory and practice: Essays in honour of Charles Goodhart, vol. 1: 221-243. Cheltenham: Edward Elgar.

Cruces, G., Fernandez Meijide, D., Galiani, S., Galvez, R. H., & Lombardi, M. 2026. Does generative AI narrow education-based productivity gaps? Evidence from a randomized experiment. NBER Working Paper No. 34851, National Bureau of Economic Research, Cambridge, MA.

Daniotti, S., Wachs, J., Feng, X., & Neffke, F. 2026. Who is using AI to code? Global diffusion and impact of generative AI. Science, 391(6787): 831-835.

David, P. A. 1990. The dynamo and the computer: An historical perspective on the modern productivity paradox. American Economic Review, 80(2): 355-361.

De Freitas, J., Nave, G., & Puntoni, S. 2025. Ideation with generative AI. Journal of Consumer Research, 52(1): 18-31.

Dell’Acqua, F., McFowland, E., Mollick, E. R., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. 2023. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Working Paper No. 24-013, Harvard Business School, Boston, MA.

Deloitte AI Institute. 2024. The state of generative AI in the enterprise: Now decides next. Available at https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-generative-ai-in-enterprise.html

DORA. 2025. Accelerate state of DevOps report 2025. Mountain View, CA: Google Cloud.

Doshi, A. R., & Hauser, O. P. 2024. Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, 10(28): eadn5290.

Eloundou, T., Manning, S., Mishkin, P., & Rock, D. 2024. GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702): 1306-1308.

Endsley, M. R. 2023. Ironies of artificial intelligence. Ergonomics, 66(11): 1656-1668.

Faulkner, P., & Runde, J. 2019. Theorizing the digital object. MIS Quarterly, 43(4): 1279-1302.

Fernandes, D., Villa, S., Nicholls, S., Haavisto, O., Buschek, D., Schmidt, A., Kosch, T., Shen, C., & Welsch, R. 2026. AI makes you smarter but none the wiser: The disconnect between performance and metacognition. Computers in Human Behavior, 175: 108779.

Fisher, G., Mayer, K., & Morris, S. 2021. Phenomenon-based theorizing. Academy of Management Review, 46(4): 631-639.

Gimbel, M., Kinder, M., Kendall, J., & Lee, M. 2025. Evaluating the impact of AI on the labor market: Current state of affairs. The Budget Lab at Yale. Available at https://budgetlab.yale.edu/research/evaluating-impact-ai-labor-market-current-state-affairs (updated through December 2025).

Gioia, D. A., Schultz, M., & Corley, K. G. 2000. Organizational identity, image, and adaptive instability. Academy of Management Review, 25(1): 63-81.

Goldschmidt, G. 1991. The dialectics of sketching. Creativity Research Journal, 4(2): 123-143.

Humlum, A., & Vestergaard, E. 2025. Large language models, small labor market effects. NBER Working Paper No. 33777, National Bureau of Economic Research, Cambridge, MA.

Koren, M., Békés, G., Hinz, J., & Lohmann, A. 2026. Vibe coding kills open source. arXiv: 2601.15494.

Krishna, A. 2025. Interview with The Wall Street Journal, May 5.

LaMoreaux, N. 2026. Interview with Fortune, February 13.

Lee, B. C., & Chung, J. 2024. An empirical investigation of the impact of ChatGPT on creativity. Nature Human Behaviour, 8(10): 1906-1914.

Leonardi, P. M. 2011. When flexible routines meet flexible technologies: Affordance, constraint, and the imbrication of human and material agencies. MIS Quarterly, 35(1): 147-167.

Leonardi, P. M., & Leavell, V. 2026. Knowing enough to be dangerous: The problem of “artificial certainty” for expert authority when using AI for decision making and planning. Organization Science, Articles in Advance: 1-28. doi:10.1287/orsc.2023.18224.

MacKenzie, D. 2006. An engine, not a camera: How financial models shape markets. Cambridge, MA: MIT Press.

Manning, S., & Aguirre, T. 2026. How adaptable are American workers to AI-induced job displacement? NBER Working Paper No. 34705, National Bureau of Economic Research, Cambridge, MA.

Massenkoff, M., & McCrory, P. 2026. Labor market impacts of AI: A new measure and early evidence. Available at https://www.anthropic.com/research/labor-market-impacts

McKinsey & Company. 2025. The state of AI in early 2025: Gen AI adoption spikes and starts to generate value. Available at https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

Meincke, L., Nave, G., & Terwiesch, C. 2025. ChatGPT decreases idea diversity in brainstorming. Nature Human Behaviour, 9: 1107-1109.

Messeri, L., & Crockett, M. J. 2024. Artificial intelligence and illusions of understanding in scientific research. Nature, 627(8002): 49-58.

METR. 2025a. Measuring AI ability to complete long tasks. arXiv: 2503.14499. Available at https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ (updated January 2026).

METR. 2025b. Measuring AI impact on developer productivity. Available at https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

Moon, K., Green, A. E., & Kushlev, K. 2025. Homogenizing effect of large language models (LLMs) on creative diversity: An empirical comparison of human and ChatGPT writing. Computers in Human Behavior: Artificial Humans, 6: 100207.

Nag, R., Corley, K. G., & Gioia, D. A. 2007. The intersection of organizational identity, knowledge, and practice. Academy of Management Journal, 50(4): 821-847.

Orlikowski, W. J. 2007. Sociomaterial practices: Exploring technology at work. Organization Studies, 28(9): 1435-1448.

Paul, L. A. 2014. Transformative experience. Oxford: Oxford University Press.

Sandberg, J., & Alvesson, M. 2011. Ways of constructing research questions: Gap-spotting or problematization? Organization, 18(1): 23-44.

Shaw, S. D., & Nave, G. 2026. Thinking: Fast, slow, and artificial: How AI is reshaping human reasoning and the rise of cognitive surrender. Working paper, The Wharton School, University of Pennsylvania, Philadelphia, PA.

Shen, J. H., & Tamkin, A. 2026. How AI impacts skill formation. arXiv: 2601.20245.

Simkute, A., Tankelevitch, L., Kewenig, V., Scott, A. E., Sellen, A., & Rintel, S. 2024. Ironies of generative AI: Understanding and mitigating productivity loss in human-AI interaction. International Journal of Human–Computer Interaction: 1-22.

Teece, D. J. 2007. Explicating dynamic capabilities: The nature and microfoundations of (sustainable) enterprise performance. Strategic Management Journal, 28(13): 1319-1350.

Thornton, P. H., Ocasio, W., & Lounsbury, M. 2012. The institutional logics perspective: A new approach to culture, structure, and process. Oxford: Oxford University Press.

Vendraminelli, L., Iansiti, M., Lakhani, K. R., & Menietti, M. 2025. The GenAI wall effect. Working Paper No. 26-011, Harvard Business School, Boston, MA.

Weber, K., & Glynn, M. A. 2006. Making sense with institutions: Context, thought and action in Karl Weick’s theory. Organization Studies, 27(11): 1639-1660.

WEF. 2025. The future of jobs report 2025. Geneva: World Economic Forum.

Whetten, D. A. 2006. Albert and Whetten revisited: Strengthening the concept of organizational identity. Journal of Management Inquiry, 15(3): 219-234.

Workday. 2026. Beyond productivity: Measuring the real value of AI. Available at https://www.workday.com/en-us/reports/beyond-productivity-ai-value

Zhao, Y., Zhang, R., Li, W., & Li, L. 2025. Assessing and understanding creativity in large language models. Machine Intelligence Research, 22: 417-432.

proxy seduction, evaluative capacity, proxy–criterion divergence, institutional logics, transformative experience, AI engagement, performativity

Evidence data: Anthropic/EconomicIndex (HuggingFace, CC-BY). Task statements from U.S. Department of Labor O*NET database. Usage percentages from Handa et al. (2025). Theoretical exposure from Massenkoff & McCrory (2026) / Eloundou et al. (2023).
Prepared by: Vikram Bapat, Institute for Manufacturing, University of Cambridge. March 2026.