Organizations engaging with AI consistently perceive gains that systematic measurement fails to confirm. The gap is not noise. It is patterned (perception consistently exceeds measurement), persistent (it holds across skill levels and domains), and resistant to expertise (experienced practitioners show it on dimensions the engagement newly constitutes, even where their domain judgment remains intact).
Each level of analysis (practitioner, organization, population) reveals a version of the gap that the actors at the previous level are structurally incapable of detecting. The developer cannot see their own slowdown. The organization cannot see the expert degradation hiding behind aggregate improvement. No writer evaluating their own output can see population-level convergence. Something structural is producing this pattern.
The paper's title is its argument in fifteen words. Every word does specific theoretical work.
Real gains, real metrics, sincere belief. Organizations are not wrong about what they observe. The problem is that the ground itself has been reconstituted by the engagement: what counts as progress changed. The gains are genuine, just not gains on the dimensions the organization is accountable to.
Not looking away. The instrument of evaluation has been reshaped by the engagement itself. Organizations lose sight because what they see with has changed, not because they stopped looking. The gaining and the losing are not sequential. They are the same process: proxy elevation and evaluative erosion are constitutively linked.
Not a one-time adoption decision. The mechanism operates through sustained work with the technology. The transformation happens in the doing, not at the moment an organization decides to use AI.
Not "causes" or "reveals." The engagement brings these metrics into existence as legitimate measures of organizational performance. Before engagement, speed and volume were observable but subordinate. After engagement, they are what evaluation runs on. The frame is made, not found.
Speed, volume, apparent certainty. These do not pre-exist the engagement in their current form. The engagement constitutes them as the most legible, responsive indicators available. They measure real properties, but real is not the same as accountable.
Gradual, self-reinforcing, invisible from inside. Not a discrete failure but a progressive loss of evaluative capacity. Failure is self-announcing. Degradation is self-concealing. Proxy seduction works because it looks like progress.
The unit of analysis is the organization, not the practitioner. Individual practitioners can recognize the divergence and still not arrest the drift. The erosion operates through institutional feedback loops that persist regardless of individual awareness.
Proxy seduction is the process by which engagement-elevated proxy metrics (speed, volume, certainty) displace accountable criteria (robustness, validity, defensibility) as the basis of organizational evaluation.
Practitioners genuinely believe tradeoffs have dissolved. The substitution is not gaming. It is invisible from inside the engagement.
The technology produces the metrics that drive the substitution. The proxy metrics do not pre-exist the engagement. The engagement makes them.
The engagement erodes the capacity that would catch the substitution. Each cycle deepens the trap.
Each section below unpacks one component, moving from precondition through engagement, dual transformation, evaluative capacity, and three-level erosion.
The mechanism only activates when the proxy metric and the accountable criterion diverge in kind, not in degree. Speed and robustness are different things, not different precisions of the same thing. If the divergence were in degree, the proxy would be a noisier version of the criterion, and the fix would be straightforward: refine the measurement, reduce the noise, and the gap closes. Calibration problems like these are well understood.
The framework makes a different claim. No amount of refining the proxy brings it closer to the criterion, because the two are not on the same dimension. The divergence is categorical.
| Proxy metric (legible) | Accountable criterion (occluded) | Evidence |
|---|---|---|
| Speed of code generation | Code robustness under stress | METR |
| Resolution speed | Resolution quality | Brynjolfsson et al. |
| Individual output quality | Collective diversity | Doshi & Hauser |
| Simulation certainty | Epistemic provisionality | Leonardi & Leavell |
| Task completion during AI use | Competence retained after AI removed | Cruces et al. 2026 |
Better measurement makes it worse, not better. A more precise speed metric is still not measuring robustness, but its precision confers authority. The problem is not measurement quality. The problem is measurement direction.
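The distinction can be made concrete with a minimal simulation (illustrative only; the variables and noise levels are assumptions, not the paper's data). If the proxy is a noisy reading of the criterion (divergence in degree), better measurement closes the gap; if it tracks an orthogonal dimension (divergence in kind), better measurement only sharpens the wrong signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
criterion = rng.normal(size=n)       # e.g., robustness (the accountable criterion)
orthogonal = rng.normal(size=n)      # e.g., speed (an independent dimension)

for noise in [1.0, 0.5, 0.1, 0.01]:  # progressively "better measurement"
    # Divergence in degree: the proxy is the criterion plus measurement noise.
    proxy_degree = criterion + rng.normal(scale=noise, size=n)
    # Divergence in kind: the proxy measures a different dimension entirely.
    proxy_kind = orthogonal + rng.normal(scale=noise, size=n)
    r_degree = np.corrcoef(criterion, proxy_degree)[0, 1]
    r_kind = np.corrcoef(criterion, proxy_kind)[0, 1]
    print(f"noise={noise:4}:  degree-proxy r={r_degree:.2f}   kind-proxy r={r_kind:+.2f}")
```

As noise falls, the degree-proxy converges on the criterion while the kind-proxy's correlation stays near zero: precision confers authority without conferring relevance.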
The evaluative continuity assumption (that the people evaluating a technology's impact remain competent judges through the engagement) could fail under any transformative technology. Three properties of AI explain why it fails here (Faulkner & Runde, 2019).
AI produces outputs whose form is indistinguishable from competent human work. The formal markers of earned competence (confidence, completeness, procedural correctness) arrive without the epistemic process that would produce them. Practitioners infer that the underlying work has improved correspondingly, because the improvement in form is mistaken for improvement in substance. Cruces et al. (2026) close the loop experimentally: AI closed 75% of the education-based productivity gap during task execution, but the gap reappeared in full when AI assistance was removed. The competence was scaffolded, not internalized.
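A worked illustration of that result, with hypothetical productivity units (only the 75% gap-closure figure and the full reappearance of the gap come from the study):

```python
# Hypothetical productivity units; only the 75% figure is from Cruces et al. (2026).
high_edu, low_edu = 100, 60
gap = high_edu - low_edu        # education-based gap: 40 units

with_ai = low_edu + 0.75 * gap  # AI closes 75% of the gap: 60 + 30 = 90
without_ai = low_edu            # remove AI: the full 40-unit gap reappears

print(with_ai, without_ai)      # 90.0 60 -- scaffolded, not internalized
```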
Most technologies constitute proxies within a bounded domain. A CRM generates sales metrics. An ERP generates operational metrics. AI operates on language, the medium through which most knowledge work is conducted, evaluated, and communicated. The proxies AI elevates (speed, volume, throughput) appear across most organizational functions, and their pervasiveness is part of what makes them look like the right metrics rather than convenient substitutes.
Earlier transformative technologies stabilized long enough for practitioners to develop feedback loops connecting evaluation to consequences. By recent estimates, AI capabilities shift roughly every seven months. The technology changes what the engagement involves before practitioners consolidate their assessment of the previous version: model capabilities shift between versions, new tool integrations alter workflows, and the evaluation target moves while the evaluator is still forming judgment about where it was.
Organizational positioning determines which of these properties become salient, which is why blanket claims about AI fail. The same technology produces different proxy-criterion gaps depending on how the organization positions it. AI's technical form is what the paper characterizes as unusually fertile: it can be positioned for a wider range of system functions than most prior technologies. Form underdetermines function, and organizational positioning resolves the underdetermination (Faulkner & Runde, 2019).
Given the precondition, AI engagement constitutes two things simultaneously.
Speed, volume, and certainty become highly visible and legible. These proxy metrics show progress and respond to effort. Meanwhile, the criteria the organization is actually accountable to (whether the code is robust under stress, whether the contract holds under challenge, whether the analysis is valid under scrutiny) become harder to see. The accountable criteria have not disappeared. The proxy metrics are so much more legible that the accountable criteria recede into the background.
Two urban planning organizations used the same AI simulation tool. One constrained visual detail, positioned the tool as one input among several, and gave stakeholders access to model internals. Their stakeholders maintained provisional language about 20-year projections. The other amplified all three dimensions. Their stakeholders treated 20-year projections as settled fact, with 83% of statements framed as absolutes. Same technology, different organizational positioning, different proxy-criterion gaps (Leonardi & Leavell, 2026).
The practitioner assessing work after deep AI engagement is not the same evaluator who set the original criteria. Paul's (2014) account of transformative experience provides the cognitive foundation: some experiences are epistemically transformative (they teach the person something that could not be learned without the experience) and personally transformative (they reshape preferences and self-understanding). If AI engagement is transformative in this sense, the practitioner who set pre-engagement criteria is not the practitioner evaluating outcomes post-engagement.
Anthropic's internal study of 132 engineers illustrates the structure (Anthropic, 2025). Over 12 months, self-reported Claude usage rose from 28% to 59% of daily work, and self-reported productivity gains rose from 20% to 50%. Engineers simultaneously reported concerns about skill atrophy and described a "paradox of supervision": using AI effectively requires the very skills that may erode through sustained AI use. The practitioners articulated the tension between their original criteria and the metrics that increasingly structured their work. Usage and perceived productivity continued to rise anyway.
Proxy metric elevation and evaluator transformation converge into the central mechanism: proxy seduction. Proxy metrics displace accountable criteria as the actual basis of organizational evaluation. Organizations are not gaming their metrics. Practitioners sincerely believe the tradeoffs that used to constrain their work have dissolved. The engagement made the proxy metric the most legible indicator available, the output looked competent, and quality shifted without anyone choosing to trade it away.
Whether proxy seduction produces organizational drift or gets caught depends on the organization's evaluative capacity. Its three dimensions (detection, judgment, braking) are the framework's original contribution, not borrowed from any of the four theoretical resources.
Does anyone in the organization register that the observables it is optimizing against are not the criteria it is accountable to? Detection is not noticing. The popular prescription (better dashboards, clearer signals, more attentive management) assumes noticing is the bottleneck. Detection requires epistemic confidence: trusting your own judgment when the system's output projects authority. AI outputs carry formal markers of competent work, and those markers have been reliable quality signals throughout practitioners' careers. Overriding that learned inference takes more than seeing a red flag. It takes trusting yourself over the machine when the machine's output looks professional.
The METR data reveal a subtler form. Developers revised their expected speedup downward from 24% to 20% after completing the work, a correction carrying the surface markers of epistemic responsibility. But the objective outcome was -19%, leaving a 39-point gap. Partial correction deepens concealment by lending credibility to a still-false estimate. At the organizational level, an executive who revises projected AI gains from 30% to 20% sounds empirically grounded, and the revision becomes evidence of good judgment, even if the actual outcome is negligible or negative.
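The arithmetic behind "partial correction deepens concealment," using the three figures above and assuming all are on the same percentage scale:

```python
initial_estimate = 24   # % speedup developers expected before working
revised_estimate = 20   # % speedup they reported after completing the tasks
actual = -19            # % measured outcome: a slowdown

gap_before = initial_estimate - actual  # 43 points
gap_after = revised_estimate - actual   # 39 points
corrected = gap_before - gap_after      # the revision removed only 4 of 43 points

print(gap_before, gap_after, corrected)  # 43 39 4
```

The revision removed roughly a tenth of the error while supplying the surface markers of calibration.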
Can anyone discriminate proxy from criterion? The source of that discrimination is consequence exposure: practitioners who have shipped code that broke, written contracts that failed, or produced analyses that misled under scrutiny develop tacit knowledge about what "good" means in their domain. The inoculation is domain-specific. The METR developers held quality judgment (where they had deep consequence exposure) while failing on speed estimation (newly constituted by the engagement). Same developers, same tasks, different dimensions, different outcomes.
The same structure appears across the evidence base. Brynjolfsson's top-quintile agents maintained resolution quality judgment but not adherence judgment. Doshi and Hauser's writers maintained individual output quality but could not detect population-level convergence. Practitioners are protected where their experience gives direct access to the accountable criterion, and exposed on dimensions the engagement newly constitutes.
Can the organization translate detection and judgment into action? Braking is not stopping progress. It is active, engineered governance: verification processes independent of the AI-mediated production loop, evaluation rituals assessing against accountable criteria, and institutional pathways connecting what practitioners detect with how the organization makes decisions. In DORA's (2025) data, high-performing software teams maintained both throughput and stability because pre-existing evaluation infrastructure forced AI output through verification calibrated to accountable criteria.
An organization can possess detection, judgment stock, and braking potential and still fail. Brynjolfsson's firm had all three capacities. No institutional pathway connected what the experts knew to how the organization evaluated the engagement. Capacity without institutional pathways defaults to drift.
The erosion operates at three levels (practitioner, organizational, field), and each level's erosion accelerates the other two.
Two dynamics operate. Experienced practitioners lose consequence exposure through disuse as AI handles more of the work that used to build their judgment. Practitioners entering the workforce after AI engagement is established face a more fundamental problem: the evaluative capacity may never form in the first place. Shen and Tamkin (2026) demonstrate the mechanism in a randomized experiment with professional software developers: AI-assisted developers scored 17% lower on comprehension, with the largest deficits in debugging, the very skill required to validate AI output. The interaction patterns that developers reported as most productive (pure delegation, progressive reliance, iterating through the AI rather than through the problem) were precisely the patterns that prevented learning.
Bastani, Bastani, and Sungu (2025) confirm the metacognitive dimension: learners using standard AI assistance reported confidence in learning that did not occur.
Compounding both erosion and non-formation, the feedback practitioners receive from AI-assisted work is not absent but replaced. The code compiles, the tests pass, the style guidelines are met. The feedback is thorough, immediate, and entirely about explicit, codifiable properties. The silence on tacit properties (robustness under changing requirements, maintainability by other developers over time, architectural soundness under conditions the tests did not cover) does not register as missing information. The silence registers as confirmation that everything is fine.
Confirming experience compounds the erosion. When an organization evaluates AI engagement on speed and throughput, and the engagement produces confirming results on those dimensions, the evaluation criteria that produced those results get reinforced. The practices that would sustain evaluation against accountable criteria (code review depth, editorial judgment, quality audit rigor) lose funding, staffing, and attention (Thornton, Ocasio, & Lounsbury, 2012).
The institutional logics perspective explains the mechanism. Market logic propagates through Weber and Glynn's (2006) typification loop: proxy-confirming experiences accumulate into organizational routines. Professional logic loses its feedback pathway. The craft judgments, quality distinctions, and tacit standards sustained by professional logic are cut off from the institutional resources needed to sustain criterion-level evaluation.
Gioia, Schultz, and Corley (2000) explain why detection fails through "adaptive instability": organizations maintain stable identity labels while the meaning of those labels shifts. A development team that still calls itself "craftspeople" may no longer mean what it originally meant by "craft." From inside the organization, the migration feels like continuity rather than change.
Executive claims do not report findings. They constitute the evaluative frame before organizations test it operationally (MacKenzie, 2006). Frontier lab CEOs predict that AI will eliminate half of entry-level jobs, or declare that GPUs will outwork humans. Those claims perform productivity and displacement into the default categories through which organizations evaluate their engagement. Through 2025, employers cited AI as the reason for approximately 55,000 US job cuts (Challenger, 2025). Whether any specific reduction improved efficiency is not the relevant question. Headcount reduction circulates as the default demonstration of engagement success, performing a proxy (fewer people) into the position of a criterion (organizational efficiency) before anyone has checked whether the proxy tracks what the organization is accountable for.
The IBM case shows the gap surfacing only when downstream consequences materialized (IBM, 2025-2026). In May 2025, the CEO told the Wall Street Journal that AI had replaced the work of several hundred HR employees. Nine months later, the company announced it would triple entry-level hiring because cutting junior roles had collapsed the talent pipeline. The performative frame does not require empirical confirmation to operate.
The mechanism is self-reinforcing through two feedback pathways. As practitioner judgment erodes (or never forms), organizations rely more heavily on the proxy metrics that are still legible, which deepens proxy metric reliance and further reduces opportunities for consequence-based learning. Field-level discourse reinforces the engagement frame, which shapes how the next wave of organizations enters engagement, which produces more confirming experience, which reinforces the discourse.
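A toy dynamical sketch makes the self-reinforcement visible (entirely illustrative; the functional forms, coefficients, and starting values are assumptions, not estimates from any study):

```python
# Toy model: judgment J in [0, 1], proxy reliance P in [0, 1].
# Pathway 1: proxy reliance crowds out consequence-based learning, eroding judgment.
# Pathway 2: as judgment erodes, proxies are the only legible signal left, raising reliance.
J, P = 0.9, 0.2            # assumed start: strong judgment, modest proxy reliance
erosion, pull = 0.08, 0.15  # assumed coupling strengths

for year in range(10):
    J = max(0.0, J - erosion * P)       # judgment erodes in proportion to proxy reliance
    P = min(1.0, P + pull * (1.0 - J))  # reliance grows as judgment weakens
    print(f"year {year + 1}: judgment={J:.2f}  proxy_reliance={P:.2f}")
```

Each variable's movement accelerates the other's. The loop has no resting point short of zero judgment and total proxy reliance, unless braking from outside the loop intervenes.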
Practitioner erosion removes the people who could detect the divergence between proxy metrics and accountable criteria. Organizational erosion removes the institutional pathways through which detection would reach decision-makers. Field-level performativity ensures that the starting frame is already oriented toward the proxy metrics. Sustained navigation requires active maintenance of evaluative capacity at all three levels simultaneously.
Electricity changed how physical goods were produced. Factory owners could still measure what mattered (output, cost, revenue, profit) the same way they always had, because the production substrate (physical manufacturing) and the evaluation substrate (financial accounting) were separate. For knowledge work mediated by AI, those substrates overlap. Organizations produce work in language, evaluate work in language, and the technology operates on language. The evaluation cannot stand outside the thing being evaluated.
Electricity did not generate attractive intermediate metrics between "we electrified the factory" and "let us check the books." AI fills that gap with a layer of highly legible proxy metrics (tasks completed, code merged, content produced) that arrive immediately, while financial outcomes lag. Organizations optimize against those proxies before lagging indicators reveal whether the proxies track what they are supposed to track.
The prevailing account in the productivity literature, the J-curve (Brynjolfsson, Rock, & Syverson), holds that general-purpose technologies require organizational reorganization before productivity gains become apparent. The Proxy Seduction Framework addresses a question the J-curve is silent on: what happens to evaluative capacity when the technology and the evaluation share a substrate. Both frameworks can be true simultaneously. Organizations may need time to reorganize (J-curve), and the evaluation of that reorganization may be systematically distorted (PSF). The empirical question is whether the perception-reality gap narrows as engagement matures (as the J-curve would expect) or persists and widens (as proxy seduction would expect when proxy optimization deepens).
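The two predictions can be stated as schematic trajectories for the perception-reality gap (the curves encode the hypotheses, not data; the 39-point starting value simply borrows the METR gap as a reference scale):

```python
import math

def gap_jcurve(t, g0=39, half_life=2.0):
    # J-curve expectation: the gap decays as reorganization matures.
    return g0 * 0.5 ** (t / half_life)

def gap_psf(t, g0=39, growth=0.1):
    # Proxy-seduction expectation: deepening proxy optimization widens the gap.
    return g0 * math.exp(growth * t)

for t in range(0, 9, 2):  # years of engagement
    print(f"t={t}: J-curve gap={gap_jcurve(t):5.1f}   PSF gap={gap_psf(t):5.1f}")
```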
The core pattern is established. Four conditions test whether the variables the mechanism specifies operate as claimed.
If experienced practitioners show gaps of the same magnitude as inexperienced practitioners on tasks they know well, the judgment claim is wrong. Daniotti's experience gradient and METR's quality-holding result are supportive. Brynjolfsson's top-quintile experts complicate the picture: they had the judgment but increased adherence. Whether they detected decline and lacked pathways, or whether the system suppressed detection, is unresolved.
If deliberate criterion-level evaluation produces no measurable difference in divergence, the braking claim is wrong. The Brynjolfsson firm maintained criterion-level evaluation tied to compensation, and drift happened anyway. Leonardi and Leavell's Mountain case shows constraining choices kept provisionality, but tests positioning, not evaluation practice. No study tests whether deliberate criterion-level evaluation, introduced as a practice, slows drift.
If organizations with sustained AI engagement show no narrowing of evaluative vocabulary from professional logic toward market logic, the institutional feedback claim is wrong. Cross-sectional evidence is consistent, but no longitudinal study traces the loop. This is the framework's weakest empirical area.
If the perception-reality gap disappears whenever practitioners recognize the divergence, then Goodhart's Law or organizational mandate explains the pattern and proxy seduction is unnecessary. METR provides the strongest evidence: developers had no incentive to overstate gains, yet the 39-point gap persisted through sincere belief. No study directly tests whether organizational-level drift persists when the full practitioner population recognizes the divergence.
The most critical AI investment is not in the model but in the infrastructure of evaluation. Whether proxy seduction better explains the trajectory is an empirical question that longitudinal data will resolve. But if the erosion claim holds, the evaluative capacity that organizations need to act on that resolution degrades while they wait.
Verification processes that operate independently of the AI-mediated production loop. Code review conducted by practitioners who build and maintain the system, not just people reviewing AI output. Quality audits calibrated to accountable criteria rather than proxy metrics. Evaluation rituals that require assessment against downstream consequences rather than upstream indicators. The DORA study showed that high-performing teams maintained both throughput and stability because their pre-existing evaluation infrastructure forced AI output through verification calibrated to accountable criteria. The key word is "independently." If the verification loop runs through the same AI-mediated process it is supposed to verify, it is not braking. It is proxy feedback confirming itself.
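What "independently" means operationally can be sketched in structural terms (all names here are hypothetical; this is a sketch of the constraint, not a prescribed implementation): the gate is forbidden, by construction, from consuming the production loop's own signals.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    code: str
    proxy_metrics: dict = field(default_factory=dict)  # speed, volume: legible signals

# Placeholder checks; real ones would run stress tests, human review, downstream audits.
def stress_test(code: str) -> bool: return True
def maintainer_review(code: str) -> bool: return True
def downstream_audit(code: str) -> bool: return True

def criterion_gate(artifact: Artifact) -> bool:
    """Gate calibrated to accountable criteria.

    Deliberately never reads artifact.proxy_metrics: a gate that consumed
    the production loop's own signals would be proxy feedback confirming
    itself, not braking.
    """
    return all([
        stress_test(artifact.code),        # robustness under stress, not speed of generation
        maintainer_review(artifact.code),  # review by someone who maintains the system
        downstream_audit(artifact.code),   # assessment against downstream consequences
    ])

print(criterion_gate(Artifact(code="...")))
```

The design choice doing the work is an absent dependency: criterion_gate has no code path that reads proxy_metrics, so the verification loop cannot collapse into proxy feedback.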
Work that translates between the proxy-level metrics an organization optimizes for and the criterion-level standards it is accountable to. The framework suggests boundary activity is more effective when it operates against, rather than alongside, the dominant institutional logic. The practical corollary is that boundary activity requires institutional protection (funding, authority, organizational visibility) precisely because it generates friction that market logic would prefer to eliminate.
The mechanism is diagnostic, not predictive. The framework does not say what will happen; it says what to check. For any specific AI engagement, an organization can ask: Does anyone register that the observables being optimized are not the criteria the organization is accountable to (detection)? Can anyone discriminate the proxy from the criterion on the dimension in question (judgment)? Is there an institutional pathway that translates what is detected into action (braking)?
The most consequential organizational investment the framework identifies is not in AI capability but in criterion-level evaluation infrastructure: the practices, roles, and institutional resources that maintain the distinction between what AI engagement makes legible and what the organization is accountable for producing. The reason the investment cannot be deferred is the final claim: the capacity organizations need to evaluate whether to act is being eroded by the thing they are evaluating.
Regulatory frameworks that mandate criterion-level evaluation (not just proxy-level reporting) would function as institutional braking. Professional bodies that maintain consequence-based certification standards would function as judgment preservation infrastructure. Both should prove more effective than technology-specific regulation, because they address the evaluation mechanism rather than the technology itself.