
The Proxy Seduction Mechanism

How AI engagement constitutes attractive proxy metrics that displace accountable criteria while eroding the judgment that would detect the substitution.
Theory/Framework · Empirical evidence · PSF contribution
v3.0 · March 2026
Bapat, Urmetzer, Norwood, Clarke

The Perception-Reality Gap

Organizations engaging with AI consistently perceive gains that systematic measurement fails to confirm. The gap is not noise. It is patterned (perception consistently exceeds measurement), persistent (it holds across skill levels and domains), and resistant to expertise (experienced practitioners show it on dimensions the engagement newly constitutes, even where their domain judgment remains intact).

INDIVIDUAL
Experienced open-source developers predicted a 24% speedup from AI assistance, perceived a 20% speedup after working, and were objectively measured at 19% slower. Code quality held equal across conditions. The perception-reality gap was 39 percentage points (the arithmetic is sketched after this list).
RCT METR, 2025b · N=16 experienced developers
ORGANIZATIONAL
Average productivity rose 15% with AI assistance, but expert agents gained approximately nothing and showed measurable quality decline. Experts increased adherence to AI recommendations over time.
FIELD STUDY Brynjolfsson, Li, & Raymond, 2025 · N=5,172 agents
ORGANIZATIONAL
85% of employees reported saving time with AI. Only 14% consistently achieved clear positive net outcomes.
ENTERPRISE SURVEY Workday, 2026
POPULATION
Workers across 11 AI-exposed occupations self-reported 2.8% time savings. Administrative records (tax and employment data, not self-report) showed null effects on earnings and hours, ruling out changes above 2%, even among daily users and early adopters reporting substantial benefits.
ADMIN DATA Humlum & Vestergaard, 2025 · Full Danish working-age population
FIELD
Individual stories were rated as more creative. The population of stories converged 10.7% semantically. No individual writer could detect the pattern from inside their own work.
EXPERIMENT Doshi & Hauser, 2024
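As a quick check on how the headline gaps above are computed, here is a minimal sketch of the arithmetic in Python. The sketch is ours, not the paper's; the null administrative-data result is approximated as zero.

```python
# Perception-reality gap = perceived change minus measured change,
# in percentage points, using the figures reported in the cards above.
studies = {
    "METR 2025b (developers)":   {"perceived": 20.0, "measured": -19.0},
    "Humlum & Vestergaard 2025": {"perceived": 2.8,  "measured": 0.0},  # null effect ~ 0
}

for name, s in studies.items():
    gap = s["perceived"] - s["measured"]
    print(f"{name}: {s['perceived']:+.1f} perceived - ({s['measured']:+.1f}) measured = {gap:.1f} pp gap")
# METR: 20 - (-19) = 39 points, the gap cited throughout this page.
```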

Each level reveals a version of the gap that the actors at the previous level are structurally incapable of detecting. The developer cannot see their own slowdown. The organization cannot see the expert degradation hiding behind aggregate improvement. No writer evaluating their own output can see population-level convergence. Something structural is producing this pattern.

The Title, Unpacked

The paper's title is its argument in fifteen words. Every word does specific theoretical work.

Gaining Ground

Real gains, real metrics, sincere belief. Organizations are not wrong about what they observe. The problem is that the ground itself has been reconstituted by the engagement: what counts as progress changed. The gains are genuine, just not gains on the dimensions the organization is accountable to.

Losing Sight

Not looking away. The instrument of evaluation has been reshaped by the engagement itself. Organizations lose sight because what they see with has changed, not because they stopped looking. The gaining and the losing are not sequential. They are the same process: proxy elevation and evaluative erosion are constitutively linked.

Engagement

Not a one-time adoption decision. The mechanism operates through sustained work with the technology. The transformation happens in the doing, not at the moment an organization decides to use AI.

Constitutes

Not "causes" or "reveals." The engagement brings these metrics into existence as legitimate measures of organizational performance. Before engagement, speed and volume were observable but subordinate. After engagement, they are what evaluation runs on. The frame is made, not found.

Proxy Metrics

Speed, volume, apparent certainty. These do not pre-exist the engagement in their current form. The engagement constitutes them as the most legible, responsive indicators available. They measure real properties, but real is not the same as accountable.

Erode

Gradual, self-reinforcing, invisible from inside. Not a discrete failure but a progressive loss of evaluative capacity. Failure is self-announcing. Degradation is self-concealing. Proxy seduction works because it looks like progress.

Organizational Judgment

The unit of analysis is the organization, not the practitioner. Individual practitioners can recognize the divergence and still not arrest the drift. The erosion operates through institutional feedback loops that persist regardless of individual awareness.

Proxy Seduction, Defined

Proxy seduction is the process by which engagement-elevated proxy metrics (speed, volume, certainty) displace accountable criteria (robustness, validity, defensibility) as the basis of organizational evaluation.

Sincere

Practitioners genuinely believe tradeoffs have dissolved. The substitution is not gaming. It is invisible from inside the engagement.

Constitutive

The technology produces the metrics that drive the substitution. The proxy metrics do not pre-exist the engagement. The engagement makes them.

Self-reinforcing

The engagement erodes the capacity that would catch the substitution. Each cycle deepens the trap.

KEY DISTINCTION: PROXY SEDUCTION VS. GOODHART'S LAW
Goodhart's Law
People knowingly game a metric. They understand the metric no longer tracks the real thing and optimize against it anyway. The problem is strategic. The fix is managerial: realign incentives and accountability.
Proxy Seduction
The technology makes the wrong metric the most visible one, and practitioners follow it because it genuinely looks like the right signal. The problem is epistemic. The fix is structural. The METR developers illustrate the difference: they were in a controlled experiment with no incentive to overstate gains, yet the gap persisted.

The Mechanism

Figure 1: How AI engagement constitutes attractive proxy metrics that displace accountable criteria while eroding the judgment that would detect the substitution

Each section below unpacks one component, moving from precondition through engagement, dual transformation, evaluative capacity, and three-level erosion.

Divergence in Kind

The mechanism only activates when the proxy metric and the accountable criterion diverge in kind, not in degree. Speed and robustness are different things, not different precisions of the same thing. If the divergence were in degree, the proxy would be a noisier version of the criterion, and the fix would be straightforward: refine the measurement, reduce the noise, and the gap closes. Calibration problems like these are well understood.

The framework makes a different claim. No amount of refining the proxy brings it closer to the criterion, because the two are not on the same dimension. The divergence is categorical.

Proxy metric (legible) → Accountable criterion (occluded) · Evidence
Speed of code generation → Code robustness under stress · METR
Resolution speed → Resolution quality · Brynjolfsson et al.
Individual output quality → Collective diversity · Doshi & Hauser
Simulation certainty → Epistemic provisionality · Leonardi & Leavell
Task completion during AI use → Competence retained after AI removed · Cruces et al., 2026

Better measurement makes it worse, not better. A more precise speed metric is still not measuring robustness, but its precision confers authority. The problem is not measurement quality. The problem is measurement direction.
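To see why refinement cannot close a divergence in kind, consider a toy simulation. It is illustrative only: the variable names, distributions, and noise levels are our assumptions, not quantities from the evidence base. A proxy that diverges in degree converges on the criterion as measurement noise shrinks; a proxy that diverges in kind stays uncorrelated no matter how precise it becomes.

```python
import random

random.seed(0)
n = 10_000
criterion = [random.gauss(0, 1) for _ in range(n)]   # e.g., robustness
unrelated = [random.gauss(0, 1) for _ in range(n)]   # an orthogonal property, e.g., speed

def corr(xs, ys):
    """Pearson correlation, computed from scratch to keep the sketch stdlib-only."""
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

for noise in (1.0, 0.5, 0.1):   # "refining the measurement"
    in_degree = [c + random.gauss(0, noise) for c in criterion]  # noisy reading of the criterion
    in_kind   = [u + random.gauss(0, noise) for u in unrelated]  # precise reading of the wrong thing
    print(f"noise={noise:.1f}  in-degree r={corr(in_degree, criterion):+.2f}  "
          f"in-kind r={corr(in_kind, criterion):+.2f}")
# The in-degree proxy's correlation rises toward 1 as noise shrinks.
# The in-kind proxy stays near 0 at every precision: its authority grows, its relevance does not.
```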

Three Properties That Intensify the Problem

The evaluative continuity assumption (that the people evaluating a technology's impact remain competent judges through the engagement) could fail under any transformative technology. Three properties of AI explain why it fails here. Faulkner & Runde, 2019

Output Mimicry

AI produces outputs whose form is indistinguishable from competent human work. The formal markers of earned competence (confidence, completeness, procedural correctness) arrive without the epistemic process that would produce them. Practitioners infer that the underlying work has improved correspondingly, because the improvement in form is mistaken for improvement in substance. Cruces et al. (2026) close the loop experimentally: AI closed 75% of the education-based productivity gap during task execution, but the gap reappeared in full when AI assistance was removed. The competence was scaffolded, not internalized.

Substrate Breadth

Most technologies constitute proxies within a bounded domain. A CRM generates sales metrics. An ERP generates operational metrics. AI operates on language, the medium through which most knowledge work is conducted, evaluated, and communicated. The proxies AI elevates (speed, volume, throughput) appear across most organizational functions, and their pervasiveness is part of what makes them look like the right metrics rather than convenient substitutes.

Capability Pace

Earlier transformative technologies stabilized long enough for practitioners to develop feedback loops connecting evaluation to consequences. AI capabilities change roughly every seven months by recent estimates. The technology shifts what the engagement involves before practitioners consolidate their assessment of the previous version. Model capabilities shift between versions, new tool integrations alter workflows, and the evaluation target moves while the evaluator is still forming judgment about where it was.

Organizational positioning determines which of these properties become salient, which is why blanket claims about AI fail. The same technology produces different proxy-criterion gaps depending on how the organization positions it. AI's technical form is what the paper characterizes as unusually fertile: it can be positioned for a wider range of system functions than most prior technologies. Form underdetermines function, and organizational positioning resolves the underdetermination. Faulkner & Runde, 2019

What AI Engagement Constitutes

Given the precondition, AI engagement constitutes two things simultaneously.

Proxy Elevation

Speed, volume, and certainty become highly visible and legible. These proxy metrics show progress and respond to effort. Meanwhile, the criteria the organization is actually accountable to (whether the code is robust under stress, whether the contract holds under challenge, whether the analysis is valid under scrutiny) become harder to see. The accountable criteria have not disappeared. The proxy metrics are so much more legible that the accountable criteria recede into the background.

Leonardi & Leavell: Same Tool, Different Outcomes

Two urban planning organizations used the same AI simulation tool. One constrained visual detail, positioned the tool as one input among several, and gave stakeholders access to model internals. Their stakeholders maintained provisional language about 20-year projections. The other amplified all three dimensions. Their stakeholders treated 20-year projections as settled fact, with 83% of statements framed as absolutes. Same technology, different organizational positioning, different proxy-criterion gaps. Leonardi & Leavell, 2026

Evaluator Transformation

The practitioner assessing work after deep AI engagement is not the same evaluator who set the original criteria. Paul's (2014) account of transformative experience provides the cognitive foundation: some experiences are epistemically transformative (they teach the person something that could not be learned without the experience) and personally transformative (they reshape preferences and self-understanding). If AI engagement is transformative in this sense, the practitioner who set pre-engagement criteria is not the practitioner evaluating outcomes post-engagement. Paul, 2014

Anthropic's internal study of 132 engineers illustrates the structure. Over 12 months, self-reported Claude usage rose from 28% to 59% of daily work, and self-reported productivity gains rose from 20% to 50%. Engineers simultaneously reported concerns about skill atrophy and described a "paradox of supervision": using AI effectively requires the very skills that may erode through sustained AI use. The practitioners articulated the tension between their original criteria and the metrics that increasingly structured their work. Usage and perceived productivity continued to rise anyway. Anthropic, 2025

Proxy metric elevation and evaluator transformation converge into the central mechanism: proxy seduction. Proxy metrics displace accountable criteria as the actual basis of organizational evaluation. Organizations are not gaming their metrics. Practitioners sincerely believe the tradeoffs that used to constrain their work have dissolved. The engagement made the proxy metric the most legible indicator available, the output looked competent, and quality shifted without anyone choosing to trade it away.

Three Dimensions of Evaluative Capacity

Whether proxy seduction produces organizational drift or gets caught depends on the organization's evaluative capacity. The three dimensions are the framework's original contribution, not borrowed from any of the four theoretical resources.

Detection

Does anyone in the organization register that the observables it is optimizing against are not the criteria it is accountable to? Detection is not noticing. The popular prescription (better dashboards, clearer signals, more attentive management) assumes noticing is the bottleneck. Detection requires epistemic confidence: trusting your own judgment when the system's output projects authority. AI outputs carry formal markers of competent work, and those markers have been reliable quality signals throughout practitioners' careers. Overriding that learned inference takes more than seeing a red flag. It takes trusting yourself over the machine when the machine's output looks professional.

The METR data reveal a subtler form of detection failure. Developers revised their expected speedup downward from 24% to 20% after working, a correction carrying the surface markers of epistemic responsibility. But the objective outcome was -19%, leaving a 39-point gap. Partial correction deepens concealment by lending credibility to a still-false estimate. At the organizational level, an executive who revises projected AI gains from 30% to 20% sounds empirically grounded, and the revision becomes evidence of good judgment, even if the actual outcome is negligible or negative.
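The arithmetic behind the "partial correction" point, as a minimal sketch (figures from the METR summary above):

```python
# The developers' downward revision covered only a small fraction of the error.
predicted, revised, actual = 24, 20, -19          # percent speedup
total_error = predicted - actual                  # 24 - (-19) = 43 points of error
corrected = predicted - revised                   # the revision covered 4 points
print(f"revision closed {corrected}/{total_error} points "
      f"({corrected / total_error:.0%}) of the prediction error")
# -> revision closed 4/43 points (9%) of the prediction error
```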

Judgment Stock

Can anyone discriminate proxy from criterion? The source of that discrimination is consequence exposure: practitioners who have shipped code that broke, written contracts that failed, or produced analyses that misled under scrutiny develop tacit knowledge about what "good" means in their domain. The inoculation is domain-specific. The METR developers held quality judgment (where they had deep consequence exposure) while failing on speed estimation (newly constituted by the engagement). Same developers, same tasks, different dimensions, different outcomes.

The same structure appears across the evidence base. Brynjolfsson's top-quintile agents maintained resolution quality judgment but not adherence judgment. Doshi and Hauser's writers maintained individual output quality but could not detect population-level convergence. Practitioners are protected where their experience gives direct access to the accountable criterion, and exposed on dimensions the engagement newly constitutes.

Braking

Can the organization translate detection and judgment into action? Braking is not stopping progress. It is active, engineered governance: verification processes independent of the AI-mediated production loop, evaluation rituals assessing against accountable criteria, and institutional pathways connecting what practitioners detect with how the organization makes decisions. High-performing software teams in the DORA study maintained both throughput and stability because pre-existing evaluation infrastructure forced AI output through verification calibrated to accountable criteria. DORA, 2025

An organization can possess detection, judgment stock, and braking potential and still fail. Brynjolfsson's firm had all three capacities. No institutional pathway connected what the experts knew to how the organization evaluated the engagement. Capacity without institutional pathways defaults to drift.

AI Engagement Erodes Evaluative Capacity at Three Levels

Each level's erosion accelerates the other two.

PRACTITIONER LEVEL

Two dynamics operate. Experienced practitioners lose consequence exposure through disuse as AI handles more of the work that used to build their judgment. Practitioners entering the workforce after AI engagement is established face a more fundamental problem: the evaluative capacity may never form in the first place. Shen and Tamkin (2026) demonstrate the mechanism in a randomized experiment with professional software developers: AI-assisted developers scored 17% lower on comprehension, with the largest deficits in debugging, the very skill required to validate AI output. The interaction patterns that developers reported as most productive (pure delegation, progressive reliance, iterating through the AI rather than through the problem) were precisely the patterns that prevented learning. Shen & Tamkin, 2026

Bastani, Bastani, and Sungu (2025) confirm the metacognitive dimension: learners using standard AI assistance reported confidence in learning that did not occur. Bastani et al., 2025

Compounding both erosion and non-formation, the feedback practitioners receive from AI-assisted work is not absent but replaced. The code compiles, the tests pass, the style guidelines are met. The feedback is thorough, immediate, and entirely about explicit, codifiable properties. The silence on tacit properties (robustness under changing requirements, maintainability by other developers over time, architectural soundness under conditions the tests did not cover) does not register as missing information. The silence registers as confirmation that everything is fine.

ORGANIZATIONAL LEVEL

Confirming experience compounds the erosion. When an organization evaluates AI engagement on speed and throughput, and the engagement produces confirming results on those dimensions, the evaluation criteria that produced those results get reinforced. The practices that would sustain evaluation against accountable criteria (code review depth, editorial judgment, quality audit rigor) lose funding, staffing, and attention. Thornton, Ocasio, & Lounsbury, 2012

The institutional logics perspective explains the mechanism. Market logic propagates through Weber and Glynn's (2006) typification loop: proxy-confirming experiences accumulate into organizational routines. Professional logic loses its feedback pathway. The craft judgments, quality distinctions, and tacit standards sustained by professional logic are cut off from the institutional resources needed to sustain criterion-level evaluation.

Gioia, Schultz, and Corley (2000) explain why detection fails through "adaptive instability": organizations maintain stable identity labels while the meaning of those labels shifts. A development team that still calls itself "craftspeople" may no longer mean what it originally meant by "craft." From inside the organization, the migration feels like continuity rather than change.

FIELD LEVEL

Executive claims do not report findings. They constitute the evaluative frame before organizations test it operationally. Frontier lab CEOs predict that AI will eliminate half of entry-level jobs, or declare that GPUs will outwork humans. Those claims perform productivity and displacement into the default categories through which organizations evaluate their engagement. Through 2025, employers cited AI as the reason for approximately 55,000 US job cuts. Whether any specific reduction improved efficiency is not the relevant question. Headcount reduction circulates as the default demonstration of engagement success, performing a proxy (fewer people) into the position of a criterion (organizational efficiency) before anyone has checked whether the proxy tracks what the organization is accountable for. MacKenzie, 2006; Challenger, 2025

The IBM case shows the gap surfacing only when downstream consequences materialized. In May 2025, the CEO told the Wall Street Journal that AI had replaced the work of several hundred HR employees. Nine months later, the company announced it would triple entry-level hiring because cutting junior roles had collapsed the talent pipeline. The performative frame does not require empirical confirmation to operate. IBM, 2025-2026

Reinforcing Loops

The mechanism is self-reinforcing through two feedback pathways. As practitioner judgment erodes (or never forms), organizations rely more heavily on the proxy metrics that are still legible, which deepens proxy metric reliance and further reduces opportunities for consequence-based learning. Field-level discourse reinforces the engagement frame, which shapes how the next wave of organizations enters engagement, which produces more confirming experience, which reinforces the discourse.

Practitioner erosion removes the people who could detect the divergence between proxy metrics and accountable criteria. Organizational erosion removes the institutional pathways through which detection would reach decision-makers. Field-level performativity ensures that the starting frame is already oriented toward the proxy metrics. Sustained navigation requires active maintenance of evaluative capacity at all three levels simultaneously.
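A toy model can make the loop structure concrete. The sketch below is purely illustrative: the update rules, parameter names, and starting values are our assumptions, chosen to exhibit the dynamic, not estimates from any study.

```python
# Two coupled stocks: consequence-based judgment and reliance on proxy metrics.
# Each cycle, reliance erodes judgment, and eroded judgment deepens reliance.
def simulate(steps=20, erosion=0.05, substitution=0.10):
    judgment, proxy_reliance = 1.0, 0.3   # hypothetical starting stocks
    for t in range(steps):
        # Loop 1: the more evaluation runs on legible proxies, the less
        # consequence-based judgment gets exercised and maintained.
        judgment *= (1 - erosion * proxy_reliance)
        # Loop 2: the less judgment remains, the more evaluation defaults
        # to whatever the engagement keeps legible.
        proxy_reliance = min(1.0, proxy_reliance + substitution * (1 - judgment))
        print(f"t={t:2d}  judgment={judgment:.2f}  proxy_reliance={proxy_reliance:.2f}")

simulate()
# The drift is tiny at first and compounds: exactly the self-concealing,
# self-reinforcing shape the framework describes.
```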

The Electricity Contrast

Electricity changed how physical goods were produced. Factory owners could still measure what mattered (output, cost, revenue, profit) the same way they always had, because the production substrate (physical manufacturing) and the evaluation substrate (financial accounting) were separate. For knowledge work mediated by AI, those substrates overlap. Organizations produce work in language, evaluate work in language, and the technology operates on language. The evaluation cannot stand outside the thing being evaluated.

Electricity did not generate attractive intermediate metrics between "we electrified the factory" and "let us check the books." AI fills that gap with a layer of highly legible proxy metrics (tasks completed, code merged, content produced) that arrive immediately, while financial outcomes lag. Organizations optimize against those proxies before lagging indicators reveal whether the proxies track what they are supposed to track.

The prevailing account in the productivity literature, the J-curve (Brynjolfsson, Rock, & Syverson, 2021), holds that general-purpose technologies require organizational reorganization before productivity gains become apparent. The Proxy Seduction Framework addresses a question the J-curve is silent on: what happens to evaluative capacity when the technology and the evaluation share a substrate. Both frameworks can be true simultaneously. Organizations may need time to reorganize (J-curve), and the evaluation of that reorganization may be systematically distorted (PSF). The empirical question is whether the perception-reality gap narrows as engagement matures (as the J-curve would expect) or persists and widens (as proxy seduction would expect when proxy optimization deepens).

Robust Pattern, Open Mechanism

ROBUST
Perception-reality gap replicated across developers, customer service agents, creative writers, students, Danish working-age population
Domain-specificity of judgment: METR quality vs. speed, Brynjolfsson experts vs. adherence, individual creativity vs. population diversity
Organizational positioning matters: same tool, different outcomes (Leonardi & Leavell)
Expert seduction: expertise does not protect on engagement-constituted dimensions
OPEN
No single study tests the full causal chain from performativity through evaluator transformation
No longitudinal study traces the typification loop operating over time
Judgment erosion rests on cross-sectional evidence, not within-practitioner longitudinal data
Alternative explanations (optimism bias) not cleanly separated in every case, though the cross-domain pattern is difficult for simpler accounts to explain

What Would Disconfirm the Framework

The core pattern is established. Four conditions test whether the variables the mechanism specifies operate as claimed.

Experienced practitioners should detect divergence where they have experience
SUPPORTED WITH COMPLICATION

If experienced practitioners show gaps of the same magnitude as inexperienced practitioners on tasks they know well, the judgment claim is wrong. Daniotti's experience gradient and METR's quality-holding result are supportive. Brynjolfsson's top-quintile experts complicate the picture: they had the judgment but increased adherence. Whether they detected decline and lacked pathways, or whether the system suppressed detection, is unresolved.

Deliberate criterion-level evaluation should slow proxy drift
PARTIALLY DISCONFIRMING

If deliberate criterion-level evaluation produces no measurable difference in divergence, the braking claim is wrong. The Brynjolfsson firm maintained criterion-level evaluation tied to compensation, and drift happened anyway. Leonardi and Leavell's Mountain case shows that constraining choices preserved provisionality, but it tests positioning, not evaluation practice. No study tests whether deliberate criterion-level evaluation, introduced as a practice, slows drift.

Evaluative vocabulary should narrow under sustained engagement
EMPIRICALLY OPEN

If organizations with sustained AI engagement show no narrowing of evaluative vocabulary from professional logic toward market logic, the institutional feedback claim is wrong. Cross-sectional evidence is consistent but no longitudinal study traces the loop. The weakest empirical area of the framework.

Proxy drift should persist even when practitioners recognize the divergence
SUPPORTED

If the perception-reality gap disappears whenever practitioners recognize the divergence, then Goodhart's Law or organizational mandate explains the pattern and proxy seduction is unnecessary. METR provides the strongest evidence: developers had no incentive to overstate gains, yet the 39-point gap persisted through sincere belief. No study directly tests whether organizational-level drift persists when the full practitioner population recognizes the divergence.

Invest in Braking

The most critical AI investment is not in the model but in the infrastructure of evaluation. Whether proxy seduction or the J-curve better explains the trajectory is an empirical question that longitudinal data will resolve. But if the erosion claim holds, the evaluative capacity organizations would need to act on that resolution degrades while they wait.

Criterion-Level Evaluation Infrastructure

Verification processes that operate independently of the AI-mediated production loop. Code review conducted by practitioners who build and maintain the system, not just people reviewing AI output. Quality audits calibrated to accountable criteria rather than proxy metrics. Evaluation rituals that require assessment against downstream consequences rather than upstream indicators. The DORA study showed that high-performing teams maintained both throughput and stability because their pre-existing evaluation infrastructure forced AI output through verification calibrated to accountable criteria. The key word is "independently." If the verification loop runs through the same AI-mediated process it is supposed to verify, it is not braking. It is proxy feedback confirming itself.
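The independence requirement can be made concrete. The sketch below is a hypothetical illustration, not DORA's or the paper's method; the function and field names are ours. It encodes one rule: verification of an AI-assisted change may not run through anyone inside that change's AI-mediated production loop.

```python
# A reviewer-selection rule enforcing independence of the verification loop.
def pick_reviewers(change_authors: set[str], ai_loop_members: set[str],
                   reviewer_pool: set[str], k: int = 2) -> set[str]:
    """Return k reviewers with no role in producing the change."""
    independent = reviewer_pool - change_authors - ai_loop_members
    if len(independent) < k:
        # If every available reviewer sat inside the production loop,
        # the "verification" would be proxy feedback confirming itself.
        raise RuntimeError("no independent verification possible: "
                           "the review loop runs through the production loop")
    return set(sorted(independent)[:k])  # deterministic pick for the sketch

# Example: the whole pool worked inside the AI loop -> braking fails loudly.
# pick_reviewers({"ana"}, {"ben", "caro"}, {"ana", "ben", "caro"})  # raises
```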

Boundary Activity

Work that translates between the proxy-level metrics an organization optimizes for and the criterion-level standards it is accountable to. The framework suggests boundary activity is more effective when it operates against, rather than alongside, the dominant institutional logic. The practical corollary is that boundary activity requires institutional protection (funding, authority, organizational visibility) precisely because it generates friction that market logic would prefer to eliminate.

Three Diagnostic Questions

The mechanism is not predictive. The framework does not say what will happen. The framework says what to check. For any specific AI engagement, an organization can ask:

Is anyone in the organization registering that the metrics being optimized might not track what the organization is accountable for? If no one is registering that, the fix is not better dashboards. It is rebuilding the epistemic confidence to trust human judgment over the system's authority.
Do the people with consequence-based judgment have institutional pathways to act on what they know? Having the right people is not enough if no pathway connects their knowledge to organizational decisions.
Are evaluation rituals assessing against accountable criteria, or only against the metrics the engagement made legible? If the quarterly review measures speed, throughput, and volume but does not independently assess robustness, defensibility, or validity, the organization is evaluating the proxy.
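One way to operationalize the three questions is as an explicit review-time checklist. The sketch below is a hypothetical illustration; the field names and the drift heuristic are our assumptions, not the framework's specification.

```python
from dataclasses import dataclass

@dataclass
class EngagementDiagnostic:
    """One AI engagement, scored on the framework's three capacities."""
    detection: bool          # Q1: is anyone registering proxy/criterion divergence?
    pathways: bool           # Q2: do consequence-based judges have routes to act?
    criterion_rituals: bool  # Q3: do rituals assess accountable criteria independently?

def drift_risk(d: EngagementDiagnostic) -> str:
    missing = [name for name, ok in [
        ("detection", d.detection),
        ("institutional pathways", d.pathways),
        ("criterion-level rituals", d.criterion_rituals),
    ] if not ok]
    if not missing:
        return "braking possible: all three capacities present"
    return "drift likely: missing " + ", ".join(missing)

print(drift_risk(EngagementDiagnostic(detection=True, pathways=False, criterion_rituals=True)))
# -> drift likely: missing institutional pathways  (the Brynjolfsson firm's failure mode)
```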

The most consequential organizational investment the framework identifies is not in AI capability but in criterion-level evaluation infrastructure: the practices, roles, and institutional resources that maintain the distinction between what AI engagement makes legible and what the organization is accountable for producing. The reason the investment cannot be deferred is the final claim: the capacity organizations need to evaluate whether to act is being eroded by the thing they are evaluating.

Implications for Policy

Regulatory frameworks that mandate criterion-level evaluation (not just proxy-level reporting) would function as institutional braking. Professional bodies that maintain consequence-based certification standards would function as judgment preservation infrastructure. Both should prove more effective than technology-specific regulation, because they address the evaluation mechanism rather than the technology itself.