Gaining Ground, Losing Sight

How AI Engagement Constitutes the Proxy Metrics That Erode Organizational Judgment

Progress Update and Empirical Plans

Vikram Bapat · Ecosystems, Platforms & Strategy · IfM, University of Cambridge

Speaker Notes
Welcome, Michael. Thank you for making time. We have not met for over six weeks, and a lot has happened. This update has two parts. First, where the theoretical work stands, covering the framework, the evidence, and the publication strategy. Second, the empirical plans, covering the interview design, pilot strategy, and where your input would shape the next phase. I will move through the theory slides at pace since you know the foundations, and slow down on the empirical section where I most need your thinking.

The puzzle

Why do experienced practitioners sincerely believe AI is improving their work when independent measurement reveals persistent gaps between perceived and actual outcomes?
The pattern is not confined to one profession. It recurs across software engineering, management consulting, and clinical medicine. It resists correction through expertise, feedback, and organizational learning.

The research question is not whether AI works. It is whether organizations can tell.

Speaker Notes
This is the puzzle that drives everything. The key word is 'sincerely.' These are not practitioners gaming metrics. They believe the improvements are real. And the pattern is not confined to one domain. It appears in software engineering with the METR study, in management consulting with Dell'Acqua, and in clinical medicine with Lebovitz. The second half of the question matters equally: why does the misjudgment resist correction? Expertise does not fix it. Feedback does not fix it. Organizational learning does not fix it. That resistance is what makes this a theory-level problem, not an implementation problem.

What existing theory misses

Existing Accounts
Each explains something.
None reaches the mechanism.
Institutional logics
Does not ask whether the technology constitutes the confirming evidence that selects between logics.
Sociomateriality
Assumes practitioners can evaluate their own reconstitution using unaffected criteria.
Dynamic capabilities
Assumes the sensing apparatus survives reconfiguration.
Goodhart's Law
Explains metric corruption through strategic gaming. The 39-percentage-point gap persists under sincere belief.
Integration Gap
No existing theory integrates all four:
1. Technology constitutes new observables more legible than what the organization is accountable to
2. Engagement transforms the evaluator such that pre-engagement criteria become structurally unreliable
3. Institutional feedback reinforces proxy optimization through competing logic asymmetry
4. Field-level discourse constitutes the evaluative frame before any organization tests it operationally
The evaluative continuity assumption persists because no account has specified why, mechanistically, it should fail.
Theoretical Architecture
Four resources, sequentially linked.
Performativity (MacKenzie, Callon)
Pre-legitimizes proxy metrics at the field level before any organization tests them.
Form/function (Faulkner & Runde)
Fertile form enables positioning drift. Same tool, different proxy-criterion gaps.
Transformative experience (Paul)
Pre-engagement evaluator is not the post-engagement evaluator.
Institutional logics (Thornton et al.)
AI systematically elevates proxy metrics while occluding accountable criteria.
The contribution is the integration.
Speaker Notes
You know these literatures, so I will be brief. The left column is what exists. Each framework explains part of the pattern. Institutional logics explains competing metrics but not how the technology constitutes the confirming evidence. Sociomateriality recognises mutual constitution but assumes the evaluator can assess their own reconstitution. Dynamic capabilities assumes the sensing apparatus survives. Goodhart explains gaming but not sincere belief. The centre column is the gap. No existing framework integrates all four features the evidence demands. The right column is what I am building. Four resources, sequentially linked. The contribution is the integration, not any individual resource. MacKenzie and Callon give me field-level performativity. Faulkner and Runde give me form-function positioning drift. Paul gives me evaluator transformation. Thornton gives me the logic asymmetry that drives the substitution. None alone reaches the mechanism. Together they trace how AI engagement may constitute the metrics that look like progress while eroding the judgment needed to see otherwise.

The emerging mechanism

AI engagement may simultaneously constitute the metrics that look like progress and erode the judgment needed to see otherwise.
1 · Proxy elevation
AI constitutes legible, attractive metrics (speed, volume, certainty) that displace less visible accountable criteria. The new observables are more readily measured than what they replace.
2 · Evaluator transformation
Engagement changes the practitioner such that pre-engagement judgment becomes unavailable for comparison. The evaluator assessing outcomes is not the evaluator who set the original criteria.
3 · Capacity erosion
The organizational conditions required to detect the substitution degrade over time. The pipeline that builds judgment shifts from producing to reviewing, and feedback that would signal a problem looks like confirmation.

If the framework holds, this erosion is self-concealing: the same mechanism that produces the pattern prevents organizations from recognizing it.

Speaker Notes
Three phases, stated conditionally. Proxy elevation: AI constitutes new metrics that are more legible than what they displace. Evaluator transformation: the practitioner changes through engagement, so pre-engagement judgment is unavailable for comparison. This is Paul's contribution, and it is the precondition for the rest. Capacity erosion: the organizational conditions that would detect the substitution degrade over time. The pipeline shifts from producing to reviewing, and the feedback loops that would signal a problem look like confirmation. The self-concealing quality is the distinctive claim. This is not a framework that says 'things go wrong.' It says 'things go wrong in a way that prevents you from knowing they have gone wrong.'

Evidence constellation

Patterned, persistent, resistant to expertise
39pp · Perception-reality gap
Developers perceived 20% speedup. Objective measurement: 19% slower. Sincere belief, not gaming.
METR 2025 (RCT, N=16)
75% · Scaffolded, not internalized
AI closed 75% of the education gap during execution. The gap reappeared in full when AI was removed.
Cruces et al. 2026 (NBER)
94% · Idea overlap
AI-generated ideas: 94% overlap. Human ideas: 100% unique. Individual quality up, collective diversity collapsed.
Meincke, Collins & Evans 2025
Null · Effect at population scale
Widespread adoption, confident practitioners. Administrative records: null effects on earnings and hours.
Humlum & Vestergaard 2025
+34% · Novices gain, experts drift
Novice agents +34%. Expert agents: negligible gain, quality decline, increased AI adherence over time.
Brynjolfsson et al. 2025 (N=5,172)
90/14 · Confidence decoupled from outcomes
90% of daily AI users confident. 14% achieve consistently positive outcomes. 37% of time saved lost to rework.
Workday 2026 (N=3,200)
Speaker Notes
Six pieces of evidence, each from an independent study, each showing a different facet of the same pattern. METR: 39 percentage-point gap between perceived and actual performance. Cruces: 75% of the education gap closes during AI use but reappears entirely when AI is removed. Scaffolding, not internalisation. Meincke: 94% overlap in AI-generated ideas while human ideas remain 100% unique. Individual quality rises while collective diversity collapses, and no individual can detect it. Humlum and Vestergaard: population-scale null effect on earnings and hours despite widespread adoption and practitioner confidence. Brynjolfsson: novices gain 34%, experts show negligible gain with quality decline and increasing AI adherence over time. Workday: 90% of daily users are confident, 14% achieve consistently positive outcomes. These are not cherry-picked. They converge on a single structure: the visible metric improves while the accountable criterion degrades, and the practitioners involved cannot see the divergence.

Where things stand

Three milestones since we last met
Milestone 1 · Publication strategy taking shape
Two papers in progress, each with a distinct objective building on the theoretical foundation. One targets the performativity conversation in organization studies. The other engages management research with an explicit research agenda and empirical design. Both draw on PSF but contribute to different scholarly conversations.
Milestone 2 · IfM PhD Conference (May 19)
Abstract submitted. Poster and 10-minute presentation in progress. Framework unnamed in public materials: "Proxy Metrics, Evaluative Capacity, and the Hidden Costs of AI Engagement." First public test of the argument with an academic audience.
Milestone 3 · Empirical phase preparation
Interview protocol being developed. Pilot interviews planned for summer. Collaboration with Steven Clarke (Microsoft) and Katharine Norwood (Google) provides practitioner access across organizational contexts. Ethics approval timeline mapped.
Speaker Notes
Three milestones since we last met. First, two papers are taking shape, each targeting a different scholarly conversation. I can go into the details if useful, but the key point is that each makes a distinct contribution and they are complementary, not redundant. Second, the IfM PhD Conference on May 19 is the first public test of the argument. The framework is unnamed in the abstract and poster. The title describes the research territory. Third, the empirical phase is actively being prepared: protocol development, pilot planning, and collaboration with Steven Clarke at Microsoft and Katharine Norwood at Google for practitioner access.

Empirical plans

The theoretical framework proposes a mechanism. The empirical phase tests whether that mechanism operates in practice, and if so, how practitioners experience it.
Speaker Notes
Shifting to the empirical plans. This is where I most want your thinking.

Two populations, sequenced

Population 1: Frontline practitioners
Technology practitioners across professional domains
Surface how evaluative criteria shift during AI engagement: what practitioners measure, what they trust, and how their sense of quality changes over time. The goal is to capture criteria shift in practitioners' own terms rather than imposing the framework's categories.
Population 2: Boundary activity performers
Roles where institutional expectations meet operational realities
Surface the translation work that mediates between what organizations measure and what they are accountable for, and test whether and how proxy-criterion displacement propagates through organizational structures. Boundary activity may be distributed across roles rather than concentrated in designated positions.

Frontline first. The criteria shifts must be documented before tracing propagation. Current thinking is a 70/30 split favouring frontline practitioners.

Speaker Notes
Two populations, deliberately sequenced. Frontline practitioners surface criteria shift from inside. Boundary activity performers surface translation work between what organisations measure and what they are accountable for. Frontline first, because the criteria shifts need to be documented before I can trace how they propagate. Current thinking is a 70/30 split favouring frontline. That split is open for discussion, particularly given your work on boundary relations. The boundary activity framing comes from your 2012 paper. I am sampling for the activity, not the role, because translation work may be distributed across positions rather than concentrated in designated ones.

What PSF generates as interview questions

The framework does not ask "what do you think about AI?" It asks questions that only PSF would generate.
Proxy elevation
What do you measure now that you did not measure before AI?
What have you stopped tracking?
When you say the work is "better," what specifically are you comparing?
Evaluator transformation
Has your sense of what "good" looks like changed since you started using AI tools?
Could you do the same work without AI now that you could do before?
How do you know when to override the tool's suggestion?
Capacity erosion
How do new team members learn to evaluate quality?
When was the last time your team caught something AI-generated that looked right but was wrong?
What would have to happen for your organization to slow down AI adoption?
Speaker Notes
This slide shows what PSF generates as interview questions that a generic 'tell me about AI' protocol would not produce. Under proxy elevation: what do you measure now that you did not measure before? What have you stopped tracking? Under evaluator transformation: has your sense of what good looks like changed? Could you still do this work without AI? Under capacity erosion: how do new team members learn to evaluate quality? When did your team last catch something AI-generated that looked right but was wrong? These questions are non-obvious. They come directly from the mechanism. If the framework is right, these questions should surface criteria shift in practitioners' own language without leading them toward the theory.

Pilot plan and collaboration

Pilot interviews (Summer 2026)
3-5 frontline practitioners to test the protocol before main fieldwork. Goals: confirm PSF-generated questions surface criteria shift in practitioners' own language, test whether the three-phase structure organizes naturally or forces the data, and calibrate interview length and depth.
Ethics approval to be submitted May-June 2026. Main fieldwork planned for Autumn 2026.
Practitioner access
Steven Clarke (Microsoft, Developer Research) provides access to developer tool teams and practitioners working with AI-assisted development workflows. His research on API usability connects directly to evaluative criteria shift in tool-mediated work.
Katharine Norwood (Google, UX Research) provides access to practitioners navigating AI integration across product and research contexts. Her work surfaces exactly the boundary activity the framework theorizes.

Open questions: the right balance between frontline and boundary activity interviews, how to identify boundary activity when it is distributed across roles, and how to sequence paper submissions relative to the empirical timeline.

Speaker Notes
The pilot is planned for this summer, 3 to 5 frontline practitioners. The goals are to test whether the PSF-generated questions land, whether the three-phase structure organises the data naturally or forces it, and to calibrate interview length and depth. Steven Clarke provides access to developer tool teams at Microsoft. His research on API usability connects directly to evaluative criteria in tool-mediated work. Katharine Norwood provides access to practitioners navigating AI integration at Google. Her UX research work surfaces exactly the boundary activity the framework theorises. Ethics approval submission is planned for May to June. Main fieldwork for autumn.

Discourse traps as analytical infrastructure

Practitioner discourse does not just describe proxy seduction. It naturalizes it. The diagnostic identifies discursive moves that conceal evaluative capacity erosion behind sincere belief.
Observed (8 traps, inductive)
Historical Normalization: "We survived calculators."
Expertise Immunity: "Good developers will be fine."
Judgment-as-Bottleneck: "AI removes the slow part."
Conditional Stack: "As long as you review carefully..."
Present-Tense Projection: "I can still tell the difference."
Recognition-as-Protection: "I know the risks."
Fundamentals Endurance: "The fundamentals don't change."
Transition Naturalization: "This is just how work evolves."
Hypothesized (6 traps, deductive)
Augmentation Frame: "AI augments, it doesn't replace."
Democratic Access: "AI levels the playing field."
Reversibility Assumption: "We can always go back."
Metrics Improvement: "The numbers are up across the board."
Feedback Confidence: "We'd know if something was wrong."
Individual Agency: "I choose when to use it."
If hypothesized traps appear in interviews, that constitutes evidence of PSF's generative power.
Analytical function
Each trap has a PSF counter-question designed for interview follow-up. The counter-question probes the assumption without leading the interviewee.
Traps co-occur in predictable clusters. Historical Normalization + Fundamentals Endurance + Transition Naturalization produces a "sorting narrative" that frames displacement as natural evolution.
The diagnostic is a living document. Fieldwork may reveal new traps, collapse existing ones, or move hypothesized traps to observed status.
Speaker Notes
This is the analytical infrastructure I have been building alongside the theoretical framework. Fourteen discourse traps that identify how practitioner talk naturalises evaluative capacity erosion. Eight are inductively observed from practitioner discourse: things like Historical Normalisation, where practitioners say 'we survived calculators,' or Expertise Immunity, where they say 'good developers will be fine.' Six are deductively hypothesised from the PSF mechanism: things the framework predicts practitioners should say but I have not yet observed in the wild. If the hypothesised traps surface in interviews, that is evidence of the framework's generative power. It predicts discursive forms it has not yet seen. That distinction between observed and hypothesised traps is itself a methodological contribution worth flagging.

Fieldwork coding rubric

Five dimensions coded simultaneously as interview data flows in.
D1 · Trap identification
Which traps is the practitioner deploying? Multiple can co-occur. Track clusters.
D2 · Mechanism component
Precondition, dual constitution, proxy elevation, capacity erosion (individual, organizational, field), or self-concealing quality.
D3 · Evidence category
Practitioner account, corporate claim, empirical data, supporting theory, or foil. Aligned with the Evidence Constellation.
D4 · Epistemic status
Confirming, extending, qualifying, or challenging the mechanism.
D5 · Interview context
Frontline practitioner or boundary activity performer. PSF predicts traps distribute differently across the two populations.
Speaker Notes
The fieldwork coding rubric handles five dimensions simultaneously. Dimension one: which trap is being deployed. Dimension two: which PSF mechanism component the data touches. Dimension three: evidence category, aligned with the evidence constellation so data can flow between tools. Dimension four: epistemic status, whether the data confirms, extends, qualifies, or challenges the mechanism. Dimension five: interview context, because PSF predicts traps distribute differently across frontline practitioners and boundary activity performers. Frontline practitioners should deploy more individual-level traps. Boundary activity performers should surface organisational-level patterns.
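
As an illustration of how the five-dimension rubric might be operationalised during analysis, a minimal sketch of a single coded excerpt follows. This is a hypothetical structure, not part of the protocol: the field names, enumerations, and the example excerpt are illustrative, and actual coding would more likely live in NVivo or a spreadsheet than in code.

```python
# Illustrative sketch only: one way a coded interview excerpt could be
# recorded across the five rubric dimensions. All names are hypothetical.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class MechanismComponent(Enum):  # D2
    PRECONDITION = "precondition"
    DUAL_CONSTITUTION = "dual constitution"
    PROXY_ELEVATION = "proxy elevation"
    CAPACITY_EROSION = "capacity erosion"
    SELF_CONCEALING = "self-concealing quality"


class EvidenceCategory(Enum):  # D3
    PRACTITIONER_ACCOUNT = "practitioner account"
    CORPORATE_CLAIM = "corporate claim"
    EMPIRICAL_DATA = "empirical data"
    SUPPORTING_THEORY = "supporting theory"
    FOIL = "foil"


class EpistemicStatus(Enum):  # D4
    CONFIRMING = "confirming"
    EXTENDING = "extending"
    QUALIFYING = "qualifying"
    CHALLENGING = "challenging"


class InterviewContext(Enum):  # D5
    FRONTLINE = "frontline practitioner"
    BOUNDARY = "boundary activity performer"


@dataclass
class CodedExcerpt:
    """One interview excerpt coded across the five rubric dimensions."""
    excerpt: str
    traps: list[str] = field(default_factory=list)  # D1: traps can co-occur
    mechanism: Optional[MechanismComponent] = None
    evidence: Optional[EvidenceCategory] = None
    status: Optional[EpistemicStatus] = None
    context: Optional[InterviewContext] = None


# Hypothetical example: one excerpt deploying two co-occurring traps.
example = CodedExcerpt(
    excerpt="We survived calculators; the fundamentals don't change.",
    traps=["Historical Normalization", "Fundamentals Endurance"],
    mechanism=MechanismComponent.PROXY_ELEVATION,
    evidence=EvidenceCategory.PRACTITIONER_ACCOUNT,
    status=EpistemicStatus.CONFIRMING,
    context=InterviewContext.FRONTLINE,
)
```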

Pilot opportunity: AI Dev Conference

The opportunity
A concentrated pool of frontline technology practitioners actively engaged with AI tools. The conference environment provides natural access to developers, team leads, and product managers who are immersed in the discourse the framework theorizes about.
Short, focused conversations (20-30 minutes) rather than full protocol interviews. The goal is to test whether PSF-generated questions land in practitioners' own language, not to collect primary data.
Informal setting lowers defensiveness. Practitioners at conferences are already in a reflective mode.
What the pilot tests
Question clarity: Do PSF-generated questions surface criteria shift without leading?
Trap recognition: Which discourse traps appear spontaneously? Do hypothesized traps surface?
Domain vocabulary: How do practitioners in this community talk about evaluation, quality, and judgment?
Protocol calibration: Which questions open rich conversation and which fall flat?
Boundary activity signals: Do any practitioners spontaneously describe translation work between institutional expectations and operational realities?
Speaker Notes
The AI Dev Conference offers a concentrated pool of frontline technology practitioners who are actively engaged with AI tools. The format is short, focused conversations rather than full protocol interviews. The goal is to test the protocol, not collect primary data. The conference environment is useful because practitioners are already in a reflective mode about their practice, and the informal setting lowers defensiveness. Five things to test: whether PSF questions land clearly, which discourse traps appear spontaneously, how practitioners in this community talk about evaluation and quality, which questions open rich conversation and which fall flat, and whether anyone spontaneously describes boundary activity.

From pilot to main fieldwork

The pilot produces four deliverables that shape the main fieldwork.
1 · Refined interview protocol
Questions tested against real practitioners: probes that surface criteria shift are kept, questions that lead or confuse are cut. Domain-specific vocabulary incorporated. Length and depth calibrated.
2 · Trap distribution baseline
Which of the 14 traps appeared spontaneously? Did hypothesized traps surface (evidence of generative power) or remain absent (refines the theory)? Which traps clustered together?
3 · Sampling refinement
Does the 70/30 frontline-to-boundary-activity split hold, or does the pilot suggest a different balance? Are there practitioner profiles the protocol does not reach that it should?
4 · Rubric calibration
Test the five-dimension coding scheme against real data. Identify dimensions that are easy to code, dimensions that require judgment calls, and any dimension gaps the pilot reveals.

The pilot is not a smaller version of the main study. It is a methodological instrument that sharpens every element of the research design before committing to full-scale data collection.

Speaker Notes
The pilot produces four deliverables. A refined interview protocol with questions tested against real practitioners. A trap distribution baseline showing which traps appeared, whether hypothesised traps surfaced, and how they clustered. Sampling refinement, including whether the 70/30 split holds. And rubric calibration testing the five-dimension coding scheme against real data. The pilot is not a smaller version of the main study. It is a methodological instrument that sharpens every element of the research design before committing to full-scale data collection.

Why two papers, not one

Paper A · Organization Science (Perspectives)
Conversation it enters: Performativity and the constitutive role of technology in organizational evaluation. Anchored in Cabantous & Gond (2011).
Distinct contribution: Shorter format, broader theoretical scope. Argues that AI engagement is a case of performativity that existing performativity scholarship has not theorized: the technology constitutes the metrics by which its own effects are evaluated.
Evidence additions: Dell'Acqua et al. (consulting), Lebovitz et al. (clinical medicine, pre-LLM). Broadens the evidence beyond software development.
Paper B · International Journal of Management Reviews
Conversation it enters: Management research on AI and organizations, with explicit engagement across six literature streams.
Distinct contribution: Longer format, four research questions, full research agenda. Engages surrogation, competency traps, and automation complacency as adjacent constructs and shows what PSF adds beyond each.
Structure: Review-style paper with theoretical framework, evidence constellation, and empirical design. Designed to serve as the reference paper for the PhD thesis.

The papers are complementary, not redundant. OS establishes the theoretical move (performativity + evaluative capacity). IJMR provides the full apparatus (mechanism, evidence, research agenda). Each can stand alone. Together they build the publication record toward a thesis by papers.

Speaker Notes
The question of why two papers deserves a direct answer. They enter different scholarly conversations. The OS paper targets the performativity literature. Its anchor is Cabantous and Gond 2011. Its contribution is showing that AI engagement is a case of performativity that existing performativity scholarship has not theorised. The IJMR paper targets the management research conversation. It provides the full apparatus: mechanism, evidence, four research questions, and a research agenda. It engages surrogation, competency traps, and automation complacency as adjacent constructs and shows what PSF adds beyond each. The papers are complementary. OS establishes the theoretical move. IJMR provides the reference paper for the thesis. Each can stand alone. Together they build the publication record for a thesis by papers.