Four paired-measure specifications for coding interview transcripts into dumbbell chart visualizations. Each pairing maps a proxy-side measure against an accountable-criterion-side measure. The gap between the two dots is the PSF mechanism made visible.
Tests whether practitioners who report the highest improvement articulate the thinnest evaluative criteria, as the self-concealing mechanism predicts.
| SAI score | Label | Indicators |
|---|---|---|
| 1 | Net negative | Reports AI engagement has made work worse, slower, or less reliable on balance |
| 2 | Marginal | Mixed signals. Some improvements noted but hedged with significant caveats |
| 3 | Moderate positive | Clear improvement in specific areas (speed, boilerplate) but not generalized to overall quality |
| 4 | Strong positive | Improvement claimed across multiple dimensions. Confident language. Few unprompted caveats |
| 5 | Transformative | Describes AI as fundamentally changing what is possible. Near-total confidence in AI-assisted output |

| CAD score | Label | Indicators |
|---|---|---|
| 1 | Proxy-only | All quality language references observable proxies: speed, volume, test pass rate, "it works" |
| 2 | Proxy-dominant | Mostly proxy language with vague gestures toward deeper criteria ("should be maintainable" with no elaboration) |
| 3 | Mixed | Both proxy and accountable criteria present. Can name dimensions like robustness but doesn't spontaneously elaborate |
| 4 | Criteria-rich | Multiple accountable criteria with specificity. Concrete examples of judgment calls. Can describe what "good" looks like beyond metrics |
| 5 | Criteria-sovereign | Multi-dimensional quality framework. Explains why certain criteria resist measurement. Can describe the PSF mechanism in own vocabulary |
Cell 1 (senior, deep engagement): widest bars. High SAI (4-5), low-to-moderate CAD (1-3). Evaluative erosion has progressed furthest.
Cell 2 (senior, shallow engagement): narrow or reversed bars. Moderate SAI (2-3), high CAD (4-5). Pre-engagement criteria vocabulary remains intact.
Cell 3 (junior, deep engagement): moderate-to-wide bars. Moderate-to-high SAI (3-5), low CAD (1-2). Criteria never formed, which is the premature arrest signal.
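The bar length implied by these predictions can be sketched as a signed difference between the two ratings. This is a minimal sketch, not a prescribed procedure: the `PairedRating` class, its field names, and the three example participants are hypothetical, and the choice to compute the gap as SAI minus CAD is an assumption made so that "reversed" bars come out negative.

```python
from dataclasses import dataclass


@dataclass
class PairedRating:
    participant_id: str  # hypothetical identifier
    cell: int            # cell number as used in the predictions above
    sai: int             # holistic SAI rating, 1-5
    cad: int             # holistic CAD rating, 1-5

    def __post_init__(self):
        # Both rubrics use a 1-5 scale, so reject anything outside it.
        for name in ("sai", "cad"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def gap(self) -> int:
        # Assumed convention: positive when SAI exceeds CAD, negative when reversed.
        return self.sai - self.cad


# Illustrative values only, not data from any interview.
ratings = [
    PairedRating("P01", cell=1, sai=5, cad=2),
    PairedRating("P02", cell=2, sai=3, cad=4),
    PairedRating("P03", cell=3, sai=4, cad=1),
]
gaps = {r.participant_id: r.gap for r in ratings}
```

Keeping the sign makes the Cell 2 "reversed" case distinguishable from a wide Cell 1 bar of the same length.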
Instead of holistic ratings, this pairing codes individual utterances in each transcript to produce proportional scores. This is more labor-intensive but easier to defend in a methods section. It requires an inter-rater reliability check on 20% of transcripts.
Cell 1: proxy-dominant (PR > CR). Cell 2: consequence-dominant (CR > PR). Cell 3: proxy-dominant with low total utterance count (thin vocabulary overall, not just tilted). Cell 3 signals premature arrest: consequence vocabulary was never built, rather than built and then displaced.
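One way the proportional scores might be computed from coded utterances is sketched below. Several things here are assumptions rather than part of the specification: the three code labels (`"proxy"`, `"consequence"`, `"other"`), the reading of PR and CR as shares of all coded utterances, and the function name `proportional_scores`. The total is returned alongside the ratios because the Cell 3 prediction depends on it.

```python
from collections import Counter

# Assumed labels for the three-way classification; the actual codebook may differ.
VALID_CODES = {"proxy", "consequence", "other"}


def proportional_scores(codes):
    """Return PR, CR, and the total utterance count for one transcript."""
    unknown = set(codes) - VALID_CODES
    if unknown:
        raise ValueError(f"unrecognized codes: {sorted(unknown)}")
    counts = Counter(codes)
    total = len(codes)
    if total == 0:
        # No coded utterances: ratios are undefined, so report zeros with total 0.
        return {"pr": 0.0, "cr": 0.0, "total": 0}
    return {
        "pr": counts["proxy"] / total,        # assumed denominator: all coded utterances
        "cr": counts["consequence"] / total,
        "total": total,
    }


# Illustrative coding of four utterances.
scores = proportional_scores(["proxy", "proxy", "consequence", "other"])
```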
Inter-rater: Required for this pairing given three-way classification. Sample size, reliability statistic, and threshold to be designed with supervisor.
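Because the reliability statistic is still to be chosen, the sketch below uses Cohen's kappa for two raters purely as a placeholder to show the shape of the check. It is not a recommendation, and the label values are the same assumed three-way codes, not the final codebook.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same utterances."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of utterances")
    n = len(rater_a)
    # Observed agreement: share of utterances given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Illustrative codes for four utterances.
a = ["proxy", "proxy", "consequence", "other"]
b = ["proxy", "consequence", "consequence", "other"]
kappa = cohens_kappa(a, b)
```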
| Confidence score | Label | Indicators |
|---|---|---|
| 1 | Very low | Expects problems. Treats AI output as draft requiring full review |
| 2 | Low | Checks extensively. Can name specific failure modes to look for |
| 3 | Moderate | Trusts within bounds, hedges beyond them |
| 4 | High | Spot-checks rather than reviews. Trusts unless given reason not to |
| 5 | Near-total | Reports AI output as reliable or more reliable than own unaided work |

| Normalized FEC score | Friction episodes recalled |
|---|---|
| 1 | 0 episodes (no concrete friction recalled) |
| 2 | 1 episode |
| 3 | 2-3 episodes |
| 4 | 4-5 episodes |
| 5 | 6+ episodes |
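The episode-count banding in the table can be written as a small lookup. The function name `normalize_fec` is hypothetical; the thresholds are taken directly from the table.

```python
def normalize_fec(episodes: int) -> int:
    """Map a raw count of recalled friction episodes onto the 1-5 scale."""
    if episodes < 0:
        raise ValueError("episode count cannot be negative")
    if episodes == 0:
        return 1
    if episodes == 1:
        return 2
    if episodes <= 3:
        return 3
    if episodes <= 5:
        return 4
    return 5  # 6 or more episodes


normalized = [normalize_fec(n) for n in (0, 1, 2, 3, 4, 5, 6, 10)]
```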
The two measures are expected to be inversely related. High confidence paired with a low friction count is the self-concealing signal. The cross-cutting probe disambiguates a low count: a genuine absence of friction versus an absence of feedback loops that would surface friction.
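A rough way to surface candidates for the cross-cutting probe is to flag cases that combine high confidence with a low friction score. The cut-offs used here (confidence of 4 or above, friction score of 2 or below) are assumptions for illustration, as is the function name; the rubric itself does not fix them.

```python
def self_concealing_flag(confidence: int, fec_score: int,
                         high_confidence: int = 4, low_friction: int = 2) -> bool:
    """Flag a participant whose scores match the self-concealing pattern.

    Both thresholds are assumed defaults, not values from the rubric.
    A True result marks a case for the cross-cutting probe, not a finding.
    """
    return confidence >= high_confidence and fec_score <= low_friction


flags = [
    self_concealing_flag(5, 1),  # high confidence, no friction recalled
    self_concealing_flag(5, 4),  # high confidence, several episodes recalled
    self_concealing_flag(2, 1),  # low confidence, no friction recalled
]
```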
This pairing applies where a boundary activity performer can be matched to frontline practitioners from the same team or organization. Dot A comes from the frontline interview ("What does your dashboard track?"). Dot B comes from the boundary interview ("What do you pay attention to that the dashboard doesn't show?").
| Score | Label | Description |
|---|---|---|
| 1 | Pure proxy | Metric or concern is entirely about speed, volume, throughput |
| 2 | Proxy-leaning | Mostly operational metrics with one nod to quality |
| 3 | Mixed | Both proxy and accountable concerns present |
| 4 | Accountable-leaning | Primary concern is judgment quality, downstream reliability, criteria integrity |
| 5 | Pure accountable | Entirely about evaluative capacity, team judgment, or criteria the dashboard cannot capture |
Wide bars = boundary performer monitoring something the frontline does not report. This is boundary activity doing its work: holding criteria the dashboard dropped. Narrow bars = either convergence (Cell 2 teams) or second-order erosion (boundary performer captured by proxy metric).
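The matching step might look like the sketch below, which pairs each frontline score with the score of the boundary performer on the same team. The record layout, the `team` key, and the decision to draw one bar per frontline practitioner are all assumptions. The gap alone cannot separate convergence from second-order erosion, so it is reported without interpretation.

```python
def match_pairs(frontline, boundary_by_team):
    """Pair frontline scores (Dot A) with the team's boundary score (Dot B)."""
    pairs = []
    for person in frontline:
        boundary = boundary_by_team.get(person["team"])
        if boundary is None:
            continue  # no matched boundary performer, so no bar can be drawn
        pairs.append({
            "frontline_id": person["id"],
            "boundary_id": boundary["id"],
            "dot_a": person["score"],
            "dot_b": boundary["score"],
            "gap": boundary["score"] - person["score"],
        })
    return pairs


# Illustrative records; identifiers and scores are made up.
frontline = [
    {"id": "F01", "team": "T1", "score": 2},
    {"id": "F02", "team": "T1", "score": 1},
    {"id": "F03", "team": "T2", "score": 3},  # team without a boundary interview
]
boundary_by_team = {"T1": {"id": "B01", "score": 5}}
pairs = match_pairs(frontline, boundary_by_team)
```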
Controls for the dispositional confound. If Cell 2 practitioners appear evaluatively intact because they are systematic by temperament (not because shallow engagement preserved their criteria), the PSF claim weakens. This dimension is orthogonal to the 2x2: it describes how someone approaches evaluation, not whether they adopted AI. Code from technology-history questions and general (non-AI-specific) descriptions of their evaluation process.
| Code | Label | Indicators |
|---|---|---|
| O | Opportunistic | Grabs the fastest available signal and moves on. Evaluates by whatever is most legible in the moment. Pre-AI: relied on test pass rates, manager approval, or "it shipped." Post-AI: relies on whichever proxy AI makes effortlessly available. Phrases like "I look at what's fastest and iterate from there." |
| P | Pragmatic | Weighs tradeoffs situationally. Uses proxies when stakes are low, shifts to accountable criteria when stakes are high. Calibrates effort to perceived risk. Phrases like "It depends on the context and the risk" or "For a quick fix I trust it, for core infrastructure I check everything." |
| S | Systematic | Builds evaluative frameworks deliberately. Maintains explicit criteria checklists (mental or written). Evaluates against failure modes by default, not just when prompted. Phrases like "I check against a list of things that have burned me before" or "I always ask what happens at 3am when nobody's watching." |
Systematic × Cell 1: The strongest test case. If a systematic practitioner in Cell 1 still shows a wide SAI-CAD gap, the proxy seduction claim is strong because their disposition should have protected them. If they show a narrow gap, they represent a natural resistance case worth probing for recovery mechanisms.
Opportunistic × Cell 1: Potential boundary condition. If opportunistic Cell 1 practitioners had thin criteria vocabulary even when describing pre-AI work, AI did not seduce them away from accountable criteria. They were never oriented that way. Wide gaps here may overstate PSF's explanatory reach.
Pragmatic × Cell 1 vs. Pragmatic × Cell 2: The directional test. PSF predicts AI engagement gradually shifts pragmatists toward the opportunistic end because proxy metrics become so much easier to access that the threshold for "when stakes are high enough" keeps rising. If pragmatic Cell 1 practitioners show wider gaps than pragmatic Cell 2 practitioners, that shift is the mechanism in action.
Cell 3 (any work style): If Cell 3 shows uniformly low CAD across all three work styles (opportunistic, pragmatic, and systematic juniors all scoring CAD 1-2), the premature arrest claim holds because the absence of criteria vocabulary cuts across dispositions.
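The Cell 3 check described above can be expressed as a simple test over (work style, CAD) pairs. The function name and input format are hypothetical; the CAD ceiling of 2 and the O/P/S codes come from the rubric. The sketch returns `None` when a work style has no Cell 3 cases, on the assumption that the claim cannot be evaluated without all three.

```python
def cell3_cad_uniformly_low(records, threshold=2):
    """Check whether every work style in Cell 3 has all CAD scores at or below threshold.

    records: iterable of (work_style, cad) pairs for Cell 3 participants,
    where work_style is "O", "P", or "S".
    """
    by_style = {"O": [], "P": [], "S": []}
    for style, cad in records:
        by_style[style].append(cad)
    if any(not scores for scores in by_style.values()):
        return None  # at least one work style is unrepresented
    return all(max(scores) <= threshold for scores in by_style.values())


# Illustrative inputs only.
holds = cell3_cad_uniformly_low([("O", 1), ("P", 2), ("S", 2)])
fails = cell3_cad_uniformly_low([("O", 1), ("P", 2), ("S", 4)])
undetermined = cell3_cad_uniformly_low([("O", 1), ("P", 2)])
```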
Each pairing produces a JSON array for the dumbbell React component.
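As a sketch of what such an array could look like, the snippet below serializes coded rows into one object per participant. The field names (`id`, `cell`, `pairing`, `dotA`, `dotB`) are hypothetical and would need to match whatever props the actual component expects.

```python
import json


def to_dumbbell_records(rows, pairing):
    """Serialize coded rows into a JSON array, one object per dumbbell bar."""
    records = []
    for row in rows:
        records.append({
            "id": row["id"],
            "cell": row["cell"],
            "pairing": pairing,    # which of the four pairings the scores belong to
            "dotA": row["dot_a"],  # proxy-side measure
            "dotB": row["dot_b"],  # accountable-criterion-side measure
        })
    return json.dumps(records, indent=2)


# Illustrative rows, not interview data.
rows = [
    {"id": "P01", "cell": 1, "dot_a": 5, "dot_b": 2},
    {"id": "P02", "cell": 2, "dot_a": 3, "dot_b": 4},
]
payload = to_dumbbell_records(rows, pairing=1)
```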
Pairings 1 and 3 use holistic rating scales. Defend with Saldaña (2016) on magnitude coding and Miles, Huberman and Saldaña (2020) on within-case and cross-case displays.
Pairing 2 uses frequency-based content analysis. Defend with Krippendorff (2018).
The dumbbell visualization is a form of Tufte's (1983) "paired comparison" display. It is a pattern-display device, not a statistical test. The visual precedes the argument.
Include the full coding rubric in an appendix or supplementary materials, following transparency norms in Pratt (2009) on crafting qualitative research for AMJ.