RESEARCH DESIGN

Dumbbell Chart Coding Rubrics

Four paired-measure specifications for coding interview transcripts into dumbbell chart visualizations. Each pairing maps a proxy-side measure against an accountable-criterion-side measure. The gap between the two dots is the PSF mechanism made visible.

1
Self-Assessed Improvement vs. Criteria Articulation Depth
The core PSF dumbbell

Tests whether the practitioners who report the highest improvement articulate the thinnest evaluative criteria, which is the pattern the self-concealing mechanism predicts.

Dot A: Self-Assessed Improvement (SAI)
Source: "What made it good work?" / "How long did it take?" / "How confident are you it will hold up?"
Score | Label | Indicators
1 | Net negative | Reports AI engagement has made work worse, slower, or less reliable on balance
2 | Marginal | Mixed signals. Some improvements noted but hedged with significant caveats
3 | Moderate positive | Clear improvement in specific areas (speed, boilerplate) but not generalized to overall quality
4 | Strong positive | Improvement claimed across multiple dimensions. Confident language. Few unprompted caveats
5 | Transformative | Describes AI as fundamentally changing what is possible. Near-total confidence in AI-assisted output
Dot B: Criteria Articulation Depth (CAD)
Source: "What made it good work?" / "What are you looking for in review?" / "What doesn't the dashboard track?"
Score | Label | Indicators
1 | Proxy-only | All quality language references observable proxies: speed, volume, test pass rate, "it works"
2 | Proxy-dominant | Mostly proxy language with vague gestures toward deeper criteria ("should be maintainable" with no elaboration)
3 | Mixed | Both proxy and accountable criteria present. Can name dimensions like robustness but doesn't spontaneously elaborate
4 | Criteria-rich | Multiple accountable criteria with specificity. Concrete examples of judgment calls. Can describe what "good" looks like beyond metrics
5 | Criteria-sovereign | Multi-dimensional quality framework. Explains why certain criteria resist measurement. Can describe the PSF mechanism in own vocabulary
Key distinction: criteria named vs. criteria operationalized. "Maintainability" without elaboration = score 2-3. "Maintainability, meaning a new developer six months from now can read this code path and understand the design intent" = score 4-5. Code spontaneous articulation only. Note prompted criteria separately in memo.
PSF Predictions

Cell 1 (senior, deep engagement): widest bars. High SAI (4-5), low-to-moderate CAD (1-3). Evaluative erosion progressed furthest.

Cell 2 (senior, shallow engagement): narrow bars or reversed. Moderate SAI (2-3), high CAD (4-5). Pre-engagement criteria vocabulary intact.

Cell 3 (junior, deep engagement): moderate-to-wide bars. Moderate-to-high SAI (3-5), low CAD (1-2). Criteria never formed. Premature arrest signal.

2
Proxy Vocabulary Ratio vs. Consequence Vocabulary Ratio
Linguistic version, content-analysis based

Instead of holistic ratings, this pairing uses transcript coding to produce proportional scores. More labor-intensive but more defensible in a methods section. Requires an inter-rater reliability check; 20% of transcripts is a common starting sample, with the final design still to be set.

Procedure: Extract every quality utterance from the transcript. Classify each as proxy, accountable, or ambiguous. For each practitioner, calculate the proxy ratio (PR, the share of utterances classified as proxy) and the consequence ratio (CR, the share classified as accountable). PR + CR + ambiguous share = 1.0.
Proxy criteria
Observable, AI-legible, countable
Speed and throughput: time to completion, velocity, story points, PRs merged per sprint
Volume: lines of code, features shipped, tickets closed, commits
Test pass rate: unit tests passing, CI/CD green, coverage percentage
Surface cleanliness: linting, formatting, no warnings, "clean" code
User satisfaction proxies: CSAT score, NPS, ticket resolution time
Accountable criteria
Judgment-dependent, contextual, downstream
Robustness: behavior under untested conditions, edge cases, failure recovery
Maintainability: readability for future developers, design intent legibility, tradeoff documentation
Architectural coherence: system-level design integrity, coupling awareness, scalability implications
Domain appropriateness: whether the solution fits the business problem, not just the technical specification
Evaluative judgment: explicit reasoning about tradeoffs, "why this and not that"
Failure mode awareness: what could go wrong, how would we know, what feedback loop would surface it
Ambiguous
Classify as ambiguous unless the practitioner elaborates
"Quality" (unelaborated), "it works," "it's good," "reliable," "correct"
PSF Predictions

Cell 1: proxy-dominant (PR > CR). Cell 2: consequence-dominant (CR > PR). Cell 3: proxy-dominant with low total utterance count (thin vocabulary overall, not just tilted). Cell 3 signals premature arrest: consequence vocabulary never built, not displaced.
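The ratio calculation in the procedure can be sketched as follows. This is a minimal Python sketch, assuming each quality utterance has already been hand-classified; the function name `vocabulary_ratios` and the label strings are illustrative assumptions, not part of the rubric. The total utterance count is returned alongside the ratios because the Cell 3 prediction depends on it.

```python
from collections import Counter

# Assumed label strings for the three-way classification (illustrative).
LABELS = ("proxy", "accountable", "ambiguous")

def vocabulary_ratios(labels):
    """Return PR, CR, ambiguous share, and utterance count for one practitioner.

    `labels` holds one hand-assigned classification per quality utterance.
    """
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(labels)
    if total == 0:
        # No quality utterances at all: the ratios are undefined.
        return {"PR": None, "CR": None, "ambiguous": None, "n": 0}
    counts = Counter(labels)
    return {
        "PR": counts["proxy"] / total,
        "CR": counts["accountable"] / total,
        "ambiguous": counts["ambiguous"] / total,
        "n": total,
    }

# Hypothetical transcript: 6 proxy, 3 accountable, 1 ambiguous utterance.
example = ["proxy"] * 6 + ["accountable"] * 3 + ["ambiguous"]
ratios = vocabulary_ratios(example)
print(ratios)
```

Reporting `n` with the ratios keeps a thin vocabulary (few utterances overall) distinguishable from a tilted one (many utterances, mostly proxy).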

Inter-rater: Required for this pairing given three-way classification. Sample size, reliability statistic, and threshold to be designed with supervisor.

3
Friction Episode Count vs. Confidence Rating
The simplest pairing, minimal interpretive coding
Dot A: Confidence Rating
Score | Label
1 | Very low. Expects problems. Treats AI output as draft requiring full review
2 | Low. Checks extensively. Can name specific failure modes to look for
3 | Moderate. Trusts within bounds, hedges beyond them
4 | High. Spot-checks rather than reviews. Trusts unless given reason not to
5 | Near-total. Reports AI output as reliable or more reliable than own unaided work
Dot B: Friction Episode Count (FEC)
Concrete instances where AI output produced downstream problems, required non-trivial rework, introduced subtle errors, or bypassed judgment steps. Near-misses count. All spontaneous mentions included.
Score | Normalized FEC
1 | 0 episodes (no concrete friction recalled)
2 | 1 episode
3 | 2-3 episodes
4 | 4-5 episodes
5 | 6+ episodes
Important outlier: Confidence 5 / Friction 5 is someone who has experienced multiple failures and remains maximally confident. PSF interprets this as evidence the correction loop is not functioning: proxy confidence regenerates even after friction events.
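The FEC normalization and the outlier check can be sketched in a few lines. This is a minimal Python sketch, assuming whole-number episode counts; the function names `normalize_fec` and `is_loop_failure_outlier` are hypothetical helpers, not part of the rubric.

```python
def normalize_fec(episodes):
    """Map a raw friction episode count to the 1-5 normalized FEC score."""
    if episodes < 0:
        raise ValueError("episode count cannot be negative")
    if episodes == 0:
        return 1
    if episodes == 1:
        return 2
    if episodes <= 3:
        return 3
    if episodes <= 5:
        return 4
    return 5  # 6 or more episodes

def is_loop_failure_outlier(confidence, episodes):
    """Flag the Confidence 5 / Friction 5 case: repeated failures, maximal trust."""
    return confidence == 5 and normalize_fec(episodes) == 5

scores = [normalize_fec(n) for n in (0, 1, 2, 3, 4, 5, 6, 9)]
print(scores)
```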
PSF Prediction

Inverse relationship. High confidence paired with low friction count is the self-concealing signal. The cross-cutting probe disambiguates: absence of friction vs. absence of feedback loops that would surface friction.

4
Reported Team Metric vs. Boundary Performer's Actual Monitor
Boundary activity interviews only, matched teams

Where a boundary activity performer can be matched to frontline practitioners from the same team or organization. Dot A comes from the frontline interview ("What does your dashboard track?"). Dot B comes from the boundary interview ("What do you pay attention to that the dashboard doesn't show?").

Score | Label | Description
1 | Pure proxy | Metric or concern is entirely about speed, volume, throughput
2 | Proxy-leaning | Mostly operational metrics with one nod to quality
3 | Mixed | Both proxy and accountable concerns present
4 | Accountable-leaning | Primary concern is judgment quality, downstream reliability, criteria integrity
5 | Pure accountable | Entirely about evaluative capacity, team judgment, or criteria the dashboard cannot capture
PSF Prediction

Wide bars = boundary performer monitoring something the frontline does not report. This is boundary activity doing its work: holding criteria the dashboard dropped. Narrow bars = either convergence (Cell 2 teams) or second-order erosion (boundary performer captured by proxy metric).

Cross-cutting: Work Style Orientation
Applied to every practitioner, coded independently of AI engagement

Controls for the dispositional confound. If Cell 2 practitioners appear evaluatively intact because they are systematic by temperament (not because shallow engagement preserved their criteria), the PSF claim weakens. This dimension is orthogonal to the 2x2: it describes how someone approaches evaluation, not whether they adopted AI. Code from technology-history questions and general (non-AI-specific) descriptions of their evaluation process.

Why not conservative/moderate/adventurous? That axis redescribes the adoption stance. A "conservative" developer who resists AI maps almost perfectly onto Cell 2, so you measure the same thing twice. Opportunistic/pragmatic/systematic cuts at the work practice, not the personality, and creates genuinely informative cross-tabulations.
Code | Label | Indicators
O | Opportunistic | Grabs the fastest available signal and moves on. Evaluates by whatever is most legible in the moment. Pre-AI: relied on test pass rates, manager approval, or "it shipped." Post-AI: relies on whichever proxy AI makes effortlessly available. Phrases like "I look at what's fastest and iterate from there."
P | Pragmatic | Weighs tradeoffs situationally. Uses proxies when stakes are low, shifts to accountable criteria when stakes are high. Calibrates effort to perceived risk. Phrases like "It depends on the context and the risk" or "For a quick fix I trust it, for core infrastructure I check everything."
S | Systematic | Builds evaluative frameworks deliberately. Maintains explicit criteria checklists (mental or written). Evaluates against failure modes by default, not just when prompted. Phrases like "I check against a list of things that have burned me before" or "I always ask what happens at 3am when nobody's watching."
Coding source: Code from the technology-history question ("When your team adopted containers/microservices/new frameworks, how did you approach that?") and from how they describe their general evaluation process in non-AI-specific language. Do not code from their AI adoption stance directly. The point is to measure work style independently of the engagement axis.
Analytically Informative Cross-tabulations

Systematic × Cell 1: The strongest test case. If a systematic practitioner in Cell 1 still shows a wide SAI-CAD gap, the proxy seduction claim is strong because their disposition should have protected them. If they show a narrow gap, they represent a natural resistance case worth probing for recovery mechanisms.

Opportunistic × Cell 1: Potential boundary condition. If opportunistic Cell 1 practitioners had thin criteria vocabulary even when describing pre-AI work, AI did not seduce them away from accountable criteria. They were never oriented that way. Wide gaps here may overstate PSF's explanatory reach.

Pragmatic × Cell 1 vs. Pragmatic × Cell 2: The directional test. PSF predicts AI engagement gradually shifts pragmatists toward the opportunistic end because proxy metrics become so much easier to access that the threshold for "when stakes are high enough" keeps rising. If pragmatic Cell 1 practitioners show wider gaps than pragmatic Cell 2 practitioners, that shift is the mechanism in action.

Cell 3 (any work style): If Cell 3 shows uniformly low CAD across all three work styles (opportunistic, pragmatic, and systematic juniors all scoring CAD 1-2), the premature arrest claim holds because the absence of criteria vocabulary cuts across dispositions.
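One way to produce these cross-tabulations is to group practitioners by work style and cell and average the SAI-CAD gap in each group. The sketch below is illustrative Python under stated assumptions: the records are hypothetical (not study data), the field names are invented for the example, and the gap is taken as CAD minus SAI so that negative values favor the proxy side.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores for illustration only; not study data.
records = [
    {"id": "P01", "cell": 1, "style": "S", "sai": 4, "cad": 2},
    {"id": "P02", "cell": 1, "style": "S", "sai": 5, "cad": 2},
    {"id": "P03", "cell": 2, "style": "P", "sai": 3, "cad": 4},
    {"id": "P04", "cell": 1, "style": "P", "sai": 4, "cad": 3},
]

def crosstab_mean_gap(rows):
    """Mean CAD - SAI gap for each (work style, cell) combination."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["style"], row["cell"])].append(row["cad"] - row["sai"])
    return {key: mean(gaps) for key, gaps in groups.items()}

table = crosstab_mean_gap(records)
for (style, cell), gap in sorted(table.items()):
    print(f"style={style} cell={cell} mean gap={gap:+.1f}")
```

With small cells a mean hides the spread, so the per-practitioner rows should still be read alongside any grouped summary.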

Coding Workflow

1. First pass (within 48 hours): Listen/read full transcript. Write a 200-word memo capturing initial impressions, notable quotes, and provisional cell assignment.
2. Second pass (coding): Apply Pairing 1 and Pairing 3 rubrics. Record scores in the tracker. Flag difficult-to-assign cases.
3. Third pass (vocabulary coding): Extract quality utterances, classify per Pairing 2 taxonomy, calculate PR and CR. Batch after every 10 interviews.
4. Inter-rater check: [PLACEHOLDER: sample size, selection method, reliability statistic, and threshold to be determined with supervisor. Common starting points: 20% of transcripts, stratified across cells, Krippendorff's alpha or Cohen's kappa, but the right design depends on coding complexity and resource constraints.]
5. Pairing 4 coding: Apply after both frontline and boundary activity interviews complete for matched teams.
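The reliability statistic for the inter-rater check is still open. If Cohen's kappa is chosen for two coders, it could be computed as in the sketch below. This is a plain-Python illustration with hypothetical double-coded labels; it does not settle the sample size, selection method, or threshold, and Krippendorff's alpha would need a different calculation.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)

# Hypothetical double-coded utterances, for illustration only.
a = ["proxy", "proxy", "accountable", "ambiguous", "proxy", "accountable"]
b = ["proxy", "accountable", "accountable", "ambiguous", "proxy", "proxy"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```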

Visualization Specification

Layout
Dumbbell chart. One row per practitioner (P01, P02, etc.). Two dots per row. Bar connecting them. Shift value on right margin.
Colors
Blue: gap favors accountable criteria. Orange: gap favors proxy criteria. Grey: gap within ±1 point.
Cell dots
Legend markers: Cell 1, Cell 2, Cell 3, Boundary activity
Sorting
Default by gap magnitude. Toggle: by cell, seniority tier, or engagement depth.
Hover
Practitioner ID, cell, seniority, Dot A score, Dot B score, gap, one-line memo quote.
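The color and default-sort rules can be sketched as follows. This is a hedged Python illustration, not the chart component itself. It assumes Dot A is the proxy-side measure and Dot B the accountable-side measure, that the ±1 grey band is inclusive, and that the default sort puts the widest gaps first; none of these three points is stated explicitly in the specification. The function name `bar_color` is hypothetical.

```python
def bar_color(dot_a, dot_b):
    """Classify a bar by its shift (Dot B minus Dot A)."""
    shift = dot_b - dot_a
    if abs(shift) <= 1:  # assumption: the ±1 band includes exactly 1 point
        return "grey"
    # assumption: positive shift means the accountable side (Dot B) is higher
    return "blue" if shift > 0 else "orange"

# Hypothetical rows for illustration only.
rows = [
    {"id": "P01", "dotA": 4.0, "dotB": 2.0},
    {"id": "P02", "dotA": 3.0, "dotB": 4.0},
    {"id": "P03", "dotA": 2.0, "dotB": 5.0},
]
for row in rows:
    row["shift"] = row["dotB"] - row["dotA"]
    row["color"] = bar_color(row["dotA"], row["dotB"])

# Default sort: by gap magnitude, widest first (assumed direction).
ordered = sorted(rows, key=lambda r: abs(r["shift"]), reverse=True)
print([(r["id"], r["color"]) for r in ordered])
```

On a five-point integer scale an inclusive ±1 band turns every one-point gap grey, so whether the band is inclusive is worth fixing before any charts are produced.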

Output Format

Each pairing produces a JSON array for the dumbbell React component:

[ { "id": "P01", "cell": 1, "seniority": "senior", "dotA": 4.0, "dotB": 2.0, "shift": -2.0, "quote": "I trust it completely now. It just works." } ]
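A record in this shape could be generated from tracker scores as sketched below. This is a minimal Python sketch: the helper name `to_record` is hypothetical, and deriving `shift` as `dotB - dotA` is inferred from the single sample row above (2.0 - 4.0 = -2.0), not from a stated definition.

```python
import json

def to_record(pid, cell, seniority, dot_a, dot_b, quote):
    """Build one row in the sample's shape, deriving shift as dotB - dotA."""
    return {
        "id": pid,
        "cell": cell,
        "seniority": seniority,
        "dotA": float(dot_a),
        "dotB": float(dot_b),
        "shift": float(dot_b) - float(dot_a),  # assumed definition of shift
        "quote": quote,
    }

rows = [
    to_record("P01", 1, "senior", 4, 2, "I trust it completely now. It just works."),
]
payload = json.dumps(rows, indent=2)
print(payload)
```

Computing `shift` in one place avoids hand-entered gaps drifting out of sync with the two dot scores.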

Methodological Notes for the Paper

Pairings 1 and 3 use holistic rating scales. Defend with Saldaña (2016) on magnitude coding and Miles, Huberman and Saldaña (2020) on within-case and cross-case displays.

Pairing 2 uses frequency-based content analysis. Defend with Krippendorff (2018).

The dumbbell visualization is a form of Tufte's (1983) "paired comparison" display. It is a pattern-display device, not a statistical test. The visual precedes the argument.

Include the full coding rubric in an appendix or supplementary materials, following transparency norms in Pratt (2009) on writing up qualitative research for AMJ.