PSF THEORY HUB — ADJACENT DOMAIN

AI Diagnostics in India

Research landscape review of AI deployment in Indian diagnostic centers across radiology, pathology, and clinical laboratory domains. Tracking open questions that exhibit PSF-relevant dynamics: proxy metric substitution, evaluative capacity erosion, and self-concealing degradation in a regulated, life-critical context structurally different from the software development mainline.

RESEARCH INTEREST — NOT COMMITTED

India deploys AI diagnostic tools at population scale across TB screening, diabetic retinopathy detection, and breast cancer triage. The strongest evidence exists in radiology (Qure.ai's qXR across 2,600+ sites, Google/ARDA across 600,000+ patients in Tamil Nadu). Pathology and clinical lab AI remain nearly unstudied, with SigTuple's blood smear analyzer representing essentially the entire India-specific evidence base. The regulatory framework is evolving rapidly (CDSCO draft SaMD guidance October 2025, SAHI national strategy February 2026) but critical gaps persist. The most consequential finding: zero prospective studies track whether AI deployment actually improves patient outcomes in Indian diagnostic settings.

RADIOLOGY / MEDICAL IMAGING

Mature deployment, concentrated evidence

TB screening via chest X-ray has the deepest evidence base. Qure.ai's qXR achieved 0.93 sensitivity for culture-confirmed TB in Chhattisgarh tribal populations, meeting WHO's target product profile. Real-world Nagpur deployment showed 15.8% TB yield increase attributable to AI alone. Diabetic retinopathy screening shows 600,000+ patient-scale validation (Google/ARDA) with 97.0% sensitivity, but a sobering Punjab multi-algorithm evaluation found sensitivity varied from 59.7% to 74% across vendors. Niramai's thermal breast cancer screening reached 15,069 women across 183 sites.

Strongest evidence base · Population-scale deployment · Vendor performance heterogeneity

PATHOLOGY / HISTOPATHOLOGY

Striking evidence deficit

India has approximately 5,500 qualified pathologists for roughly 300,000 labs. Despite this severe shortage, the entire peer-reviewed evidence base consists essentially of SigTuple validation studies (AI100 robotic microscope, FDA 510(k) cleared, 81.8% sensitivity for immature granulocyte detection, 100% for blast detection). No large-scale prospective studies of AI-assisted histopathology in Indian diagnostic centers were identified. Most labs lack whole slide scanners and digital infrastructure.

~5,500 pathologists for 300,000 labs · SigTuple only substantive player · Infrastructure-constrained

CLINICAL LABORATORY

Essentially unstudied

Beyond SigTuple's blood smear work, no India-specific peer-reviewed studies of AI deployment in clinical biochemistry, microbiology, or hematology labs were identified. Major chains (Dr. Lal PathLabs, SRL/Agilus, Thyrocare) are investing in AI-integrated LIMS platforms, but published validation data is absent. This gap is especially consequential because clinical lab tests drive the majority of diagnostic decisions in Indian healthcare.

No peer-reviewed deployment studies · Major chains investing · Majority of diagnostic decisions

WORKFORCE SHORTAGE

AI as necessity, not augmentation

India has roughly 1 radiologist per 100,000 people (versus 1 per 10,000 in the US), 5,500 pathologists for 300,000 labs, and bears 28% of the global TB burden. Nearly half of trained radiologists emigrate. CT/MRI tests per 1,000 population stand at 36 in India versus 407 in the US. This shortage means AI is not deployed as a decision-support augmentation layer but as a capacity substitute, fundamentally changing the evaluative relationship between human and tool.

REGULATORY LANDSCAPE

Principles-based, not yet prescriptive

CDSCO classifies AI diagnostic software using a four-tier risk system (Class A through D). The October 2025 Draft Guidance on Medical Device Software introduces the Algorithm Change Protocol concept but remains in draft. ICMR Ethical Guidelines (March 2023) and the SAHI framework (February 2026, 32 recommendations) provide normative structure but lack legal force. No public registry of approved AI devices, no NABL standards for AI-augmented labs, no AI-specific liability framework, no court precedents, and no mandatory post-market surveillance exist.

INFRASTRUCTURE GAP

The regions needing AI most can support it least

75% of India's medical infrastructure serves the 27% living in urban areas. Fewer than 10% of hospitals use electronic health records. A field study documented 112 hours of work lost per data entry center due to power interruptions alone. DICOM/HL7/FHIR standards are recommended but adoption is inconsistent. Cloud-based AI requiring stable connectivity faces fundamental deployment barriers in rural and semi-urban settings where diagnostic need is greatest.

The open questions below exhibit structural homology with the PSF mechanism: proxy metrics displacing accountable criteria while eroding the evaluative capacity needed to detect the substitution. Each question is annotated with its PSF mapping and what it would take to study empirically. Ordered from most to least PSF-resonant.

PSF INSTANCE — PROXY SUBSTITUTION

1. The outcome measurement gap

No prospective, controlled study tracks whether AI deployment in Indian diagnostic settings improves patient outcomes. The metrics being celebrated (turnaround time, throughput, scans processed, revenue growth) are legible proxies. The accountable criterion (did the patient get a better diagnosis, and did that lead to better health?) is harder to measure, slower to materialize, and structurally invisible in the current ecosystem. Nobody is gaming the system. Vendors sincerely believe faster TAT equals better care. Government agencies sincerely report consultation volumes as evidence of transformation.

PSF mapping: The proxy substitution operates through genuine conviction, not strategic evasion. The proxy metrics look like progress because they are progress on the dimensions being measured, while the dimensions that would reveal degradation go untracked. The substitution is self-concealing because the measurement framework itself has been constituted by the engagement.

Empirical hook: Compare the metrics diagnostic centers report to investors and regulators (TAT, volume, revenue) against the metrics that would reveal diagnostic quality trends (false negative rates tracked longitudinally, patient outcomes at 6/12 months, concordance with gold-standard confirmation). The gap between what is measured and what matters is directly observable.
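The comparison this hook describes can be sketched as a short script: tally the celebrated proxy (volume) per quarter alongside the untracked criterion (false negative rate against gold-standard confirmation). All field names (quarter, ai_negative, gold_positive) are hypothetical illustrations, not a schema from any study cited here.

```python
from collections import defaultdict

def quarterly_metrics(cases):
    """Aggregate per-quarter proxy and quality metrics.

    Each case is a dict with hypothetical fields:
      quarter        e.g. "2025Q1"
      ai_negative    True if the AI-assisted read reported no disease
      gold_positive  True if a later gold standard (biopsy, culture)
                     confirmed disease
    """
    by_q = defaultdict(lambda: {"volume": 0, "fn": 0, "gold_pos": 0})
    for c in cases:
        q = by_q[c["quarter"]]
        q["volume"] += 1                 # the proxy everyone reports
        if c["gold_positive"]:
            q["gold_pos"] += 1
            if c["ai_negative"]:
                q["fn"] += 1             # missed case: the untracked criterion
    return {
        quarter: {
            "volume": v["volume"],
            "fn_rate": v["fn"] / v["gold_pos"] if v["gold_pos"] else None,
        }
        for quarter, v in by_q.items()
    }
```

A center whose volume rises quarter over quarter while fn_rate is simply never computed is the gap in miniature.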

Core PSF instance · Sincere belief, not gaming · Self-concealing · Externally verifiable ground truth

PSF INSTANCE — EVALUATIVE CAPACITY EROSION

2. Deskilling and automation bias

A 2025 multicenter colonoscopy study (Lancet) found adenoma detection rates dropped from 28.4% to 22.4% when endoscopists worked without AI after routine AI exposure. A mammography automation bias study (Radiology, 2023) showed inexperienced readers' accuracy plummeting from 79.7% to 19.8% when AI was incorrect. Over 30% of pathology study participants reversed correct initial diagnoses when exposed to incorrect AI suggestions. No India-specific empirical study on deskilling or automation bias among Indian diagnosticians exists.

PSF mapping: Direct demonstration of the core PSF prediction. Engagement with AI erodes the evaluative capacity needed to detect the substitution. India's context (1 radiologist per 100,000, AI deployed as necessity) means the "human-in-the-loop" model faces maximum stress. Conditions for developing independent diagnostic judgment (repeated practice, feedback, effortful interpretation) are precisely the conditions being eliminated. Capacity to catch AI errors degrades in tandem with increasing dependence on AI.

Empirical hook: Compare diagnostic accuracy of Indian radiologists/pathologists with and without AI assistance, stratified by length of AI exposure. Interview diagnosticians about how their interpretive practices have changed. Track whether training programs are adjusting curricula in response to AI integration.
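A minimal sketch of the stratification step, assuming each record pairs a diagnostician's months of prior AI exposure with whether an unaided read was correct; the band edges are illustrative choices, not values from any cited study. Declining accuracy across exposure bands would be the deskilling signal.

```python
def accuracy_by_exposure(readings, bands=((0, 6), (6, 24), (24, None))):
    """Unaided diagnostic accuracy stratified by months of prior AI exposure.

    readings: iterable of (months_of_ai_exposure, correct) for reads
    performed WITHOUT AI assistance. Band edges are illustrative.
    """
    out = {}
    for lo, hi in bands:
        in_band = [ok for m, ok in readings
                   if m >= lo and (hi is None or m < hi)]
        label = f"{lo}+mo" if hi is None else f"{lo}-{hi}mo"
        out[label] = sum(in_band) / len(in_band) if in_band else None
    return out
```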

Core PSF instance · International evidence, no India data · Necessity-based deployment amplifies risk

PSF INSTANCE — SELF-CONCEALING DEGRADATION

3. Feedback loop contamination

5C Network's quality model feeds every expert correction back into the AI, at 10,000+ corrections daily. This appears virtuous. But if expert corrections are themselves shaped by automation bias (the expert was primed by the AI's initial read before "correcting" it), the feedback loop amplifies rather than corrects errors. Ground truth becomes contaminated. A landmark MIS Quarterly study found that five AI tools at a major US hospital all reported high performance, yet none met expectations because the ground truth labels were unreliable.

PSF mapping: Textbook self-concealing degradation. The quality assurance mechanism (human correction feeding back into the model) is itself compromised by the automation bias the system is supposed to prevent. Internal metrics show improvement while the ground truth against which those metrics are measured has been contaminated. The degradation looks like a well-functioning system.

Empirical hook: Audit the correction pipeline at a high-volume AI radiology platform. Compare corrections made by radiologists who have seen the AI pre-read versus blind independent reads. Measure whether the correction rate converges toward agreement with AI over time, independent of diagnostic accuracy improvements.
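The convergence measurement can be sketched as follows; the record layout (a month index plus a flag for whether the expert's "correction" agreed with the AI pre-read) is assumed for illustration. A rising agreement trend that blind independent reads do not mirror would be the contamination signature.

```python
from collections import defaultdict

def agreement_trend(corrections):
    """AI-agreement rate of expert corrections, per month index.

    corrections: iterable of (month_index, expert_agrees_with_ai).
    Record layout is hypothetical; a real audit would join this
    against blind-read accuracy over the same months.
    """
    tally = defaultdict(lambda: [0, 0])   # month -> [agreements, total]
    for month, agrees in corrections:
        tally[month][1] += 1
        if agrees:
            tally[month][0] += 1
    return {m: a / n for m, (a, n) in sorted(tally.items())}
```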

Core PSF instance · Ground truth contamination · Looks like improvement

PSF DYNAMIC — CONCEALED PROXY SUBSTITUTION

4. Dataset bias as invisible accuracy gap

AI tools trained on Western populations report high sensitivity and specificity that become the basis for deployment in India. The form (AUC figures, FDA clearance) is legible and credible. The function (does this tool work for a 45-year-old woman from a tribal community in Chhattisgarh with dense breast tissue and comorbid malnutrition?) is structurally unverifiable. No published study directly compares performance of Western-trained diagnostic AI on Indian versus Western populations head-to-head. Skin lesion classification algorithms show approximately 50% lower accuracy on darker-skinned patients.

PSF mapping: The proxy is "accuracy validated in controlled settings" and the accountable criterion is "accuracy for the actual population being served." The validated accuracy figure is a fertile form: it satisfies regulatory and procurement requirements. But the function in the Indian deployment context may diverge substantially. The divergence is unmeasured, so it cannot trigger correction. Form/function gap (Faulkner and Runde) operating at the evidence level.

Structural PSF dynamic · Form/function gap (Faulkner and Runde) · Unmeasured divergence

PSF DYNAMIC — INSTITUTIONAL DETECTION FAILURE

5. Regulatory scaffolding gap

No NABL standards for AI-augmented laboratories. No public registry of approved AI devices. No post-market surveillance for AI. No AI-specific liability framework. SAHI recommends "human-in-the-loop" but provides no mechanism to verify the loop is substantive rather than nominal. Accountable criteria require institutional scaffolding: someone has to define what counts as quality, measure it, and create consequences for degradation.

PSF mapping: The regulatory framework is fertile in form (principles, recommendations, strategic documents) but thin in function (binding standards, enforcement mechanisms, public registries). The form of governance looks like governance. The function of governance (detecting and correcting degradation) is not yet operational. Proxy substitution proceeds unchecked by institutional correction, not through regulatory negligence but because the functional infrastructure does not yet exist. Institutional logics (Thornton, Ocasio, and Lounsbury) become analytically productive here.

Structural PSF dynamic · Institutional logics · Form without function

LOWER PSF RESONANCE — COVERAGE GAP

6. Pathology/clinical lab evidence deficit

Primarily a coverage gap rather than a proxy substitution dynamic, but it creates the conditions for proxy substitution to emerge unchecked. When an entire domain (100,000+ labs, majority of diagnostic decisions) lacks evidence infrastructure, there is no measurement framework against which degradation could be detected even in principle. The gap is pre-PSF: the conditions for the mechanism to operate without detection are established before the mechanism is activated.

Pre-condition, not instance · Creates space for PSF

LOWER PSF RESONANCE — EQUITY

7. Health equity and the urban-rural divide

Whether AI deployment reduces or reproduces diagnostic inequality operates at a different analytical level than PSF's organizational-evaluative-capacity mechanism. It becomes PSF-relevant specifically where urban AI-augmented centers report impressive throughput metrics while rural facilities lack basic infrastructure, and aggregate numbers conceal the distributional pattern. The proxy (national screening volume) can improve while the accountable criterion (equitable access to accurate diagnosis) degrades.

Different analytical level · PSF-adjacent when aggregated

Why diagnostics matters for PSF as theory. If PSF validates only in software development, reviewers can ask whether the mechanism is domain-specific. Diagnostics differs on nearly every observable dimension (regulated versus unregulated, life-critical versus commercial, image interpretation versus text generation, professional licensing versus informal credentialing) while plausibly exhibiting the same deep structure. Diagnostics also offers externally verifiable ground truth (biopsy results, culture confirmations, patient outcomes) against which proxy metrics can be compared. The PSF prediction is that despite these structural differences, the same mechanism operates. If that holds across both domains, the theoretical contribution is substantially stronger. If it does not hold, the boundary conditions are themselves theoretically productive.

Selected sources organized by domain and type. Captures the evidence most directly informing the PSF-relevant open questions. Full source list is in the research report.

Source · Finding · PSF Relevance

Qure.ai / Chhattisgarh tribal TB study (Open Forum Infectious Diseases, 2025)
  Finding: qXR sensitivity 0.93, specificity 0.75 for culture-confirmed TB, meeting WHO TPP
  PSF relevance: Validates accuracy in Indian population, but outcome data absent

Nagpur private labs implementation (PLOS Digital Health, 2023)
  Finding: 15.8% TB yield increase attributable to AI, from cases missed by radiologists
  PSF relevance: AI catching what humans miss, but no longitudinal tracking

Google/ARDA Tamil Nadu (JAMA Network Open, 2025)
  Finding: 600,000+ patients, 97.0% sensitivity, 96.4% specificity for severe DR
  PSF relevance: Largest post-marketing AI device evaluation globally; still process metrics

Punjab multi-algorithm DR evaluation (JMIR Medical Informatics, 2025)
  Finding: Sensitivity ranged 59.7% to 74% across five AI systems
  PSF relevance: Performance heterogeneity concealed by aggregate accuracy claims

Niramai Thermalytix (npj Digital Medicine, 2024)
  Finding: 15,069 women, 27 cancers, 81.8% biopsy PPV
  PSF relevance: Radiation-free alternative for mammography-absent settings

12 AI solutions independent evaluation (Scientific Reports, 2021)
  Finding: Qure.ai and Delft both AUC 0.82, outperformed intermediate human readers
  PSF relevance: Independent evaluation, cross-sectional only
Source · Finding · PSF Relevance

Multicenter colonoscopy study (Lancet, 2025)
  Finding: Adenoma detection dropped 28.4% to 22.4% when AI removed after routine exposure
  PSF relevance: Direct evidence of evaluative capacity erosion

Mammography automation bias (Radiology, 2023)
  Finding: Inexperienced readers fell from 79.7% to 19.8% accuracy when AI was incorrect; +12% false positive recalls
  PSF relevance: Automation bias strongest in less experienced clinicians

Pathology diagnostic reversal studies (multiple sources)
  Finding: 30%+ reversed correct diagnoses when exposed to incorrect AI suggestions
  PSF relevance: AI priming overrides independent clinical judgment

Deskilling scoping review (Artificial Intelligence in Medicine, 2026)
  Finding: Skill atrophy, automation complacency, reduced deliberate practice identified
  PSF relevance: Comprehensive review, no India-specific studies exist

Hospital AI evaluation (MIS Quarterly)
  Finding: 5 tools reported high metrics; none met expectations due to unreliable ground truth
  PSF relevance: Ground truth contamination undermines evaluation framework
Source · Status

CDSCO Draft Guidance on Medical Device Software: October 2025. Introduces Algorithm Change Protocol. Consultation complete, finalization pending.
ICMR Ethical Guidelines for AI in Biomedical Research: March 2023. 10 principles. Normative, not legally binding.
SAHI Framework + BODH Benchmarking: February 2026. 32 recommendations. Risk classification, training data standards, explainability.
Digital Personal Data Protection Act, 2023: No special category for health data. Proposed DISHA Act never enacted.
IRIA radiologist survey (PMC, 2025): 95.3% want more AI education. 27.9% fear job displacement. 20% national digital health literacy.
Company · Domain · Scale

Qure.ai · Chest X-ray, head CT, lung CT · $121-156M raised. 2,600+ sites, 100+ countries, 18 FDA clearances.
SigTuple · Blood smear microscopy, pathology · FDA 510(k) cleared AI100. AS76 launched Feb 2026.
Remidio · Diabetic retinopathy, ophthalmology · CDSCO approved. Kerala Nayanamritham 2.0. 250,000+ patients.
Wadhwani AI · DR, TB (cough audio), skin disease · Official MoH AI partner since 2022. USAID funded.
5C Network · AI-native teleradiology · 10,000+ scans/day, 1,500+ facilities.
DeepTek · TB screening (Genki) · 1,800+ hospitals, 21 states, 1M+ screened.
Niramai · Breast cancer (thermal imaging) · 183 locations in Punjab. Radiation-free, portable.

The outcome-measurement gap identified in Indian AI diagnostics is not India-specific. It is a structural failure of the entire global AI diagnostics field. This tab documents the global evidence desert and the CAD historical precedent, both of which strengthen the PSF case by showing that proxy substitution in diagnostics operates at the level of the field itself, not just within individual organizations or national contexts.

GLOBAL FINDING

One RCT, one mortality benefit: the full extent of hard evidence

Across more than 1,000 FDA-cleared AI medical devices, fewer than 1% have any evidence of improved patient outcomes. Only one randomized controlled trial in the history of medicine has demonstrated that an AI diagnostic tool reduces patient mortality: Lin et al. (2024, Nature Medicine), a pragmatic RCT of 15,965 hospitalized patients in Taiwan where an AI-ECG system predicted mortality risk and alerted clinicians. 90-day all-cause mortality fell from 4.3% to 3.6% (HR 0.83). Eric Topol responded: this is the first time AI has been shown to be lifesaving.

A handful of prospective (non-randomized) sepsis prediction studies showed mortality improvements: COMPOSER at UC San Diego Health (17% relative decrease), TREWS at Johns Hopkins (faster antibiotics, reduced death), InSight across 9 US hospitals (39.5% reduction, but before-and-after design with significant confounding). Everything else in the AI diagnostics literature stops at measuring accuracy.

1 RCT with mortality benefit · <1% of FDA devices have outcome data · Global, not India-specific

SYSTEMATIC REVIEW EVIDENCE

A field stuck at surrogate endpoints

Han et al. (2024, Lancet Digital Health): reviewed 86 RCTs of AI in clinical practice. 81% reported positive primary endpoints, but these were primarily diagnostic yield or performance, not patient outcomes.

Plana et al. (2022, JMIR): 46% of 39 AI RCTs used diagnostic accuracy as their primary endpoint. Not a single trial measured patient-reported outcomes.

Zhou et al. (2021, npj Digital Medicine): among 65 RCTs of AI prediction tools, nearly 40% showed no clinical benefit compared to standard care, despite median development AUROCs of 0.81. High accuracy did not translate to clinical impact.

JACC Advances (2025): systematic review of cardiovascular AI RCTs. Zero of 11 trials measured patient-reported outcomes.

Macheka et al. (2024, BMJ Oncology): a review of prospective AI studies in cancer care concluded that most failed to translate measured AI efficacy into beneficial clinical outcomes, a gap the literature calls the "AI chasm."

PSF reading: the systematic review literature demonstrates that the proxy substitution (accuracy metrics displacing outcome measurement) is not a local oversight but a field-level structural pattern. The Fryback-Thornbury hierarchy (1991) defines six levels of diagnostic evidence, from technical efficacy through patient outcome efficacy. Almost all AI diagnostic research operates at Levels 1 and 2. The critical insight: efficacy at a lower level does not guarantee efficacy at a higher level. Better sensitivity does not necessarily mean fewer deaths.

Multiple systematic reviews · Fryback-Thornbury hierarchy · Accuracy ≠ outcomes

SCREENING PARADOX

The colonoscopy paradox: more detection, same outcomes

AI-assisted colonoscopy has generated more RCTs than any other AI diagnostic application (28+ trials, 23,000+ patients). Meta-analyses consistently show AI increases adenoma detection by approximately 20% and reduces miss rates by 55%. These numbers sound impressive until one examines what is actually being detected. The additional adenomas are predominantly diminutive polyps (≤5mm) with minimal malignant potential. The CADILLAC trial (Spain, 2023, Annals of Internal Medicine, 3,213 patients, 6 centers) found no difference in advanced neoplasia detection (34.8% AI vs. 34.6% control). AI increased non-neoplastic polyp removal by 39%, meaning more unnecessary procedures. Not a single AI colonoscopy trial has measured colorectal cancer incidence or mortality.

PSF reading: this is proxy substitution made visible in clinical data. The proxy (adenoma detection rate) improves. The accountable criterion (colorectal cancer prevented or caught earlier) does not change. The improved proxy actively generates costs (unnecessary procedures) while producing no measurable patient benefit. The detection looks like progress because the metric being tracked goes up.

28+ RCTs · More detection ≠ better outcomes · Unnecessary procedures increased

POSITIVE SIGNAL

MASAI trial: the closest thing to outcome evidence in screening

The MASAI trial in Sweden (105,934 women, Lancet 2026) is the largest RCT of AI in cancer screening. Final results: AI-supported mammography produced 12% fewer interval cancers (1.55 vs. 1.76 per 1,000) and 27% fewer interval cancers with unfavorable biological characteristics. Cancer detection rose 29%, mostly small, node-negative tumors. This is the closest any AI screening study has come to demonstrating a meaningful clinical benefit, since interval cancers are a validated surrogate for screening-related mortality. But actual mortality data remain unavailable and require years of follow-up.

Strongest positive evidence globally · Surrogate endpoint, not mortality · Sweden

Computer-Aided Detection (CAD) in mammography is arguably the most complete historical instance of proxy seduction in medical diagnostics. The full arc played out over roughly 15 years, and the ending is precisely what PSF would predict. Modern deep learning AI is substantially more capable than first-generation CAD, but the mechanism by which adoption outran evidence is the same mechanism PSF theorizes.

PHASE 1 — ADOPTION ON PROXY EVIDENCE

Sensitivity in controlled settings became the basis for deployment

FDA cleared the first CAD system (R2 Technology's ImageChecker) in 1998. In laboratory reader studies, CAD flagged lesions that radiologists missed. The promise: a second pair of eyes that never gets tired. In 2002, Medicare began reimbursing facilities ~$12 per CAD-processed mammogram. That reimbursement decision was the inflection point. Within a few years, over 80% of US screening mammograms were processed with CAD. The adoption rationale was built on the same logic driving current AI deployment: the technology detected things humans missed in controlled studies, therefore deploying it at scale would improve patient outcomes. The intermediate step (does flagging more regions lead to earlier cancer detection and fewer deaths?) was assumed, not tested.

1998: FDA clearance · 2002: Medicare reimbursement · 80%+ US adoption

PHASE 2 — PROXY SUBSTITUTION ENTRENCHED

Volume metrics replaced outcome measurement

The per-scan reimbursement created a structural incentive to optimize against the proxy. The financial question became "are we using the technology?" rather than "does this improve patient outcomes?" because using the technology was what got reimbursed. Facilities reported higher sensitivity numbers. Vendors marketed improved detection rates. All technically true and practically meaningless. Radiologists developed workflows where they checked CAD marks rather than performing independent interpretation first. The technology that was supposed to be a "second reader" became a "first reader" that shaped how the human approached the image. Whether this degraded independent interpretive capacity was never measured during the period of widespread use.

PSF mapping: the per-scan payment model in Indian AI diagnostics (Qure.ai, 5C Network paid per scan processed) creates the same structural dynamic. Volume directly drives revenue. Quality degradation imposes no direct cost on vendors unless it triggers visible malpractice claims.

Per-scan reimbursement = volume incentive · Deskilling unmeasured during adoption · Parallels Indian pricing model

PHASE 3 — OUTCOME EVIDENCE ARRIVED LATE

No benefit, increased harm, after the ecosystem had organized around the technology

Fenton et al. (2007, NEJM): studied 222,135 mammograms across 43 facilities. CAD was associated with increased recall rates (more women called back) but no improvement in cancer detection. Net effect: decreased diagnostic accuracy. More false positives, same true positives.

Lehman et al. (2015, JNCI): examined 300,000+ mammograms. CAD was not associated with improved detection of invasive breast cancer, increased early-stage diagnosis, or smaller tumor size. The technology adopted by most US screening facilities showed no patient benefit.

Comprehensive scoping review: across all large-scale community practice studies, CAD showed no evidence of improving cancer detection, stage at diagnosis, or survival.

PSF mapping: the evidence that CAD did not work arrived after the technology was already embedded in clinical practice, reimbursement structures, and institutional workflows. Disentangling an ineffective technology from an ecosystem organized around it proved enormously difficult. The self-concealing quality of the problem is temporal as well as structural: by the time outcome evidence arrives, institutional commitments are already made. Over 1,000 AI devices are now FDA-cleared. If outcome evidence, when it arrives, shows some of these tools do not improve outcomes (or degrade them through automation bias and false positive inflation), the disentanglement problem will be vastly larger.

NEJM: no benefit · JNCI: no benefit · Evidence arrived post-entrenchment

WHY CAD MATTERS FOR PSF

A completed cycle of proxy seduction

CAD demonstrates the full arc that PSF theorizes. A technology was validated on surrogate metrics (sensitivity in controlled reader studies). It was adopted at scale on the assumption that accuracy translates to outcomes. Financial incentives (per-scan reimbursement) optimized for volume. Deskilling dynamics (CAD as first reader shaping human interpretation) went unmeasured. And when outcome evidence finally arrived, it showed no patient benefit, but the ecosystem had already organized around the technology.

Three PSF-specific lessons emerge. First, the proxy substitution operated through sincere belief, not gaming: everyone genuinely thought better detection meant better outcomes. Second, the measurement framework itself prevented early detection of the problem: because the field measured sensitivity rather than outcomes, the absence of benefit was invisible until someone finally measured outcomes. Third, the temporal dimension of self-concealing degradation matters: by the time evidence catches up with adoption, the costs of disentanglement exceed the costs of continuing with an ineffective technology.

Modern deep learning AI is genuinely more capable than first-generation CAD. The MASAI trial showing 12% fewer interval cancers is a qualitatively different result from anything CAD produced. The analogy is not "AI will fail the way CAD failed." The analogy is: CAD demonstrates that a technology can be widely adopted, reimbursed, and celebrated on the basis of surrogate metrics, and the outcome evidence can arrive years later showing no benefit. The mechanism by which that happens is the mechanism PSF theorizes.

Empirical hook: asking an Indian radiologist over 40 "how is current AI deployment different from what happened with CAD?" would be a productive interview question. The answer reveals whether the practitioner has considered the outcome-measurement gap or treats accuracy as self-evidently sufficient.

Completed PSF cycle · Sincere belief, not gaming · Temporal self-concealment · Interview question

STRUCTURAL CAUSE

Regulatory pathways accept accuracy as sufficient

97% of FDA-cleared AI/ML medical devices enter via the 510(k) pathway, requiring "substantial equivalence" to a predicate device, not evidence of clinical benefit. Lin et al. (2025, JAMA Health Forum) examined 691 FDA-cleared devices: only 3 (<1%) reported patient outcomes, only 6 (1.6%) cited RCT data, and 46.7% failed to report their study design. Sivakumar et al. (2025, JAMA Network Open) found that among 717 radiology AI devices, only 5% underwent prospective testing. "Predicate creep" compounds the problem: devices claim equivalence to earlier devices that themselves lack clinical evidence, creating approval chains disconnected from patient outcomes.

The UK's NICE Evidence Standards Framework requires demonstration of relative effectiveness for adoption recommendations, representing the most outcomes-oriented approach. The EU AI Act (full enforcement August 2027) classifies most healthcare AI as high-risk with mandatory bias mitigation. The FDA issued a request for public comment on real-world AI device evaluation in September 2025, suggesting possible evolution. India's CDSCO draft SaMD guidance exists in principles only.

PSF reading: the regulatory architecture is itself a site of proxy substitution. The proxy ("substantial equivalence," "analytical validity," "technical performance") has been institutionally accepted as sufficient evidence that a diagnostic tool benefits patients. The accountable criterion (clinical utility, patient outcomes) is not required. This is not a gap in the system. It is the system. The regulatory framework constitutes the conditions under which proxy substitution operates at field level.

97% via 510(k) · <1% report outcomes · Predicate creep · Regulation as proxy site

Source · Key Argument

Shah, Milstein & Bagley (JAMA, 2019)
  Evidence is lacking that deployment of ML models has improved care and patient outcomes.

Aristidou, Jena & Topol (The Lancet, 2022)
  A "chasm" exists between AI development and clinical implementation. Few AI tools have been implemented in health systems despite commercialization.

Marwaha & Kvedar (npj Digital Medicine, 2022)
  Robust predictive utility does not guarantee clinical impact at the bedside.

Park & Moons (Korean Journal of Radiology, 2024)
  Most AI healthcare applications have had no outcome evaluation or had one with inadequate study design.

Macheka et al. (BMJ Oncology, 2024)
  Most studies fail to translate measured AI efficacy into beneficial clinical outcomes (the "AI chasm").

Lancet Digital Health (Editorial, 2020)
  Outcome assessment in AI imaging is commonly defined by lesion detection while ignoring biological aggressiveness. Better sensitivity might come at the cost of increased false positives.

Park et al. (Korean Journal of Radiology, 2021)
  Device approval of AI is typically granted with proof of technical accuracy and does not directly indicate whether AI is beneficial for patient care.

This domain is a research interest, not a committed empirical study. It sits in the PhD architecture as a potential cross-domain validation of PSF, subordinate to the software development mainline. The interest-versus-commitment question will be revisited after software development interviews are designed and piloting has begun. Potential collaboration with Neeti Gupta, who has contacts and interest in the Indian diagnostics space.

MARCH 2026
Landscape review completed
Broad sweep across radiology, pathology, clinical lab. Academic literature, policy/regulatory landscape, startup ecosystem, practitioner accounts. PSF-relevant open questions identified.
ONGOING
Literature monitoring
Track SAHI framework finalization, CDSCO draft guidance progression, new deployment studies. Watch for India-specific deskilling or automation bias evidence. No active research commitment.
WHEN SOFTWARE INTERVIEWS BEGIN
Revisit interest-versus-commitment with Neeti
Two inputs needed: whether PSF constructs operationalize in interview data, and whether Neeti's trajectory has moved toward diagnostics. The conversation will be more productive then.
IF COMMITTED
5-8 pilot interviews with Indian diagnosticians
Diagnostic center managers, radiologists, pathologists. Alongside or shortly after main software fieldwork. Generates preliminary PSF signal and a research design document for the thesis.
POST-PHD
Full diagnostics empirical paper
Co-authored with domain expert(s). Cross-domain PSF validation. Boundary conditions (where PSF does not operate as predicted) are themselves theoretically productive.