Failure File™

UnitedHealthcare AI Claim Denials: When the Algorithm Overrules the Doctor

#FailureFiles #AIGovernance #UnitedHealthcare #HealthcareAI #AlgorithmicHarm #HumanSignal

Listen on The AI Governance Briefing


The doctor said the patient needed more time. The algorithm said no.

According to federal lawsuits filed against UnitedHealthcare in 2023, the company deployed an AI model called nH Predict — developed by its subsidiary NaviHealth — to make post-acute care determinations for Medicare Advantage enrollees. Post-acute care is what happens after hospitalization: skilled nursing facility stays and rehabilitation. It is medically necessary, physician-certified, and, for elderly patients, often the difference between recovery and decline.

The lawsuits alleged that nH Predict was used to deny these claims at a rate dramatically higher than the historical standard — and that it routinely overrode physician determinations even when doctors had certified medical necessity. Patients were discharged prematurely. Some of them died.

This is not a technology failure story. The model did what it was designed and deployed to do. This is a governance failure story — about what happens when an AI system is placed in a life-or-death decision context without the validation, accountability, and human override structures that context demands.

Incident Summary

  • UnitedHealthcare's subsidiary NaviHealth developed nH Predict, an AI model trained on patient data to generate post-acute care length-of-stay predictions
  • The model was used in Medicare Advantage claim determinations — decisions about whether to authorize continued nursing facility or rehabilitation care for hospitalized elderly patients
  • According to federal lawsuits, roughly 90% of the model's denials that patients appealed were reversed on review — an error rate the complaint alleges the company knew about — while post-acute care denials rose well above historical levels after the model was deployed
  • Claims were denied even when treating physicians had certified ongoing medical necessity — the AI determination was allowed to override clinical judgment without adequate human review
  • Multiple patients who were denied coverage and discharged died within days or weeks of discharge
  • In November 2023, a class-action lawsuit was filed in the U.S. District Court for the District of Minnesota on behalf of patients and their estates
  • The case drew scrutiny from CMS, prompted a Senate investigation, and accelerated federal regulatory and legislative attention to algorithmic insurance decisions

The Allegation

"UnitedHealthcare used an algorithm it knew to have a 90% error rate to systematically deny care to elderly Medicare Advantage patients — overriding physician determinations in order to cut costs." — Federal class-action complaint, 2023

What nH Predict Actually Did

nH Predict was trained on aggregate patient data to predict how long patients typically stayed in post-acute care facilities. That is the function it was built for: pattern recognition across a population to produce a length-of-stay benchmark.

The governance failure was not the model's prediction. The governance failure was the deployment decision: using a population-average statistical model as the primary basis for overriding individual physician medical necessity determinations in a clinical context where getting it wrong means a patient goes home too early and dies.

These are categorically different functions. A model that predicts the average length of stay for a population of patients is not a model that has been validated to determine whether a specific patient with a specific clinical profile requires additional care. Deploying it as if it were the latter — without validation, without a documented accuracy threshold in the clinical decision context, and without a meaningful human override mechanism — is a governance failure of the first order.
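
What the missing override mechanism could look like in practice can be sketched in a few lines. The example below is illustrative only, with hypothetical field names and thresholds rather than UnitedHealthcare's actual pipeline, but it shows the control the lawsuits allege was absent: a model estimate that conflicts with a physician's certification routes to human clinical review instead of becoming an automated denial.

```python
from dataclasses import dataclass

# Illustrative sketch only: hypothetical types and thresholds, not the actual
# nH Predict pipeline. The model's length-of-stay estimate is treated as a
# signal, never as an automatic denial, when it conflicts with a physician's
# medical-necessity certification.

@dataclass
class CareRequest:
    patient_id: str
    physician_certified: bool        # treating physician certified continued care
    days_requested: int              # additional post-acute days requested
    model_predicted_days: int        # population-average length-of-stay estimate

def route_determination(req: CareRequest, disagreement_threshold_days: int = 0) -> str:
    """Return a routing decision, not a coverage decision."""
    model_disagrees = req.model_predicted_days + disagreement_threshold_days < req.days_requested
    if req.physician_certified and model_disagrees:
        # The gate the lawsuits allege was missing: conflict with clinical
        # judgment forces human adjudication instead of an automated denial.
        return "escalate_to_clinical_reviewer"
    if req.physician_certified:
        return "approve"
    return "standard_utilization_review"

print(route_determination(CareRequest("p-001", True, 21, model_predicted_days=14)))
# -> escalate_to_clinical_reviewer
```

The design point is that the model's output changes the routing of the decision, not the decision itself: disagreement with the treating physician is a trigger for human review, never a substitute for it.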

"The model was never the problem. The problem was placing it in a decision context it was never validated for — and removing the human override that should have caught its errors before patients were harmed." — Dr. Tuboise Floyd

"A population average is not a clinical determination. Deploying it as one — without validation — is a governance failure, not a model limitation."

Governance Control Analysis

The UnitedHealthcare failure operates at two governance levels simultaneously. The first is the deployment decision itself: placing an AI model in a high-stakes clinical context without validating it for that context. The second is the accountability structure that allowed AI denials to override physician determinations at scale without a meaningful check.

At the GOVERN level, the failure is structural. There was no accountability mechanism that required nH Predict's outputs to be reviewed against physician determinations before a denial was issued. No defined role owned the question: "Is this model producing medically defensible denials in this specific context?" No escalation path existed that a physician could trigger when the AI determination contradicted their clinical assessment. The accountability gap was not incidental — it was the mechanism by which the denial rate became possible.

At the MEASURE level, the failure is technical and ethical simultaneously. A model trained on historical population averages carries implicit bias toward the mean. Patients whose medical complexity, comorbidities, or social circumstances placed them above the statistical average for recovery time were systematically disadvantaged by a model that had never been assessed for performance across those populations. That is a MEASURE 2.11 failure — bias and fairness not evaluated before deployment in a clinical decision context — compounding a MEASURE 2.5 failure — model validity never demonstrated for individual clinical determinations.
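
A stylized example, with invented numbers rather than data from the case, makes the bias toward the mean concrete: a predictor that outputs the population average is accurate "on average" while systematically shortchanging every patient whose recovery runs longer than typical.

```python
# Stylized, invented numbers -- not data from the case. A predictor that
# returns the population mean looks reasonable in aggregate while
# under-allocating care for every medically complex patient.

typical_patients = [12, 14, 15, 16, 17]      # actual days of care needed
complex_patients = [28, 35, 40]              # comorbidities, slower recovery

population = typical_patients + complex_patients
mean_prediction = sum(population) / len(population)   # ~22.1 days, applied to everyone

shortfall = [need - mean_prediction for need in complex_patients]
print(f"Predicted for every patient: {mean_prediction:.1f} days")
print(f"Average unmet days for complex patients: {sum(shortfall) / len(shortfall):.1f}")
```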

TAIMScore™ Diagnostic

Scored against the TAIMScore™ framework, the UnitedHealthcare AI denial failure implicates four controls across GOVERN and MEASURE domains:

GOVERN 1.1: Accountability Structure

No defined accountability structure governed the conditions under which nH Predict's output could override a physician's medical necessity determination. An AI denial in a Medicare Advantage context is a consequential institutional decision — it requires an accountability structure that names who owns the denial, what standard they must apply, and what threshold of disagreement between the AI and the physician requires human adjudication. That structure did not exist.

GOVERN 2.2: Accountability & Training Controls

No training or accountability controls governed how staff were expected to use nH Predict in relation to clinical determinations. If the model's output was presented as a ceiling on authorized care without guidance on when and how to override it, the governance failure is organizational — not individual. GOVERN 2.2 requires that the people using AI in consequential decisions understand both the system's limitations and their authority to override it.

MEASURE 2.5: Validity in Deployment Context

nH Predict was trained on aggregate historical patient data. It was never demonstrated to be valid for individual clinical determinations in the post-acute care context. MEASURE 2.5 requires that a model be shown to perform reliably in the specific context where it will be used — not just in the training distribution. A population-average model used to make individual medical necessity decisions is operating outside its validated scope. That is a MEASURE 2.5 failure at the point of deployment decision.

MEASURE 2.11: Bias & Fairness Evaluation

No documented bias and fairness evaluation was conducted before nH Predict was used in clinical denial decisions. Models trained on historical insurance data embed the systematic inequities of that data — including disparities in care access, documentation practices, and treatment patterns across demographic groups. MEASURE 2.11 requires that fairness be evaluated before deployment in a context where the model's output affects access to care. It was not. Patients whose conditions deviated from the historical mean bore the cost of that omission.

Structural Lessons

The UnitedHealthcare case is the clearest available example of what the GASP™ framework identifies as Governance As a Structural Problem: the institution did not fail because it deployed bad technology. It failed because the governance structure surrounding the technology could not stop a population-average statistical model from being used to override individual physician determinations at scale, without accountability, in a context where the error rate was lethal.

The structural lesson is not about healthcare AI specifically. It is about the decision sequence every institution must execute before deploying AI in any high-stakes context. The GASP™ framework reduces that sequence to three questions: who owns the decision, what is the escalation path, and what accountability exists without the vendor. In the UnitedHealthcare case, the answers to all three questions were either absent or structured to insulate the denial from challenge.

"Most institutions will not fail because of a bad AI model. They will fail because of a broken governance structure around it." — Human Signal Driving Thesis

The second structural lesson is about the word "override." Any AI system deployed in a context where its output can override a human expert's determination — without a documented validation establishing that the model outperforms that expert in that specific context — is a governance failure waiting to be litigated. The physician's medical necessity determination is not an obstacle to efficient claims processing. It is the primary information input. An AI model that contradicts it, without being validated to do so, is not governance. It is automation of denial.

For federal agencies, CMS contractors, VA health systems, TRICARE, and any organization deploying AI in benefits determination, claims processing, or clinical decision support: the UnitedHealthcare lawsuits are not the end of this story. They are the beginning of the regulatory response. The governance structures that will be required are already visible in the complaints. Build them now or have them mandated later.

The Question Your Institution Must Answer

If your organization deploys any AI system that generates outputs used in decisions affecting individual access to services, benefits, care, or entitlements, answer this question before the next deployment review:

Has this model been validated — not trained, validated — to perform reliably for the specific decision context in which we are deploying it? And who owns the accountability when its output contradicts the expert judgment of the professional closest to the case?

If you cannot answer both parts of that question with a named person, a documented accuracy threshold, and an override protocol, you have a GOVERN 1.1 and MEASURE 2.5 gap. UnitedHealthcare's exposure is the cost of those gaps at scale. Your institution's version of that cost depends on what you deploy next.
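
One way to make that answer auditable is to record it as a structured artifact before deployment. The sketch below is hypothetical, not a prescribed TAIMScore™ format; its point is that the owner, the accuracy threshold, and the override protocol become named, documented fields a review can check, rather than verbal assurances.

```python
from dataclasses import dataclass

# Hypothetical structure, not a TAIMScore(TM) artifact: one way to make the
# deployment-review answer auditable. Each element is a documented commitment.

@dataclass
class DeploymentReviewRecord:
    system: str
    decision_context: str                  # the specific decision the output feeds
    validated_for_context: bool            # validated, not merely trained
    documented_accuracy_threshold: float   # minimum acceptable performance in that context
    accountable_owner: str                 # the named person who owns each denial
    override_protocol: str                 # what happens when output contradicts the expert

    def gaps(self) -> list[str]:
        findings = []
        if not self.validated_for_context:
            findings.append("MEASURE 2.5 gap: no context-specific validation")
        if not self.accountable_owner:
            findings.append("GOVERN 1.1 gap: no named accountable owner")
        if not self.override_protocol:
            findings.append("GOVERN 1.1 gap: no documented override protocol")
        return findings

print(DeploymentReviewRecord(
    system="post-acute length-of-stay model",
    decision_context="individual medical-necessity determinations",
    validated_for_context=False,
    documented_accuracy_threshold=0.0,
    accountable_owner="",
    override_protocol="",
).gaps())
```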


Apply the Framework

Failure Files™ Hub — All 12 cases scored against TAIMScore™ GOVERN, MAP, MEASURE, and MANAGE controls. The healthcare and federal sectors are represented across multiple cases.

→ All Failure Files™ → TAIMScore™ Assessor Workshop

GASP™ — Governance As a Structural Problem. The diagnostic framework for identifying where your institution's governance structure breaks down before AI deployment — not after a lawsuit forces the question.

→ Read GASP™ → The Workflow Thesis → ✦ Underwrite Human Signal

Related Failure Files™

  • Air Canada Chatbot — When Your AI Invents Policy
  • Zillow iBuying Collapse — $881M in Losses and a MAP Control That Was Never Built
  • The Anthropic Exodus and Governance Collapse