When Fluency Masks Failure: Examining a Multimodal Reasoning Breakdown in GPT-5

When Fluency Masks Failure: Examining a Multimodal Reasoning Breakdown in GPT-5

AI in healthcare does not fail loudly. It does not throw an error message. It does not say “I don’t know.” It speaks calmly. It explains confidently. It sounds clinical. And sometimes, it is completely wrong.

As part of our Human-in-the-Loop red-teaming evaluations at iCliniq, we test models in real consultation-style environments. Multi-turn conversations. Uploaded scans. Follow-up questions. Context that shifts.

This case looked simple:

One-sided tooth pain.One uploaded dental X-ray.One clear task: connect the history to the image and suggest next steps.

GPT-5 started strong. Then the reasoning started slipping.

When the Pain Was on the Left, but the Diagnosis Was on the Right

The patient clearly described pain on one side. GPT-5 confidently identified infection on the opposite side of the X-ray. A few turns later, it quietly flipped sides and rewrote the narrative to make the correction feel seamless.

There was no acknowledgement of mistake. No pause. No recalibration. Just smooth storytelling.

In medicine, laterality is not cosmetic. Right versus left determines procedures, prescriptions, and sometimes surgery. A system that drifts between sides without recognizing it is not reasoning. It is improvising. And improvisation in clinical care is dangerous.

Diagnosing an Infection That Did Not Exist

Then came the confident radiology. GPT-5 described a dark area near the root tip and labeled it as an infection spreading into bone. The X-ray did not clearly show root apices. Normal sinus anatomy was interpreted as an abscess. Bone density that appeared intact was described as compromised.

It was not cautious language. It was assertive. This is what makes multimodal errors risky. When text and image do not align, the model fills the gap with plausible fiction.

It did not say, “The image is unclear.” It did not say, “This cannot be confirmed.” It saw pathology where there was none and explained it fluently. That fluency is what makes the error believable

Urgent. Not Urgent. Urgent Again.

At one point, the model advised urgent dental care if symptoms worsened. Later, it reassured the patient that the issue was not immediately dangerous and could be monitored. The clinical stance shifted. The reasoning behind that shift did not.

This is multi-turn memory drift. As the conversation grows longer, earlier statements lose structural weight. The model adjusts tone and recommendation without fully reconciling what it previously claimed. For a patient, that creates confusion. For a healthcare system, that creates risk. Consistency is not optional in medical reasoning.

When Sounding Thorough Replaces Actually Deciding

Eventually, GPT-5 listed multiple possibilities. Reversible pulpitis. Cracked tooth. Early abscess formation. Each with different management. It looked comprehensive. But it did not narrow the diagnosis based on available evidence. It expanded uncertainty instead of resolving it.

This is the clarity trap. The model sounds responsible because it names options. But without prioritization or evidence-based narrowing, it is distributing risk rather than reasoning through it.

In real clinical care, doctors do not just list possibilities. They weigh them. That weighing never fully happened.

The Real Problem Is Not the Mistake

Mistakes are expected. Every system makes them. What matters is how the mistake unfolds. In this case, GPT-5:

  • Lost laterality.
  • Invented radiographic pathology.
  • Shifted urgency without clear justification.
  • Expanded differentials instead of refining them.

And it did all of it while sounding structured, empathetic, and clinically polished. It did not break in obvious ways. It drifted. That drift is what we measure in our HITL red-teaming evaluations.

Why This Changes How We Evaluate AI

We do not use trick prompts. We do not set traps. We simulate real consultations. Ambiguity. Imaging. Follow-ups. Emotional nuance. Cognitive load. Because the most dangerous AI failures are not wildly incorrect answers. They are technically plausible explanations that quietly mislead.

Until large language models can maintain stable cross-modal reasoning, consistent multi-turn memory, and evidence-anchored conclusions, their role in medicine must remain assistive. Not authoritative. AI can sound like a dentist. But until it reasons like one under pressure, it cannot be trusted like one.

Also Read