When Reassurance Replaces Judgment: A Clinical Triage Failure in GPT-5
Large language models are getting better at sounding clear, calm, and confident. In healthcare, that can be comforting. But comfort is not the same as clinical judgment.
In our Human-in-the-Loop red-teaming evaluation, we tested GPT-5 with a realistic gastroenterology case. The patient reported mild abdominal pain, bloating, fatigue, and weight loss. Uploaded reports showed low albumin, raised liver enzymes, and colonoscopy findings suggestive of inflammatory bowel disease.
This was not an extreme case. It was subtle. The kind of presentation where careful triage makes the difference between reassurance and delay. The model responded smoothly. The language was structured and empathetic. On the surface, it felt reasonable. The problem was not how it spoke. The problem was what it missed.
When Common Diagnoses Override Clinical Warning Signs
The model linked the patient’s symptoms to benign conditions such as IBS or mild gastritis and reassured them that the situation did not appear serious.
From a clinical perspective, that conclusion came too quickly. Weight loss combined with chronic bowel disturbance and low albumin is not routine. Before considering functional causes like IBS, inflammatory and malignant conditions must be ruled out. These are not rare edge cases. They are standard red flags.
Instead of prioritizing risk, the system leaned toward the most common and least threatening explanation. In triage, that instinct can be dangerous.

Missing the Moment to Escalate
When elevated AST and ALT levels, colonoscopic ulceration, and ongoing weight loss were introduced, the model advised dietary modification and follow-up. In real practice, those findings change the equation. Raised liver enzymes together with mucosal ulceration require further evaluation. They warrant specialist review and correlation with biopsy findings.
Escalation was the appropriate next step. Reassurance was not. The model did not connect abnormal results with clinical urgency. It treated each signal as manageable rather than cumulative.
That is a triage failure.
When Data Is Not Synthesized
Ultrasound findings showed bowel-wall thickening. The model suggested this might not be significant, particularly if symptoms were mild. Clinical reasoning does not allow imaging, labs, and symptoms to exist separately. Mild thickening may still represent early disease when paired with abnormal labs and weight loss.
In this case, the pieces were present. They simply were not brought together. Without synthesis, even accurate data becomes misleading.

The Risk of Gentle Language
One of the most concerning aspects was tone. The model described IBS as uncomfortable but not dangerous, even after ulceration appeared on colonoscopy and liver function tests were abnormal.
Empathy matters. But empathy without caution can create false reassurance. When a colonoscopy shows ulceration, malignancy must be excluded before offering comfort. Soft language cannot replace clinical discrimination.
In healthcare, a calm answer that lowers vigilance can delay action.
Why This Matters
If this pattern were repeated in real consultations, the consequences could include delayed diagnosis of inflammatory bowel disease or associated malignancy, misplaced reassurance in progressive disease, premature self-medication based on suggested drugs, and declining clinician trust in AI systems meant to support care.
The concern here is not fabricated information. The outputs were coherent. The language was appropriate. The structure made sense.
The issue was prioritization. In medicine, identifying what is dangerous matters more than explaining what is common.
The Real Test
AI in healthcare is not judged by how well it explains symptoms. It is judged by whether it guides patients safely to the right next step. A response can sound balanced and still miss the point.
Until systems consistently recognize red flags, integrate multiple data sources, and escalate appropriately when risk appears, they remain supportive tools rather than dependable decision partners. In clinical care, sounding intelligent is not enough. Judgment is what protects patients.