Why Prompt Engineering Isn’t Enough: The Hidden Risks of AI in Early Pregnancy Triage

Why Prompt Engineering Isn’t Enough: The Hidden Risks of AI in Early Pregnancy Triage

When AI Sounds Right but Thinks Differently

AI today can summarize medical guidelines, explain conditions, and answer health questions in fluent and confident language. It often sounds like a trained doctor. That confidence is impressive, but in healthcare, sounding right is not the same as being safe.

As part of a human-in-the-loop red-teaming evaluation, we tested a large language model in early pregnancy triage scenarios using a clinician-designed safety framework. We presented the same medical situation in different ways, including hopeful, fearful, analytical, and step-by-step reasoning prompts. 

The clinical facts remained unchanged. Yet the AI’s prioritization, escalation timing, and decision patterns shifted. This was not a minor inconsistency. It revealed a safety-critical gap in clinical judgment.

Why Clinical Judgment Is More Than Medical Knowledge

In medicine, knowing the right facts is only part of safe care. True safety comes from stable judgment, especially when information is incomplete or when patients express fear, hope, or uncertainty.

A patient’s emotional tone does not change the underlying medical risk. Clinicians anchor decisions to physiology and probability, not to how a question is phrased. Most AI evaluation systems assume that if a model performs well under one prompt, it will perform similarly under slightly different ones. 

In healthcare, this assumption can fail. Even small changes in wording or emotional context can shift how AI prioritizes risk.

The Scenario: Early Pregnancy and Abdominal Pain

We focused on a patient at around six weeks of pregnancy presenting with abdominal pain. From a clinical standpoint, this situation always requires ruling out ectopic pregnancy before offering reassurance.

Ectopic pregnancy is uncommon, but it is life-threatening if missed. Early symptoms can be subtle. Delayed recognition can lead to internal bleeding, infertility, or death. Because of this, triage protocols treat ectopic pregnancy as a must-not-miss diagnosis. 

Our goal was not to test whether the AI knew this fact. It did. The goal was to test whether its judgment stayed consistent when the same case was framed differently.

Same Medical Facts, Different Question Styles

We presented the same scenario in four ways. One prompt expressed hope that nothing serious was happening. Another expressed fear that something was wrong. A third asked for a list of possible causes. A fourth requested step-by-step clinical reasoning.

We observed how the AI handled risk prioritization, timing of escalation, ordering of diagnoses, and whether reassurance appeared before dangerous conditions were excluded. The medical reality never changed. Only the phrasing and emotional framing did.

How the AI Responded

When the prompt carried a hopeful tone, the AI leaned toward benign explanations. It reassured early and delayed urgent evaluation, even though the clinical risk was unchanged.

When the prompt expressed fear, the AI showed more empathy and reassurance. However, urgency did not consistently increase to match the actual risk. Emotional validation sometimes softened triage clarity.

When asked to list causes, the AI expanded across many diagnoses. High-risk and low-risk conditions appeared side by side, flattening the risk hierarchy.

When asked to reason step by step, the AI produced structured explanations. Yet escalation came after long reasoning rather than before, meaning life-threatening conditions were addressed later than clinically appropriate.

Across all versions, medical knowledge was present. What varied was clinical judgment. The ability to prioritize danger, escalate early, and sequence decisions safely shifted with surface-level prompt changes.

Why Prompt Engineering Cannot Fix This Alone

Careful prompting can improve how an AI responds on the surface. It cannot guarantee stable safety behavior. A system that performs safely only under ideal phrasing is not safe for real-world healthcare.

Benchmarks typically measure correctness and guideline recall. They do not test how models behave when emotional tone shifts, when phrasing changes, when information is incomplete, or when urgent triage is required. As a result, models can appear reliable in controlled settings while remaining fragile in real clinical conditions.

What This Means for Healthcare AI

Clinical safety depends on consistent prioritization of risk, not just fluent explanations. Emotional sensitivity without strong risk anchoring becomes dangerous. Prompt engineering cannot substitute for clinical judgment. Healthcare AI must be evaluated for decision stability, escalation timing, and harm prevention under realistic variation, not only for answer accuracy.

In medicine, being mostly right is not enough. Safety requires being right at the right moment.

Also Read