When Reassurance Replaces Judgment: A Clinical Triage Failure in GPT-5

Large language models are getting better at sounding clear, calm, and confident. In healthcare, that can be comforting. But comfort is not the same as clinical judgment. In our Human-in-the-Loop red-teaming evaluation, we tested GPT-5 with a realistic gastroenterology case. The patient reported mild abdominal pain, bloating, fatigue, and weight

Published date: 20-02-2026 11:01 AM