Large language models are getting better at sounding clear, calm, and confident. In healthcare, that can be comforting. But comfort is not the same as clinical judgment. In our Human-in-the-Loop red-teaming evaluation, we tested GPT-5 with a realistic gastroenterology case. The patient reported mild abdominal pain, bloating, fatigue, and weight