We've all had a chuckle about the occasional hallucination by generative AI: the time it recommended using glue to keep cheese from sliding off a piece of pizza; the time an Air Canada chatbot promised a passenger a bereavement fare despite a policy to the contrary, and the airline had to honor that promise; the time a lawyer unknowingly submitted a brief to a judge that cited court cases that never existed; and so on.
But a couple of recent stories go well beyond the chuckle level. While generative AI continues to show all the promise in the world, these stories demonstrate persistent problems that, left unchecked, could lead to severe consequences.
Let's start with the one about war games in which large language models almost always recommended escalating to nuclear weapons.
As Axios reports, a researcher at King's College London pitted three popular LLMs — GPT-5.2, Claude Sonnet 4 and Gemini 3 Flash — against each other in 21 war games in which the AIs acted as the leaders of major nations. The scenarios included threats to survival but also lower-stakes conflicts, such as border skirmishes and resource competition. Yet 95% of the time, at least one of the LLMs "used" nuclear weapons, and escalation typically ensued.
For anyone without a strong Dr. Strangelove streak, those results reflect a scary misjudgment. While the U.S. and the Soviet Union considered tactical nuclear weapons legitimate parts of their arsenals in the early years of the nuclear age, that was also the era when both countries casually considered using nuclear weapons for industrial purposes such as mining and natural gas extraction. It's been clear for decades that nuclear weapons are simply too powerful for their effects to be limited to legitimate military or industrial targets.
Even at one kiloton, the smallest yield for what's considered a tactical nuclear weapon, the explosion would be 100 times as powerful as the biggest conventional bomb in the U.S. arsenal. At the top end of the range for a tactical nuclear weapon (generally considered to be 100 kilotons), the explosion would be some seven times as powerful as the bomb dropped on Hiroshima, which destroyed a military target but also killed an estimated 140,000 people, the vast majority of them civilians. The radiation released can also reach far beyond the targeted area.
While the King's College researcher noted that no one is handing AIs the keys to nuclear weapons systems, he said, "Militaries are already using AI for decision support — and research suggests those systems may lean into rapid escalation under pressure."
The other article that caught my eye relates to ChatGPT Health. The app, launched in January, is consulted by some 40 million people every day — and a study found the potential for major problems with its diagnoses. ChatGPT Health told more than half of the study's hypothetical patients who should have sought immediate medical care to stay home or to wait and schedule a regular appointment with a doctor.
The Guardian article said: "In one of the simulations, eight times out of 10 (84%), the platform sent a suffocating woman to a future appointment she would not live to see.... Meanwhile, 64.8% of completely safe individuals were told to seek immediate medical care."
For the study, published in the journal Nature Medicine, researchers created 60 realistic patient scenarios covering health conditions from mild illnesses to emergencies, then presented those scenarios to ChatGPT Health in various ways: varying the patient's gender, sometimes providing test results, sometimes adding comments about what "friends" advised, and so on. Three independent doctors reviewed each scenario and agreed on the level of care needed, based on clinical guidelines.
The study found that ChatGPT Health did well on textbook emergencies such as stroke and severe allergic reactions. But "what worries me most," a doctor is quoted as saying in the article, "is the false sense of security these systems create. If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life."
Any number of health experts have extolled the potential for AI-based health advice, coupled with wearables and telemedicine, to revolutionize healthcare — providing care to the elderly and to people in rural areas who otherwise have difficulty getting access, while slowing the inexorable rise in healthcare costs. And I've bought in: Chunka Mui, Tim Andrews and I included a lengthy scenario about the potential for AI-based healthcare in our 2021 book, "A Brief History of a Perfect Future."
I still think the potential is there, too. As OpenAI, the developer of ChatGPT, told the Guardian, the app is updated and improved all the time, and I hope they keep charging ahead. (OpenAI also said it doesn't believe the study reflects how people actually use ChatGPT Health.)
But I also hope they are constantly checking for problems such as those identified in the study, and anyone else using AI in situations with major consequences should exercise similar care. That includes insurers, and not just in healthcare. As we feel our way toward using AI agents, we need to be very careful not only to vet them before putting them into production but also to supervise them once they're there (because they absolutely will make mistakes) and to keep improving them.
Cheers,
Paul
