AI Needs an 'I Don't Know' Feature

Insurance AI scales when it defers to human expertise and flags uncertainty, not when it claims to automate everything.


The AI that survives contact with insurance production isn't the one that claims to handle everything. It's the one that defers to human experts on configuration — and tells them when it isn't sure.

It's 4:45 p.m. on a Friday. An underwriter is staring at a 50-page submission that just landed in her inbox. Half the fields on the broker's application are blank, the loss runs are scanned PDFs with blurry text, and the broker's note says that if she can quote it by Monday, the business is hers.

A few years ago, that submission would have either consumed the rest of her week or earned a polite "thanks but no thanks." Today, in a well-run shop, an AI document pipeline pre-processes the package overnight. By Monday morning the structured fields are sitting in her workflow tool, except for three fields highlighted in yellow, where the model wasn't sure. She spends 20 minutes verifying those three against the source PDFs, corrects one, accepts the others, and quotes the deal before lunch.

That highlighted-in-yellow moment is the entire game. It's the difference between AI that gets adopted and AI that gets quietly abandoned six months in. And it doesn't come from the model being smarter. It comes from two design choices most vendors are reluctant to lead with: the AI defers to a human expert on protocol and configuration, and it tells that human when it isn't confident in its own answer.

I've been building document AI workflows for insurance carriers, MGAs, and reinsurers for several years. The pattern that separates the systems that scale from the ones that get shelved isn't subtle. The systems that scale behave like an apprentice. The ones that get shelved behave like an oracle.

The apprentice mindset

Nobody hands a first-year underwriter the keys to a renewal book on day one. The new hire shadows a senior, learns the carrier's appetite, sees how the desk handles a tricky loss run, and runs every recommendation past someone with 20 years of context before it goes out the door. The expectation isn't that the apprentice arrives knowing everything; it's that they get faster, more accurate, and more independent through repeated cycles of review and correction.

That's what AI in insurance needs to look like. Not a system you switch on, but a system you train, configure, and refine with human expertise as the central input.

The work that this requires is real, and most vendors undersell it. The carrier has to define which document types matter, which fields the model needs to extract from each, which business rules govern acceptance, where the human handoff points sit, and what the escalation path looks like when the model is unsure. None of that is the AI's job. It's the expert's job. AI is only as useful as the configuration the experts give it.

This is one reason BCG found that 77% of insurance carriers are piloting AI but only 7% have scaled it. Model accuracy on isolated test sets is rarely the bottleneck. What most programs underinvest in is the configuration work: the continuing protocol-setting, review, and refinement that turns a generic model into a trusted production tool. Without it, the pilot looks great in a sandbox and quietly underperforms in the wild.

The "I don't know" feature

Most modern AI tools were built first for consumer use cases, where a confident wrong answer is a small cost. In insurance, a confident wrong answer is a mispriced policy, a wrongly denied claim, or a compliance exposure that surfaces 18 months later when a regulator asks how a decision got made.

That changes the design priority. A production-grade insurance AI doesn't just need to be accurate — it needs to know when it isn't. Field-level confidence scoring isn't a nice-to-have feature; it's the trust infrastructure that makes the whole apprentice model work.

When the system can tell a reviewer "I extracted this date of birth with 99% confidence and this loss history with 62%," three things change. First, review goes from "re-read everything to catch errors" to "check the flagged ones," the only review pattern that actually saves human time at scale. Second, reviewers build calibrated trust over time, learning which extractions they can skim and which need a careful look at the source. Third, the system produces an audit trail that regulators are increasingly going to require.

The NAIC Model Bulletin on AI, New York DFS Circular Letter No. 7, Colorado's Regulation 10-1-1, and the EU AI Act's high-risk obligations all share a through-line: AI used in insurance decisions must include meaningful human oversight, and carriers must be able to show their work. A system that flags its own uncertainty produces that documentation natively. A system that doesn't is one your compliance team will spend a year retrofitting.
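To make the "check the flagged ones" pattern concrete, here is a minimal Python sketch. It is an illustration only: the `ExtractedField` type, the field names, the sample values, and the 0.90 threshold are all assumptions made for this example, not any vendor's actual API or a recommended setting.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One extracted value plus the model's stated confidence (hypothetical type)."""
    name: str
    value: str
    confidence: float  # model-reported probability in [0, 1]


def fields_needing_review(fields, threshold=0.90):
    """Return only the fields whose confidence falls below the review threshold."""
    return [f for f in fields if f.confidence < threshold]


# Illustrative data echoing the example in the text.
submission = [
    ExtractedField("date_of_birth", "1980-04-12", 0.99),
    ExtractedField("loss_history", "3 claims, 2019-2023", 0.62),
    ExtractedField("policy_limit", "1,000,000", 0.95),
]

flagged = fields_needing_review(submission)
for f in flagged:
    # These are the "yellow" fields the reviewer checks against the source PDF.
    print(f"REVIEW {f.name}: {f.value!r} (confidence {f.confidence:.0%})")
```

In practice the threshold would likely be set per field by the carrier's own experts, since a wrong policy limit and a wrong middle initial do not carry the same cost.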

One important caveat. Confidence scores are only useful if they're calibrated — meaning when the model says it's 90% confident, it should actually be right roughly 90% of the time. An overconfident model with a meaningless score is worse than no score at all, because it teaches reviewers to ignore the signal. That's something buyers should test for, not assume.
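One way a buyer might test calibration is to take a sample of extractions that reviewers have already verified, group them by the confidence the model stated, and compare each group's stated confidence with how often the model was actually right. The sketch below does that and summarizes the gap as an expected calibration error. The function names, the bin count, and the sample data are assumptions for illustration.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Bin predictions by stated confidence and compare the mean stated
    confidence in each bin with the observed accuracy in that bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, ok))

    rows = []
    for items in bins:
        if not items:
            continue  # skip empty bins
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(rows):
    """Weighted average gap between stated confidence and observed accuracy."""
    total = sum(n for n, _, _ in rows)
    return sum(n / total * abs(mean_conf - acc) for n, mean_conf, acc in rows)


# Hypothetical audit sample: stated confidence, and whether a reviewer
# confirmed the extraction was correct.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65, 0.65, 0.65]
correct = [True, True, True, False, True, True, False, False]

rows = calibration_table(confidences, correct)
ece = expected_calibration_error(rows)
for n, mean_conf, acc in rows:
    print(f"n={n}  stated={mean_conf:.2f}  observed={acc:.2f}")
print(f"expected calibration error: {ece:.3f}")
```

A well-calibrated model shows stated and observed values close together in every bin. In this made-up sample the model claims about 95% on fields where it is right only three times in four, which is exactly the overconfidence that teaches reviewers to ignore the score.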

What carriers should ask before they buy

Most RFPs for insurance AI ask the wrong questions. They focus on benchmark accuracy, model size, and end-to-end automation claims — easy questions to answer in a slide deck, but not the ones that predict whether the system will still be in production a year from now.

The questions that predict adoption are about the apprentice posture. Can the system expose field-level confidence scores, and are they calibrated against actual accuracy? Can our experts configure new document types and business rules without filing a vendor ticket? When a reviewer corrects the model's output, does the system actually learn from the correction or just log it? Does the human review queue route work by confidence level, or dump everything into one bucket?
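The last of those questions, routing by confidence level, can be sketched in a few lines. This is a hypothetical example: the queue names, the cutoffs, and the choice to route on the least confident field are assumptions, and a real carrier would set them through its own experts' configuration.

```python
def route(field_confidences, standard_below=0.90, senior_below=0.70):
    """Pick a review queue from the least confident field in a document.

    field_confidences maps field name -> model confidence in [0, 1].
    Queue names and cutoffs are illustrative, not a recommendation.
    """
    lowest = min(field_confidences.values())
    if lowest < senior_below:
        return "senior_review"    # at least one field the model is really unsure of
    if lowest < standard_below:
        return "standard_review"  # moderately uncertain fields to verify
    return "spot_check"           # everything high-confidence; sample for audit


print(route({"date_of_birth": 0.99, "loss_history": 0.62}))
print(route({"date_of_birth": 0.99, "policy_limit": 0.85}))
print(route({"date_of_birth": 0.99, "policy_limit": 0.95}))
```

Routing on the weakest field is a deliberately conservative choice: one shaky loss history is enough to send the whole document to a more experienced reviewer, even if every other field is near certain.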

The carriers that ask these questions tend to end up with AI that gets used. The carriers that buy on autonomy claims tend to end up in the 60% of organizations that, per BCG, generate no material value from their AI investment despite continued spending.

The bet worth making

The right AI for insurance isn't the one that claims to do everything. It's the one that knows the expert is still in charge — and acts like it. It asks the expert for configuration. It defers when it isn't sure. It gets better when it's corrected. That posture isn't a limitation of the technology. It's the reason the technology survives contact with production.

The Friday afternoon submission is going to keep arriving. The question is whether your AI is going to help your underwriter quote it by lunch on Monday — or just give her a different kind of mess to clean up.


Sam Gobrail


Sam Gobrail is the U.S. head of delivery and solutions at Upstage.

Before Upstage, he led transformation programs for Fortune 100 companies and federal agencies. Earlier, he practiced law and managed multimillion-dollar federal portfolios.

Gobrail holds a juris doctor and MBA from American University, where he also teaches.
