


Accuracy isn't trust
January 15, 2026
Topics: Doctors, AI
When we talk about healthcare AI, we keep using the word "accuracy." We benchmark it, we publish it, we market it. But accuracy isn't what doctors are actually evaluating when they size up an AI tool. They're evaluating whether they can trust it. These sound like the same thing, but they're not.
Years ago, I spoke at a conference at Mayo Clinic. The emcee was John, a morning radio host in New York City. He'd been invited because he uses a wheelchair (he's paralyzed from the waist down) and could speak to the healthcare system as someone who lives inside it, not just observes it. John lives in Red Hook, Brooklyn, and every weekday he takes the subway into lower Manhattan for work, in a wheelchair, through a transit system built decades before anyone thought about accessibility.
After the conference, we shared a cab from Rochester to Minneapolis to catch our flights home. I asked him what app he used to navigate the subway. He told me he doesn't use one. He'd tried them all, but he'd been burned so many times that he stopped trusting any of them. The apps would show a station entrance as wheelchair-accessible when it wasn't. If he goes to a station and the accessible entrance doesn't exist, he has to ride to the end of the line to find one that works. He can't afford that kind of delay because he has to be on air at 6am. So I asked what error rate would be acceptable. He thought about it and said, "If something is wrong 5% of the time, that's too much."
I've thought about that conversation for years. Not because 95% is some magic number, but because of what John was actually telling me. The problem wasn't the apps' overall accuracy. The problem was that he couldn't predict when they'd be wrong. The failures were random. And random failures, even rare ones, meant he couldn't build his life around the tool. So he built his life around what he'd personally verified instead: routes he'd tested, stations he knew. The apps might have been accurate 95% of the time in aggregate, but he couldn't trust them.
This is where physicians find themselves right now with clinical AI. The tools work like magic. They're also confidently wrong in ways that are hard to detect. And doctors can't tell which situation they're in. Here's an analogy. You're a pilot. You've been assigned a co-pilot who has a drinking problem and is sober about 70% of the time. You don't get to choose whether you fly with him. If his drinking is random, you can't rely on him for anything that matters. Every flight, you're basically flying alone, except now you also have to watch whether he's about to do something dangerous. He's not helping you. He's adding to your workload. But what if you could predict it? What if you learned the tells, the signs that today is a bad day, the tasks you can hand off versus the ones you need to keep? Now you have something to work with. Not a perfect co-pilot, but one whose limitations you understand. That's the difference between accuracy and trust. A tool that's wrong 10% of the time in ways you can anticipate is more useful than a tool that's wrong 5% of the time at random.
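To make that last claim concrete, here's a toy back-of-the-envelope sketch in Python. Nothing in it comes from a real product: the case count, the "known 20% weak slice," and the review model are all invented purely to illustrate why predictable errors cost less than random ones.

```python
# Toy comparison: predictable 10% error rate vs. random 5% error rate.
# All numbers here are invented for illustration; they are not from the post.

N = 10_000  # hypothetical number of AI-assisted cases per year

# Tool A: wrong 10% of the time, but every error falls inside a known
# 20% slice of cases (say, one clinical context the vendor has mapped).
# The physician double-checks only that slice and catches every error.
tool_a_reviews = 0.20 * N
tool_a_unreviewed_risk = 0.0   # no errors occur outside the reviewed slice

# Tool B: wrong 5% of the time, uniformly at random. There is no safe
# slice to skip, so catching the errors means reviewing everything;
# any case left unreviewed carries the full 5% risk.
tool_b_reviews = 1.00 * N
tool_b_unreviewed_risk = 0.05

print(f"Tool A (predictable 10% errors): review {tool_a_reviews:,.0f} of {N:,} cases")
print(f"Tool B (random 5% errors):       review {tool_b_reviews:,.0f} of {N:,} cases")
# Tool A is less accurate overall, yet it takes 80% of the checking off
# the physician's plate. Tool B saves nothing: every case still needs a
# second look, which is the co-pilot you end up flying around.
```

The point isn't the specific numbers; it's that predictability, not the headline error rate, determines how much work the tool actually takes off the physician's plate.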
So what does this mean for people building clinical AI? There are two jobs. The first is engineering: making the model better and reducing errors, which amounts to getting the co-pilot into recovery. That work is essential, but it's not enough. The second job is building understanding: helping engineers and physicians see when the tool is likely to fail, why it fails in those situations, and what that means for how doctors should use it. Most AI companies know their benchmark scores, but they often can't tell you the clinical contexts where their model struggles, the types of cases where it's overconfident, or the situations where it's reliable enough to lean on. Without that map, physicians have to build their own through trial and error. Some will get burned and stop using the tool entirely, like John with his subway apps. Others will over-trust and miss the moments when it's confidently wrong.
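What might that map look like in practice? Here's a minimal sketch, assuming you have labeled evaluation cases tagged with a clinical context and the model's stated confidence. The record format, the context names, and every number below are hypothetical, not drawn from any real tool.

```python
from collections import defaultdict

# Hypothetical evaluation records: (clinical_context, model_confidence, was_correct).
# The contexts and values are made up; a real evaluation set would have
# thousands of labeled cases per context.
records = [
    ("chest_pain",      0.95, True),
    ("chest_pain",      0.90, True),
    ("chest_pain",      0.85, False),
    ("pediatric_fever", 0.92, False),
    ("pediatric_fever", 0.88, False),
    ("pediatric_fever", 0.80, True),
    ("med_reconcile",   0.75, True),
    ("med_reconcile",   0.70, True),
]

stats = defaultdict(lambda: {"n": 0, "correct": 0, "conf_sum": 0.0})
for context, confidence, correct in records:
    s = stats[context]
    s["n"] += 1
    s["correct"] += int(correct)
    s["conf_sum"] += confidence

print(f"{'context':<18}{'accuracy':>10}{'avg conf':>10}{'overconfidence':>16}")
for context, s in sorted(stats.items()):
    accuracy = s["correct"] / s["n"]
    avg_conf = s["conf_sum"] / s["n"]
    # A large positive gap between stated confidence and actual accuracy
    # flags exactly the contexts where physicians most need a warning.
    print(f"{context:<18}{accuracy:>10.2f}{avg_conf:>10.2f}{avg_conf - accuracy:>16.2f}")
```

Even a table this small is the beginning of the map: "strong on medication reconciliation, shaky and overconfident on pediatric fever" is something a physician can actually work with.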
Physicians don't think in benchmarks. They think in trust. Can I rely on this? When should I double-check? What's it good at? Where does it fall apart? A tool can be highly accurate and still not trusted because physicians can't see its limitations. A tool can be imperfect and still trusted because physicians know exactly when to rely on it and when to use their own judgment. The companies that become physician favorites will be the ones that can say: "Here's when our tool is strong. Here's where it struggles. Here's how to use it well." That's not marketing. It's what trust requires.
Are you a doctor interested in the future of healthcare?
Curious to see how Automate.clinic can help with your model's accuracy?


