AI Failure Mode Literacy

I live about fifteen minutes west of Boulder, roughly 2,000 feet higher than the city, way up a switchbacked dirt road in the mountains. It's a beautiful place to live and a terrible place to be wrong about the weather. If I misjudge a storm, I can't get home or can't leave the house. If the power goes out, I lose water and heat. The stakes of a bad forecast are immediate for me.

When I moved here a few years ago, I did what everyone does. I checked Apple Weather. And for about 300 days a year, Apple Weather is perfectly fine. Boulder gets more sunshine than San Diego. Most days are easy calls.

But the other 65 days will ruin you if you're not paying attention. The Front Range of the Rockies is one of the hardest places on earth to predict weather. Cold air pools against the mountains in ways models struggle with. Upslope storms materialize out of nothing. A forecast calling for two inches of snow can deliver fourteen. There's a local meteorologist named Chris Bianchi who's become something of a legend along the Front Range. (Say his name fast enough and it starts sounding like "Crispy Donkey," which is how he ended up being called "The Donk" in my friend group's weather debates.) The Donk sees something interesting brewing in the local models five to seven days before it arrives and consistently posts the same warning: "Hey folks, don't trust your automated weather apps this week." He knows exactly when the algorithms break down, and he tells people before they make bad decisions.

After a few years here, I've built my own system. Apple Weather for the 300 easy days. And when the weather gets wild, a combo of The Donk and OpenSnow, which is a weather service built specifically for mountainous regions. OpenSnow exists because the mountains are full of microclimates that general-purpose weather apps can't resolve. My house is about a half mile past the local volunteer fire station. The locals call everything below the fire station "the tropics" because almost without fail, it's raining before the fire station and snowing by the time you pass it on the climb up to my place. That's the kind of granularity OpenSnow captures and Apple Weather has no idea about. Knowing which tools to trust when, and when to override all of them with my own judgment, took a few years of lived experience.

I didn't learn any of this from a manual. I learned it because the consequences of being wrong forced me to pay attention to how my tools fail, not just whether they work.

I've started calling this AI failure mode literacy. I think it's one of the most underdeveloped capabilities in professional life right now, and the gap is widest in medicine.

What is failure mode literacy?

Every experienced professional already has this for their traditional tools, even if they've never called it that. A carpenter knows that a circular saw binds in wet wood. A pilot knows that GPS accuracy degrades in certain atmospheric conditions. These aren't bugs. They're known behaviors that professionals learn to anticipate and work around. You use a tool, it fails in a specific way, you remember that failure, and over time you build a mental map of where the tool is reliable and where you need to compensate. That map is failure mode literacy.

Deterministic tools make this relatively easy. They fail the same way every time. You learn it once and you're set. AI is different in a way that matters enormously. It's stochastic. Ask it the same question on Tuesday and you might get a different answer than you got on Monday. Your clinical decision support tool might handle a drug interaction correctly nine times and miss it on the tenth, for reasons that aren't visible to the person using it. You can't learn the failure once and be done. You need to understand the patterns of failure, the conditions that make failure more likely, and the categories of question where the tool tends to be weaker.
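
To make the stochastic point concrete, here is a minimal sketch of what repeated sampling looks like in practice. The `ask` callable is a hypothetical stand-in for whatever wraps the tool you're evaluating, not a real API; the output is just a tally of how often the tool agrees with itself on a single question.

```python
from collections import Counter

def consistency_report(ask, question, runs=10):
    """Ask the same question repeatedly and tally how the answers vary.

    `ask` is a placeholder for whatever callable wraps the AI tool. A stochastic
    tool has to be sampled repeatedly before its failure pattern is visible,
    because a single spot check can land on a lucky run.
    """
    answers = [ask(question) for _ in range(runs)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "question": question,
        "runs": runs,
        "distinct_answers": len(counts),
        "modal_answer": modal_answer,
        "agreement_rate": modal_count / runs,  # 1.0 means it answered identically every run
    }
```

Run something like this over a set of drug-interaction questions and a low agreement rate flags exactly the nine-out-of-ten behavior described above, before a patient is the one who discovers it.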

And unlike my weather situation, where I had years of low-stakes trial and error, most professionals using AI tools are being asked to trust them immediately, on high-stakes decisions, with no failure mode map at all.

Why this literacy doesn't exist yet in healthcare AI

It starts with incentives. The companies building these tools have no reason to publish their failure modes. No one puts "our product is particularly unreliable at geriatric drug dosing" in a sales deck. Companies publish benchmark scores that aggregate performance across thousands of test cases, producing a single accuracy number that obscures the specific conditions where the tool falls apart. An AI tool that scores 92% on a clinical reasoning benchmark might be 99% accurate on common presentations and 40% accurate on rare drug interactions. That 92% tells you nothing about where to be careful.
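
A rough sketch of the alternative, assuming you have access to the underlying evaluation results: stratify the same data by category instead of collapsing it into one number. The field names and categories here are illustrative, not any vendor's actual schema.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of dicts like {"category": "rare drug interactions", "correct": False}.

    Returns per-category accuracy alongside the overall score, so a 92%
    headline number can't hide a 40% category.
    """
    tallies = defaultdict(lambda: [0, 0])  # category -> [n_correct, n_total]
    for r in results:
        tallies[r["category"]][1] += 1
        tallies[r["category"]][0] += int(r["correct"])
    report = {cat: n_correct / n_total for cat, (n_correct, n_total) in tallies.items()}
    report["overall"] = sum(c for c, _ in tallies.values()) / sum(t for _, t in tallies.values())
    return report
```

The same test cases that produce the aggregate 92% also produce the 99%-on-common-presentations, 40%-on-rare-interactions breakdown. The information exists; it just isn't reported that way.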

On the buyer side, health systems don't have the methodology to discover failure modes independently. Most hospitals don't employ people whose job is to systematically stress-test an AI tool the way The Donk stress-tests weather models. They rely on the vendor's own evaluation data or on individual clinicians noticing errors one at a time. That's the equivalent of me figuring out my weather tools by getting stranded on my road repeatedly until I noticed the pattern. It works eventually, but the cost of the learning process is unacceptable when the stakes involve patients instead of groceries.

And there's a subtler problem. The most dangerous AI failure mode is omission. Research on clinical AI errors has found that roughly 76% of severe failures are omission errors, not hallucinations. The tool doesn't say something obviously wrong. It says something that sounds right but leaves out the piece of information that would change the clinical decision. We recently ran a blinded evaluation of three clinical AI tools where our physician evaluators discovered a concentrated cluster of medication errors in one product. Wrong conversion ratios, incorrect indication-specific dosing, drug safety classifications that appeared to be fabricated. All embedded in responses that otherwise sounded authoritative and clinically solid. A physician using that tool would have no way to know that medication dosing was the specific zone where it couldn't be trusted. You can't build literacy for a failure you never notice happening.
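
For what it's worth, here is a sketch of the kind of record that surfaces a cluster like that. Every field name is hypothetical; the point is that findings tagged with an error type and a clinical domain can be rolled up into "this tool is weak on medication dosing," which untagged anecdotes never can.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    tool: str         # which product the evaluator was reviewing
    domain: str       # e.g. "medication dosing", "triage", "differential diagnosis"
    error_type: str   # e.g. "omission", "fabrication", "wrong value"
    severe: bool      # would this plausibly change the clinical decision?

def severe_clusters(findings):
    """Count severe findings by (tool, domain) so concentrated weaknesses stand out."""
    return Counter((f.tool, f.domain) for f in findings if f.severe)
```

A pile of severe findings under a single (tool, domain) pair is the signal that turns one evaluator's anecdote into guidance a whole clinical staff can act on.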

Where this leads

This is ultimately an institutional problem, not an individual one. I built my weather literacy on my own, over years. A health system can't wait for two thousand clinicians to each independently discover the failure modes through trial and error on real patients. The institution needs to provide that literacy proactively, and that requires systematic discovery of where the tools fail, then translation of those findings into something clinicians can act on.

The Donk is a good model for what that looks like. He sits between the automated tools and the people who depend on them. He understands the conditions where the models fail. And when those conditions arise, he translates that knowledge into plain-language guidance that you can act on. He goes further than saying "the models are unreliable." He says "don't trust your automated apps this week, for this specific weather pattern, because the models consistently get this wrong."

Healthcare AI needs the same thing, and a single accuracy score on a marketing page doesn't get you there. Clinicians need to know where their tool fails, under what conditions, and what to do differently when they're in that zone. Health systems need to know the categorical weaknesses of the tools they've deployed and how to communicate those to the people using them every day.

I spent years building my own failure mode literacy for mountain weather, and I had the luxury of the stakes being personal inconvenience. Clinicians using AI tools on patients don't get that luxury. Someone needs to do the work for them. That's what we're building at Automate Clinic, and I think the demand for it is about to become very obvious very quickly.
