How 100 hours of physician time beats any benchmark

As a team building a clinical AI product, you're probably shipping updates every two to four weeks. But if your clinical evaluation takes three months to run, the results you get back describe a product you no longer ship. How are you making decisions in the meantime? Gut check? Automated evals based on single-turn prompts? Multiple-choice benchmark scores?

Those aren't wrong, exactly. They're just answering a narrower question than you think they are.

Most clinical AI isn't evaluated on the thing that actually matters in deployment: the full trace. A trace is the complete record of a user session. Every message, every tool call, every document retrieved, every intermediate step from the first user input to the final response. A physician using a clinical AI tool to work through a patient case isn't sending a single prompt. They're asking follow-up questions, pulling context, triggering tool calls, refining the query. What you're really evaluating is the whole sequence, and a multiple-choice benchmark doesn't touch it.
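To make "the full trace" concrete, it helps to think of it as an ordered event log rather than a prompt/response pair. A minimal sketch (the `Trace`/`TraceEvent` names, fields, and the sample session are illustrative, not from any specific product):

```python
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    kind: str      # "user_message", "retrieval", "tool_call", "assistant_message"
    content: str

@dataclass
class Trace:
    session_id: str
    events: list = field(default_factory=list)

    def turns(self) -> int:
        """Number of user messages in the session -- a multi-turn trace has several."""
        return sum(1 for e in self.events if e.kind == "user_message")

# A session is the whole back-and-forth, including intermediate steps,
# not just the final answer.
trace = Trace("sess-001", [
    TraceEvent("user_message", "65F with CKD, starting metformin?"),
    TraceEvent("retrieval", "renal dosing guideline excerpt"),
    TraceEvent("assistant_message", "Check eGFR before initiating..."),
    TraceEvent("user_message", "eGFR is 28."),
    TraceEvent("assistant_message", "Metformin is contraindicated below eGFR 30."),
])
```

A benchmark scores only the last `assistant_message`; a reviewer reads all five events, including what was retrieved and what the user had to ask twice.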

Automated single-turn evals are good for what they're good for: catching regressions on narrow, well-defined tasks, making sure you didn't break something you already had working. But they miss a different class of problem entirely. In a recent client evaluation, what surfaced were consistent patterns in medication dosing and drug selection errors across four specific categories. The product wasn't hallucinating. It was confident and fluent. What it was doing was leaving things out, omitting the relevant clinical consideration often enough that it constituted real risk. A recent study called NOHARM found that 76% of severe clinical AI errors are exactly this: omissions, not confabulations. That's not something a multiple-choice benchmark catches. It's something a physician reading a trace noticed, wrote about, and then noticed again, and again, until it was undeniable.

So if not automated evals, then what?

There's a process that works, and most teams are either not doing it at all or not doing it at scale. Let's call it the clinical vibe check.

Take 100 traces (real sessions from real users) and put them in front of an expert physician. Ask them to write freely about what they see. No scorecard, no forced ranking. Just: what do you notice? Where does this feel off? What would you want an engineering team to know?
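The setup is deliberately simple: a random sample of real sessions and one free-text field per trace. A sketch of what that might look like in code (the function names, the sample size default, and the packet shape are all assumptions for illustration):

```python
import random

def sample_review_batch(traces, n=100, seed=42):
    """Pick n real sessions at random for physician review (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))

def review_packet(trace):
    """One free-text slot per trace: no scorecard, no forced ranking."""
    return {
        "trace_id": trace["id"],
        "prompt_to_reviewer": (
            "What do you notice? Where does this feel off? "
            "What would you want an engineering team to know?"
        ),
        "free_text": "",  # the physician writes here
    }

# Illustrative data: 500 sessions, of which 100 go to the reviewer.
traces = [{"id": f"sess-{i:03d}"} for i in range(500)]
batch = [review_packet(t) for t in sample_review_batch(traces)]
```

The only structure imposed is the open-ended question itself; everything else is left for the reviewer to surface.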

The first 100 traces with one physician aren't about getting a definitive answer. They're about finding the themes. Maybe the product is verbally exhausting, the kind of thing that makes you think you're listening to a med student present a case. Maybe it's anchoring on the most obvious diagnosis and ignoring alternatives. Maybe errors are clustering around a specific medication class or patient population. You don't design for those findings in advance. You discover them.

Once you have themes, you scale it. Put five physicians on the next 100 traces. Each one writes freely. Now you have 600 observations for roughly 100 hours of expert time, and the themes aren't faint signals anymore. The product struggles with weight-based dosing in pediatric patients. The product is so long-winded that clinicians stop reading before the relevant recommendation. Whatever it is, now you know specifically enough to do something about it.
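Once several reviewers have written freely, the aggregation step can be as simple as tallying which themes recur across observations. A toy sketch, with the caveat that in practice the themes emerge from reading the free text first; the tags and trigger phrases below are made up to show the mechanics:

```python
from collections import Counter

# Hypothetical theme tags discovered from a first read of the free text.
THEME_KEYWORDS = {
    "verbosity": ["too long", "long-winded", "stopped reading"],
    "dosing": ["dose", "dosing", "mg/kg", "weight-based"],
    "anchoring": ["anchored", "ignored alternative", "didn't consider"],
}

def tag_observation(text):
    """Return the set of themes a single free-text observation touches."""
    text = text.lower()
    return {theme for theme, kws in THEME_KEYWORDS.items()
            if any(kw in text for kw in kws)}

def theme_counts(observations):
    """Tally theme mentions across all reviewers' observations."""
    counts = Counter()
    for obs in observations:
        counts.update(tag_observation(obs))
    return counts

obs = [
    "Weight-based dosing was wrong for this pediatric patient.",
    "Response was long-winded; I stopped reading before the recommendation.",
    "Anchored on the obvious diagnosis and ignored alternatives.",
    "Dose of enoxaparin not adjusted for renal function.",
]
counts = theme_counts(obs)
```

When the same theme shows up across independent reviewers on independent traces, it stops being one physician's impression and becomes a finding.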

Once you have a clear theme, you build targeted prompts that probe it directly. If medication dosing is the issue, you create 100 prompts specifically designed around medication dosing scenarios, then run those through the same vibe check: a few physicians, free-form observations, plus a handful of anchoring questions to keep things consistent across reviewers.

The anchoring questions matter, not because they replace the free text but because they create comparability. Was the response clinically safe? Was anything clinically important missing? Would you trust this output in a real patient encounter? Three or four questions like that, rated simply, give you a baseline you can measure against. The free text gives you everything you didn't think to ask about.
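Turning those yes/no ratings into a baseline is straightforward arithmetic. A minimal sketch, assuming each review answers the three questions from the text (the field names and sample ratings are invented):

```python
ANCHOR_QUESTIONS = [
    "clinically_safe",            # Was the response clinically safe?
    "nothing_important_missing",  # Was anything clinically important missing? (yes = nothing missing)
    "would_trust_in_encounter",   # Would you trust this output in a real patient encounter?
]

def baseline(reviews):
    """Fraction of yes answers per anchoring question, across all reviews."""
    n = len(reviews)
    return {q: sum(1 for r in reviews if r[q]) / n for q in ANCHOR_QUESTIONS}

# Four illustrative reviews (in practice: several physicians x many traces).
reviews = [
    {"clinically_safe": True,  "nothing_important_missing": False, "would_trust_in_encounter": False},
    {"clinically_safe": True,  "nothing_important_missing": True,  "would_trust_in_encounter": True},
    {"clinically_safe": False, "nothing_important_missing": False, "would_trust_in_encounter": False},
    {"clinically_safe": True,  "nothing_important_missing": True,  "would_trust_in_encounter": False},
]
scores = baseline(reviews)
```

The numbers themselves matter less than the fact that the same questions, asked the same way, can be re-asked after every change.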

Now your engineers make changes. Run it again. Compare to your baseline. Did those changes help? You have an actual answer, not a benchmark score that may or may not correlate with real-world clinical performance, but a specific, clinically grounded read on whether your product got better at the thing you knew it was failing at. Then you uncover the next weakness and repeat the cycle.
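The before/after comparison is then a per-question delta between two evaluation rounds. A sketch with invented numbers (the yes-rates below are illustrative, not from any real evaluation):

```python
# Hypothetical yes-rates from two rounds on the same 100 targeted prompts.
before = {"safe": 0.78, "nothing_missing": 0.61, "would_trust": 0.55}
after  = {"safe": 0.85, "nothing_missing": 0.74, "would_trust": 0.63}

def compare_runs(before, after):
    """Per-question change in yes-rate between two evaluation rounds."""
    return {q: round(after[q] - before[q], 3) for q in before}

delta = compare_runs(before, after)
```

A positive delta on the question tied to the theme you targeted is the answer "did those changes help?", stated in the reviewers' own terms.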

Most teams get stuck at the execution layer, and it's not really a methodology problem. A team relying on two or three physician advisors who are also practicing full-time gets clinical feedback when those advisors have time, which is not the same as when you need it. A two-week sprint becomes a three-month eval cycle not because the evaluation itself takes that long, but because coordinating a few busy physicians to review traces is a side project for all of them. The process above (100 traces, one physician, free-form observations) probably takes two weeks of calendar time when clinical evaluation is someone's actual job, and three months when it isn't.

How are you defining your product getting better? Gut check is honest but not systematic. Automated benchmarks are systematic but skip the failure modes that matter most. What you actually want is a process you can run on a repeatable cadence, one that produces findings specific enough for an engineering team to act on and a baseline you can measure against over time. Most clinical AI companies don't have that yet, largely because building it means solving a coordination problem, not just a methodology problem.

We built Automate Clinic to solve exactly this. The physicians in our network move at engineering speed because clinical evaluation is their job in this context, not a side project. The result is clinical feedback that comes back in days, specific enough to tell your engineering team where the product is failing and what to fix.

There's also a natural next step once you've built that clinical baseline, and it's where physician evaluation and automation start working together rather than competing. The physician verdicts from your vibe checks aren't just findings. They're training signal for an automated judge calibrated to clinical reality, rather than one AI grading another AI with no grounding in what good care looks like. I'll write about that next.

If you're building clinical AI and you don't have a systematic process for knowing your product is improving, I'd like to talk.

Clinical AI evaluation is a new kind of clinical work, and we're building the physician community around it.

Most clinical AI teams don't have a repeatable process for knowing their product is improving. We build that.
