Advice for AI Startup Diligence

When High Accuracy is not Very Accurate

What Questions to Ask When Evaluating "Accurate" ML/AI Algorithms

Irina Kukuyeva PhD
Feb 03, 2025

In a world where everyone is pitching their “AI” start-up and trying to prove their expertise in AI, the “accuracy” metric is often mentioned. But how accurate is this metric?

There are a few ways to (inadvertently) game the system, resulting in misleadingly high accuracy numbers that overstate how well an ML/AI algorithm is doing. Here’s advice on identifying these potential yellow/red flags, and some questions to ask in diligence to confirm or refute them.

Preliminaries: How do we measure accuracy in algorithms?

Suppose we have two possible outcomes we’re trying to predict: someone having an early onset of a condition – and not.

Accuracy Metric

Accuracy is the percentage of outcomes the algorithm got correct (Wikipedia).

For example, suppose we had 10 patients who wanted to know if they had an early onset of a condition based on information from their smartwatches – without clinical input.

  • Our algorithm flagged 5 of them as having the condition.

  • When all 10 went to their physician for a second opinion, it turned out the algorithm had gotten 4 wrong: 3 of the 5 flagged patients didn’t actually have the condition (false positives), and the algorithm missed 1 patient’s early onset of the condition (a false negative).

  • The algorithm’s accuracy is 60% (= (2 + 4) / 10: 2 patients correctly flagged plus 4 correctly cleared, out of 10 patients; see the code sketch below).
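
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The label arrays are a hypothetical encoding of the 10-patient example above (1 = has the condition, 0 = does not):

```python
from sklearn.metrics import accuracy_score

# Hypothetical encoding of the 10-patient example above:
# 1 = has the condition, 0 = does not.
# Ground truth (per the physicians): 3 patients actually have the condition.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# The algorithm flags 5 patients: 2 true positives, 3 false positives,
# and it misses 1 patient who does have the condition.
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))  # (2 + 4) / 10 = 0.6
```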

Incorrectly Pitched as the Accuracy Metric: Area Under the ROC curve

Sometimes, though, people will (incorrectly) refer to the “Area Under the ROC curve” (AKA ROC-AUC) as a measure of accuracy. That may be because it’s easier to talk about colloquially, or because it tends to look better when there is a majority class. But it’s not the same thing! So what is it?

Let’s return to our example of predicting the presence or absence of a condition. The algorithm typically returns the probability (between 0 and 1) of a patient having the condition, and so far we assumed that any probability greater than 50% means the patient has it. But what if we only predicted the condition when that probability was greater than 90%, because we want to be super sure? Or, instead, what if we set the cutoff at 25%, because we want people to get a second opinion and start treatment ASAP? The probability threshold changes who we predict will or won’t have the condition, and with it the percentage of outcomes the algorithm got correct, i.e., the algorithm’s accuracy!

Suppose we choose, say, 11 different probability cutoffs between 0% and 100% (e.g., 0%, 10%, 20%, 30%, …, 90%, and 100%) and, for each cutoff, rerun our predictions of who will and won’t have the condition. We then plot 11 points with the following x-y coordinates:

  • On the X-axis: the fraction of patients who don’t actually have the condition but whom we predicted would (the false positive rate), one point for each threshold;

  • On the Y-axis: the fraction of patients who do have the condition and whom we correctly predicted would (the true positive rate), one point for each threshold;

  • Connect the points;

  • Then, the area below this line is the “Area Under the ROC curve,” which shows how well the algorithm can distinguish between the two outcomes across all possible thresholds (sketched in code below).
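
Here is a minimal sketch of that threshold sweep in Python with scikit-learn, reusing the 10 patients from before; the predicted probabilities are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Same 10 patients as above; the predicted probabilities are hypothetical.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.92, 0.55, 0.40, 0.85, 0.60, 0.52, 0.22, 0.20, 0.15, 0.05])

# Accuracy depends on where we put the probability cutoff:
for cutoff in (0.25, 0.50, 0.90):
    y_pred = (y_prob >= cutoff).astype(int)
    print(f"cutoff={cutoff:.2f}  accuracy={(y_pred == y_true).mean():.1f}")
    # -> 0.7, 0.6, and 0.8, respectively

# roc_curve sweeps every useful cutoff and returns, per cutoff, the
# false positive rate (x-axis) and true positive rate (y-axis).
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# The area under that curve summarizes all cutoffs in one number:
print(roc_auc_score(y_true, y_prob))  # ~0.76
```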

(For more examples and visualizations of this metric, please see this blog post by Evidently AI.)

6 Ways to Have Misleadingly High Accuracy

Now that the definitions are out of the way, let’s talk about 6 ways that an algorithm may have high accuracy when it’s actually not doing well – and why that is.

Scenario 1: Presenting ROC-AUC as "accuracy."

  • As noted above, when there is a majority class, the true accuracy may not be as high as the ROC-AUC metric, even though the latter is what’s reported in the pitch deck as “accuracy”! See the sketch below.
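
To illustrate how the two numbers can diverge, here is a small synthetic sketch (all data below is made up): a model that ranks patients almost perfectly, so its ROC-AUC is near 1.0, but whose probabilities all fall below the usual 50% cutoff, so its plain accuracy is just the majority-class rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced cohort: ~90% of patients do not have the condition.
n = 1000
y_true = (rng.random(n) < 0.10).astype(int)

# A hypothetical model that separates the classes almost perfectly by rank,
# but whose predicted probabilities never reach the usual 50% cutoff.
y_prob = np.where(y_true == 1,
                  rng.uniform(0.30, 0.49, n),   # patients with the condition
                  rng.uniform(0.01, 0.30, n))   # patients without it

print(roc_auc_score(y_true, y_prob))                         # ~1.0
print(accuracy_score(y_true, (y_prob >= 0.5).astype(int)))   # ~0.90
```

Pitching that ~1.0 as “accuracy” would overstate the model’s real hit rate by about 10 points here.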

Even if you see 90%+ accuracy, whether the pitch uses the correct definition of accuracy or ROC-AUC, the following things may be happening under the hood to falsely inflate this metric.

Scenario 2: High accuracy due to data leakage!

  • Algorithms with data leakage, caused by training on information that the algorithm shouldn’t have had access to at prediction time, also seem like they’re doing well when that couldn’t be further from the truth! More on that in this blog post, and see the sketch below.
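
As a minimal, self-contained sketch of one common form of leakage (a feature computed from the outcome itself sneaking into the inputs), with entirely synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the real features carry no signal about the label at all.
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)

# Leakage: a "feature" derived from the outcome itself (think: a
# post-diagnosis treatment code) is accidentally included as an input.
X_leaky = np.column_stack([X, y + rng.normal(0, 0.1, size=500)])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # near-perfect accuracy on pure noise
```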
