Advice for AI Startup Diligence

When High Accuracy is not Very Accurate

What Questions to Ask When Evaluating "Accurate" ML/AI Algorithms

Irina Kukuyeva PhD
Feb 03, 2025

In a world where everyone is pitching their “AI” start-up and trying to prove their expertise in AI, the “accuracy” metric is often mentioned. But how accurate is this metric?

There are a few ways to (inadvertently) game the system, resulting in misleadingly high accuracy numbers that overstate how well an ML/AI algorithm is doing. Here’s advice on identifying these potential yellow/red flags, and some questions to ask in diligence to confirm or refute them.

Preliminaries: How do we measure accuracy in algorithms?

Suppose we have two possible outcomes we’re trying to predict: someone having an early onset of a condition – and not.

Accuracy Metric

Accuracy is the percentage of outcomes the algorithm got correct (Wikipedia).

For example, suppose we had 10 patients who wanted to know if they had an early onset of a condition based on information from their smartwatches – without clinical input.

  • Our algorithm flagged 5 of them as having the condition.

  • When all 10 went to their physician for a second opinion, it turned out the algorithm had gotten 4 wrong: 3 of the 5 flagged patients didn’t actually have the condition (false positives), and the algorithm missed 1 patient’s early onset of the condition (a false negative).

  • The algorithm’s accuracy is 60% (= (2 + 4) / 10: 2 patients correctly flagged plus 4 correctly cleared, out of 10 patients; see the code sketch below).
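
To make the arithmetic concrete, here is a minimal Python sketch of the same calculation. The label arrays are a hypothetical encoding of the 10-patient example above (1 = has the condition, 0 = does not):

```python
from sklearn.metrics import accuracy_score

# Hypothetical encoding of the 10-patient example above:
# 1 = has the condition, 0 = does not.
# Ground truth (per the physicians): 3 patients actually have the condition.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# The algorithm flags 5 patients: 2 true positives, 3 false positives,
# and it misses 1 patient who does have the condition.
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))  # (2 + 4) / 10 = 0.6
```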

Incorrectly Pitched as the Accuracy Metric: Area Under the ROC curve

Sometimes, though, people will (incorrectly) refer to the “Area Under the ROC curve” (AKA ROC-AUC) as a measure of accuracy. That may be because it’s easier to talk about colloquially, or because it tends to look better when there is a majority class. But it’s not the same thing! So what is it?

Let’s return to our example of predicting the presence or absence of a condition. The algorithm typically returns the probability (between 0 and 1) of a patient having the condition, and so far we assumed that any probability greater than 50% means the patient has it. But what if we only predicted the condition when that probability was greater than 90%, because we want to be super sure? Or, instead, what if we set the cutoff at 25%, because we want people to get a second opinion and start treatment ASAP? The probability threshold changes who we predict will or won’t have the condition, and with it the percentage of outcomes the algorithm got correct, i.e., the algorithm’s accuracy!

Suppose we choose, say, 11 different probability cutoffs between 0% and 100% (e.g., 0%, 10%, 20%, 30%, …, 90%, and 100%) and, for each cutoff, rerun our predictions of who will and won’t have the condition. We then plot 11 points with the following x-y coordinates:

  • On the X-axis: the fraction of patients who don’t actually have the condition but whom we predicted would (the false positive rate), one point for each threshold;

  • On the Y-axis: the fraction of patients who do have the condition and whom we correctly predicted would (the true positive rate), one point for each threshold;

  • Connect the points;

  • Then, the area below this line is the “Area Under the ROC curve,” which shows how well the algorithm can distinguish between the two outcomes across all possible thresholds (sketched in code below).
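
Here is a minimal sketch of that threshold sweep in Python with scikit-learn, reusing the 10 patients from before; the predicted probabilities are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Same 10 patients as above; the predicted probabilities are hypothetical.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.92, 0.55, 0.40, 0.85, 0.60, 0.52, 0.22, 0.20, 0.15, 0.05])

# Accuracy depends on where we put the probability cutoff:
for cutoff in (0.25, 0.50, 0.90):
    y_pred = (y_prob >= cutoff).astype(int)
    print(f"cutoff={cutoff:.2f}  accuracy={(y_pred == y_true).mean():.1f}")
    # -> 0.7, 0.6, and 0.8, respectively

# roc_curve sweeps every useful cutoff and returns, per cutoff, the
# false positive rate (x-axis) and true positive rate (y-axis).
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# The area under that curve summarizes all cutoffs in one number:
print(roc_auc_score(y_true, y_prob))  # ~0.76
```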

(For more examples and visualizations of this metric, please see this blog post by Evidently AI.)

6 Ways to Have Misleadingly High Accuracy

Now that the definitions are out of the way, let’s talk about 6 ways that an algorithm may have high accuracy when it’s actually not doing well – and why that is.

Scenario 1: Presenting ROC-AUC as "accuracy."

  • As noted above, when there is a majority class, the true accuracy may not be as high as the ROC-AUC metric, even though the latter is what’s reported in the pitch deck as “accuracy”! See the sketch below.
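
To illustrate how the two numbers can diverge, here is a small synthetic sketch (all data below is made up): a model that ranks patients almost perfectly, so its ROC-AUC is near 1.0, but whose probabilities all fall below the usual 50% cutoff, so its plain accuracy is just the majority-class rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced cohort: ~90% of patients do not have the condition.
n = 1000
y_true = (rng.random(n) < 0.10).astype(int)

# A hypothetical model that separates the classes almost perfectly by rank,
# but whose predicted probabilities never reach the usual 50% cutoff.
y_prob = np.where(y_true == 1,
                  rng.uniform(0.30, 0.49, n),   # patients with the condition
                  rng.uniform(0.01, 0.30, n))   # patients without it

print(roc_auc_score(y_true, y_prob))                         # ~1.0
print(accuracy_score(y_true, (y_prob >= 0.5).astype(int)))   # ~0.90
```

Pitching that ~1.0 as “accuracy” would overstate the model’s real hit rate by about 10 points here.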

Even if you see 90%+ accuracy, whether the pitch uses the correct definition of accuracy or ROC-AUC, the following things may be happening under the hood to falsely inflate this metric.

Scenario 2: High accuracy due to data leakage!

  • Algorithms with data leakage, caused by training on information that the algorithm shouldn’t have had access to at prediction time, also seem like they’re doing well when that couldn’t be further from the truth! More on that in this blog post, and see the sketch below.
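
As a minimal, self-contained sketch of one common form of leakage (a feature computed from the outcome itself sneaking into the inputs), with entirely synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: the real features carry no signal about the label at all.
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)

# Leakage: a "feature" derived from the outcome itself (think: a
# post-diagnosis treatment code) is accidentally included as an input.
X_leaky = np.column_stack([X, y + rng.normal(0, 0.1, size=500)])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # near-perfect accuracy on pure noise
```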
