I’ve previously shared advice on why you may want to be skeptical of high accuracy and retention metrics; you’ve probably read about the start-up inflating its customer metrics. Here’s one more to dig into with diligence. 😀
It’s becoming increasingly difficult for early-stage start-ups to secure funding. With technology making it easier to develop prototypes in days, there is intense competition for fewer investors, who now expect the kind of traction that, a few years ago, would have been reserved for later rounds.
One way a start-up can demonstrate traction is by showing increased customer usage before and after introducing a feature. If the feature solves a critical customer pain point, its metrics should look good, even if the solution is overengineered. Here are two ways the customer engagement metric can be inflated.
Scenario 1: High Traction becomes the One Metric that Matters – Above All Else
One way to achieve high traction is to prioritize it, even at the expense of everything else — a phenomenon also known as Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. Often, there are catastrophic side effects.
We’ve all heard the news story about Wells Fargo opening over 2 million fake accounts in its customers’ names, largely due to excessive sales quotas. These Redditors share other examples of this law in their employers’ businesses.
Start-ups are not immune to this either. To attract more businesses to its platform ahead of its Series F funding, DoorDash scraped restaurant websites and, in one documented instance, sold pizzas to customers for $16 apiece while paying the restaurant $24 for each one.
When traction seems “too good to be true,” consider digging into any incentives around the metric. Ideally, you’d also be able to dive into the LTV/CAC ratio (by cohort), along with how it’s calculated, to understand market demand for the product during diligence. Though, as you know, many early-stage start-ups won’t have enough sales for it to be a stable metric, and will mention retention instead. (I dive into ways that 30-day retention by cohort can be (inadvertently) inflated in this blog post.)
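To make the cohort-level check concrete, here is a minimal sketch of an LTV/CAC calculation by cohort. All figures (revenue, margin, lifetime, CAC) are made up for illustration, and “LTV” here is a simple gross-margin proxy; real diligence should use the start-up’s own definitions.

```python
# Hypothetical LTV/CAC check by cohort; every number below is illustrative,
# not from any real start-up. LTV here is a simple gross-margin-based proxy.
cohorts = {
    # cohort: (avg monthly revenue per customer, gross margin,
    #          avg customer lifetime in months, acquisition cost per customer)
    "2024-Q1": (50.0, 0.70, 18, 400.0),
    "2024-Q2": (45.0, 0.65, 12, 550.0),
}

for cohort, (arpu, margin, lifetime_months, cac) in cohorts.items():
    ltv = arpu * margin * lifetime_months   # lifetime gross profit per customer
    ratio = ltv / cac
    print(f"{cohort}: LTV ${ltv:,.0f}, CAC ${cac:,.0f}, LTV/CAC {ratio:.1f}")
```

A healthy later cohort should hold or improve the ratio; in this made-up example, the Q2 cohort’s ratio drops below 1, a sign that growth is being bought rather than earned.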
Scenario 2: High Traction When the Baseline Is Missing
Another way to achieve product traction is to assume that our customers, patients, or prospective clients cannot schedule appointments on demand via a calendar widget on the product’s app or website. Then, if our product allows them to finally do so, even if it’s using AI agents, its performance should look good!
Scheduling rates after the AI agent launches should be higher than before, because no self-serve scheduling option existed previously! Had we compared the AI agent’s scheduling performance to scheduling rates with a plain calendar widget, I would expect the agent to do worse (!): the calendar widget is a fraction of the cost and time to go live, and it doesn’t need to account for the nuances and diversity of natural language.
You may think this is a contrived example, but I’ve done diligence on multiple start-ups that use this as their GTM strategy: start with an AI agent for scheduling, then expand from there. Whatever the start-up builds, it will need to fix and maintain as things break for customers, and products that are overengineered from the beginning are harder to scale, whether an LLM wrote the code or not.
Advice
When every start-up seems to be pitching costly “AI” solutions, consider evaluating the following in diligence:
How was the baseline metric defined and calculated, to better understand what was/wasn’t previously available?
What is the added value of the “AI agent” solution the start-up is pitching over a calendar widget (or similar)?
What LLM(s) are they using for the agent(s)?
How much do the LLMs cost per month? What are the (other) cloud computing costs of hosting the solution?
What is the time-to-market?
[You may need an expert to help you evaluate.] What is the technology stack for the solution? Can they share an architecture diagram for the product?
[You may need an expert to help you evaluate.] How are they guaranteeing: (1) accurate and reproducible results, and (2) results that account for the nuances and diversity of the (spoken) English language?
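For the cost questions above, a quick back-of-envelope sketch can anchor the conversation. Everything here is a hypothetical assumption — the call volumes, token counts, and per-token price are placeholders, not vendor quotes:

```python
# Back-of-envelope monthly LLM cost for a hypothetical AI scheduling agent.
# All volumes and prices are assumptions for illustration only.
calls_per_month = 10_000          # scheduling conversations handled
turns_per_call = 8                # LLM round-trips per conversation
tokens_per_turn = 1_500           # prompt + completion tokens, combined
price_per_million_tokens = 5.00   # assumed blended $ rate per 1M tokens

monthly_tokens = calls_per_month * turns_per_call * tokens_per_turn
llm_cost = monthly_tokens / 1_000_000 * price_per_million_tokens
print(f"~{monthly_tokens / 1e6:.0f}M tokens/month, ${llm_cost:,.0f}/month in LLM fees")

# A hosted calendar widget, by contrast, is typically a flat SaaS fee
# that does not grow with conversation volume.
```

The point isn’t the exact numbers; it’s that LLM costs scale with usage while a calendar widget’s don’t, so ask the start-up to walk you through their version of this math.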
Good luck!