Product teams are under growing pressure to boost productivity, whether that's to outpace the competition or to stretch resources further in an unstable market. The result? An experimentation mindset and testing have gone mainstream, with many teams turning to statistical models to help decide what to build. A/B testing in particular has become synonymous with product experimentation, but despite its widespread adoption among data-driven product teams, it's not without its flaws.
Spending resources on unproven ideas
When it comes to product experimentation, you'll often hear that failure is good and that failure leads to learning. It's linked to the lean startup approach of build first, then measure and learn, and although well-meaning, it can lead to a process of brute-force trial and error, particularly through A/B testing. Why does that matter? On average, only 1 in 7 A/B tests produces a statistically significant winner (according to studies from Convert and VWO). Companies with small teams and limited budgets don't have the luxury of spending resources on trial and error. That's why product experimentation is often associated with the 'big boys' - large cross-functional teams with dedicated data scientists and engineers working on optimizing experiments. But that doesn't necessarily have to be the case - more on that later.
Frustrated teams
Regardless of company size, no one likes the feeling of pouring money down the drain. Engineering resources are scarce and need to be allocated across requests from senior leadership, bug fixes, customer requests, sales initiatives, UX improvements…the list goes on. Engineers quickly get frustrated if they're making lots of small iterative changes, only for the majority of them to fail and be scrapped a few weeks later. And if we're optimizing for building quickly, the codebase can become bloated and harder to maintain. With more moving parts involved, tests become harder to run, and the end product can begin to slow down. But it's not just an issue of wasted engineering resources and frustrated engineers. Experiments often create friction between product and leadership, who witness product changes being rolled back and minimal movement on metrics. In short, the whole process can feel slow, causing organizations to become skeptical of experimentation culture as a whole. So how do we create a process that brings together different departments within the organization?
A/B testing doesn’t scale
Experimenting live doesn't scale easily. Teams often run multiple experiments in parallel, but this quickly becomes a coordination effort to make sure one experiment doesn't interfere with the results of another. That overhead is why it's important not to test every possible idea, but to gather evidence that helps you prioritize the ideas predicted to be most successful with the highest confidence.
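To give a sense of what that coordination involves, one common pattern (a general technique, not something from the studies cited above) is to split traffic into mutually exclusive experiment layers using deterministic hashing, so each user is only ever exposed to one experiment per layer. The sketch below is a minimal, hypothetical Python example; the layer and experiment names are made up for illustration.

```python
# Hypothetical sketch: mutually exclusive experiment layers via deterministic hashing.
# Layer and experiment names are illustrative only.
import hashlib

LAYERS = {
    "checkout_layer": ["one_click_checkout", "trust_badges"],
    "onboarding_layer": ["shorter_signup", "guided_tour"],
}

def bucket(user_id: str, salt: str, buckets: int) -> int:
    """Deterministically map a user to a bucket for a given salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def assignments(user_id: str) -> dict[str, str]:
    """Assign the user to exactly one experiment per layer, with a 50/50 arm split."""
    result = {}
    for layer, experiments in LAYERS.items():
        # The layer hash picks which experiment this user sees in that layer...
        experiment = experiments[bucket(user_id, layer, len(experiments))]
        # ...and a second, experiment-specific hash picks control vs treatment.
        arm = "treatment" if bucket(user_id, experiment, 2) else "control"
        result[experiment] = arm
    return result

print(assignments("user-42"))
# e.g. {'trust_badges': 'control', 'guided_tour': 'treatment'}
```

Because assignment is deterministic, the same user always lands in the same experiments, and experiments within a layer never overlap on the same user.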
Creating meaningful results is not easy
The results aren't always clear either. Those who regularly run A/B tests know the 'peeking problem': checking results before the experiment has run its full course and treating the first sign of an uplift as a real win, when it's often just a false positive. And even if there is a genuine short-term uptick in key metrics, what does it mean for the long-term trend? You don't want to cannibalize long-term success for a few quick wins in the short term. It's important to look beyond individual product interactions and take into account an entire sequence of user activity. Taking this global view helps to infer latent behaviors and predict the compound effect of small iterative experiments on long-term KPIs.
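The peeking problem is easy to demonstrate with a small simulation. The sketch below (an illustrative example, not drawn from the studies cited earlier) runs thousands of A/A tests where there is no real difference between the variants, then compares the false positive rate when you stop at the first 'significant' interim check against testing once at the planned sample size.

```python
# Illustrative A/A simulation of the peeking problem.
# Both variants have the same true conversion rate, so every "win" is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

N_EXPERIMENTS = 2000        # simulated A/A tests
N_PER_VARIANT = 10_000      # planned sample size per variant
BASE_RATE = 0.05            # true conversion rate for both variants
PEEKS = range(1_000, N_PER_VARIANT + 1, 1_000)  # interim checks every 1,000 users

def p_value(conversions_a: int, conversions_b: int, n: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conversions_a + conversions_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (conversions_a - conversions_b) / n / se
    return 2 * stats.norm.sf(abs(z))

peeking_false_positives = 0   # stop at the first "significant" interim check
final_false_positives = 0     # test once, at the planned sample size

for _ in range(N_EXPERIMENTS):
    a = rng.random(N_PER_VARIANT) < BASE_RATE
    b = rng.random(N_PER_VARIANT) < BASE_RATE

    if any(p_value(a[:n].sum(), b[:n].sum(), n) < 0.05 for n in PEEKS):
        peeking_false_positives += 1
    if p_value(a.sum(), b.sum(), N_PER_VARIANT) < 0.05:
        final_false_positives += 1

print(f"False positive rate with peeking:    {peeking_false_positives / N_EXPERIMENTS:.1%}")
print(f"False positive rate without peeking: {final_false_positives / N_EXPERIMENTS:.1%}")
```

With ten interim checks per test, the peeking strategy declares a 'winner' far more often than the nominal 5% significance level would suggest, which is exactly how teams end up shipping changes that never really moved the metric.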
So, what’s the solution?
Instead of 'build, measure, learn', we think the future is to learn, simulate, and then measure. At the end of the day, no one wants to wait weeks, and in some cases months, to find out whether an experiment was even a good idea in the first place - using up resources in the process.
Ideally, you want to be able to triage before pushing tests live on your user base and exposing them to low-performing variants. It's at this stage that you can tweak your experiments or decide whether they're worth pursuing at all.
This is something you can do with our very own simulation environment, by simulating experiments before implementing them live. Train AI models on your product interaction data while taking into account context beyond standard demographics, such as personality traits and cultural background, which influence how different types of user profiles interact with the same product. We can help you create deep synthetic user profiles to test your assumptions against. Scope out test variants and get an indicator of the impact these changes will have on your user base, along with the confidence level of the prediction. By modeling out changes, you have a proof point to help you prioritize which experiments to implement and test live, as well as evidence to help get buy-in from your team.
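To make the idea concrete, here's a deliberately simplified, hypothetical sketch in plain Python. It is not our simulation environment or its API; it just illustrates the workflow: synthetic user profiles carry context beyond demographics, a stand-in model scores each variant per profile, and the aggregate gives a rough predicted lift with an uncertainty estimate you could use to prioritize which variants to test live.

```python
# Hypothetical, simplified illustration of simulation-before-testing.
# The profile attributes, variant names, and scoring model are all made up.
import random
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class SyntheticUser:
    """A synthetic user profile with context beyond standard demographics."""
    age_band: str
    risk_tolerance: float     # illustrative personality trait, 0..1
    price_sensitivity: float  # illustrative behavioral trait, 0..1

def predicted_conversion(user: SyntheticUser, variant: str) -> float:
    """Stand-in for a model trained on product interaction data."""
    base = 0.05
    if variant == "one_click_checkout":
        # Assumed effect: helps impulsive, less price-sensitive users most.
        base += 0.02 * user.risk_tolerance - 0.01 * user.price_sensitivity
    return max(base, 0.0)

def simulate(variant: str, users: list[SyntheticUser], runs: int = 200) -> tuple[float, float]:
    """Monte Carlo estimate of the conversion rate for a variant across profiles."""
    rates = []
    for _ in range(runs):
        conversions = sum(random.random() < predicted_conversion(u, variant) for u in users)
        rates.append(conversions / len(users))
    return mean(rates), stdev(rates)

# Build a synthetic user base and compare a candidate variant against the control.
random.seed(7)
users = [
    SyntheticUser(
        age_band=random.choice(["18-24", "25-34", "35-54", "55+"]),
        risk_tolerance=random.random(),
        price_sensitivity=random.random(),
    )
    for _ in range(5_000)
]

control_rate, control_sd = simulate("control", users)
variant_rate, variant_sd = simulate("one_click_checkout", users)
print(f"Control: {control_rate:.2%} ± {control_sd:.2%}")
print(f"Variant: {variant_rate:.2%} ± {variant_sd:.2%}")
print(f"Predicted lift: {(variant_rate - control_rate) / control_rate:+.1%}")
```

The output of a sketch like this is not a verdict, it's a triage signal: variants with a weak or highly uncertain predicted lift can be reworked or dropped before they ever consume live traffic and engineering time.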