Understanding False Discovery Rates

How to analyse statistical significance, confidence intervals and chance to beat control on AA and AB tests

Written by James Harber
Updated over 10 months ago

How can we make sense of data that looks like this?

Confidence Level


When analysing data from tests, the first thing to decide upon is your "Confidence Level". This is your tolerance for risk: the higher the value, the lower your tolerance for risk, and the more certain you can be that any significant test results are due to real differences between the experiments rather than random chance. The default value is 95%, but this is by no means mandatory, and it can be toggled on the fly from within the reporting screen.

A 95% confidence level is two-tailed, which means that significance is achieved when the chance to beat control is either below 2.5% or above 97.5%. It is also a statement of risk tolerance: 95% of the time, or 19/20 times, our results will be "correct", and 5% of the time, or 1/20 times, we will get an "incorrect" result due to the inherent randomness of statistical sampling.
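Below is a minimal sketch, in Python, of how a "chance to beat control" figure can be estimated for two conversion rates and then checked against a two-tailed 95% confidence level. It uses a simple normal approximation; the function name, inputs and numbers are illustrative assumptions, not the platform's actual implementation.

from math import sqrt
from statistics import NormalDist

def chance_to_beat_control(control_conversions, control_views,
                           variant_conversions, variant_views):
    """Approximate P(variant rate > control rate) using a normal approximation."""
    p_c = control_conversions / control_views
    p_v = variant_conversions / variant_views
    se = sqrt(p_c * (1 - p_c) / control_views + p_v * (1 - p_v) / variant_views)
    return NormalDist().cdf((p_v - p_c) / se)

ctbc = chance_to_beat_control(200, 10_000, 230, 10_000)  # hypothetical counts
significant = ctbc < 0.025 or ctbc > 0.975  # two-tailed check at a 95% confidence level
print(f"chance to beat control = {ctbc:.1%}, significant = {significant}")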

When analysing test results, some level of judgement needs to be applied. Any choice of confidence level is a trade-off between the certainty that you are right, the length of time needed to run the test, and the possibility of wrongly rejecting genuine winners.

Application to data analysis


In terms of an AB test with a 95% confidence level, you would expect any significant result to be truly significant 19/20 times; the other 1/20 times it would be a false positive or false negative, i.e. you might declare a winner when there was actually no real difference.

In terms of an AA test, we already know that the "correct" result is no significance, because we are not doing anything to the page. With a 95% confidence level, we would expect to see a false positive or false negative 5% of the time, i.e. a statistically significant result when there is no real difference. That is to say, 1 in 20 AA tests will show a statistically significant result for your KPI. If the metrics can be assumed to be independent, and you run an AA test with around 20 metrics, we would reasonably expect to see one of them showing as statistically significant.
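To make the arithmetic behind this concrete, here is a quick back-of-the-envelope check in Python, a sketch assuming the 20 metrics are independent and each has a 5% false positive rate on an AA test:

alpha = 0.05       # 1 - confidence level
n_metrics = 20

expected_false_positives = n_metrics * alpha        # = 1.0
prob_at_least_one = 1 - (1 - alpha) ** n_metrics    # ≈ 0.64

print(f"Expected false positives among {n_metrics} metrics: {expected_false_positives:.1f}")
print(f"Chance of seeing at least one: {prob_at_least_one:.0%}")

So on average about one metric shows as significant, and in roughly two thirds of such AA tests at least one metric will.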

This can be generalised by applying the same logic to other confidence levels.
For example, for any metric on an AA test, we would expect its chance to beat control to be (see the simulation sketch after this list):

  • between 25%-75% around half of the time.

  • lower than 10% or higher than 90% around 1/5 of the time.

  • lower than 2.5% or higher than 97.5% (significant) around 1/20 of the time.
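The fractions above follow from the fact that, on an AA test, the chance to beat control is roughly uniformly distributed. The following Python simulation is an illustrative sketch, not how the reporting engine computes its figures; the traffic volume and base conversion rate are assumptions:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
views, base_rate, runs = 10_000, 0.02, 100_000

# Control and "variant" are drawn from the same true conversion rate (an AA test).
conv_a = rng.binomial(views, base_rate, runs) / views
conv_b = rng.binomial(views, base_rate, runs) / views
se = np.sqrt(conv_a * (1 - conv_a) / views + conv_b * (1 - conv_b) / views)
ctbc = norm.cdf((conv_b - conv_a) / se)  # chance to beat control for each run

print("25%-75%:        ", np.mean((ctbc >= 0.25) & (ctbc <= 0.75)))  # ~0.5
print("<10% or >90%:   ", np.mean((ctbc < 0.10) | (ctbc > 0.90)))    # ~0.2
print("<2.5% or >97.5%:", np.mean((ctbc < 0.025) | (ctbc > 0.975)))  # ~0.05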

Example

Let's look at the data from the first screenshot again.


At first glance, the metric PAGE_ENQUIRY looks like it has a promising lift across all experiments. Experiment 5 in particular shows a 6% lift with a 91% chance to beat control.

In reality, these values are taken from an AA test, and there is no difference between any of the experiments and the control. Looking at the data another way, 5 out of 8 of the chances to beat control fall between 25% and 75%, which is about half. 1 of the 8 is either below 10% or above 90%, which is roughly the 1 in 5 we would expect. None of the chances to beat control are significant, which is also what we would expect: there is around a 1 in 20 chance of this happening and we only have 8 values.
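A small tally in the same spirit is sketched below. The eight values are hypothetical stand-ins chosen to match the counts described above (only the 91% figure for Experiment 5 is quoted in the text), so treat them as illustration rather than the actual screenshot data:

ctbc_values = [0.34, 0.58, 0.47, 0.91, 0.62, 0.71, 0.18, 0.82]  # hypothetical

middling = sum(0.25 <= v <= 0.75 for v in ctbc_values)
extreme = sum(v < 0.10 or v > 0.90 for v in ctbc_values)
significant = sum(v < 0.025 or v > 0.975 for v in ctbc_values)

print(f"{middling}/8 between 25%-75%")        # expect about half
print(f"{extreme}/8 below 10% or above 90%")  # expect about 1 in 5
print(f"{significant}/8 significant")         # expect about 1 in 20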

These results are completely within expectation for an AA test, which is why it is very important to stick strictly to your confidence level when analysing test results, and to wait for significance before drawing any conclusions.

Best practices when drawing conclusions

  1. Use a high enough confidence level.

  2. Make sure that statistical significance has been reached.

  3. Run the test for a reasonable period of time - minimum 2 weeks, but longer is better.

  4. Run the test until you have a reasonable level of traffic and data - minimum 1,000 views and 200 conversions per experiment, but more is better (see the sketch after this list). If this is difficult to achieve on the primary KPI, then look to track behaviour and intent and look for trends and direction.

  5. Check the stabilisation chart to see that the lines run parallel.

  6. Check that the trend is consistently observed over different time periods.
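As referenced in point 4, here is a minimal sketch of a pre-analysis check covering points 3 and 4. The thresholds mirror the guidance above; the function and data structure are illustrative assumptions, not part of the product:

MIN_DAYS, MIN_VIEWS, MIN_CONVERSIONS = 14, 1_000, 200

def ready_to_analyse(days_running, experiments):
    """experiments maps experiment name -> (views, conversions)."""
    if days_running < MIN_DAYS:
        return False, f"Only {days_running} days of data; run for at least {MIN_DAYS}."
    for name, (views, conversions) in experiments.items():
        if views < MIN_VIEWS or conversions < MIN_CONVERSIONS:
            return False, f"{name} has only {views} views / {conversions} conversions."
    return True, "Minimum run time and traffic thresholds are met."

ok, reason = ready_to_analyse(16, {"Control": (4_200, 310), "Experiment 1": (4_150, 295)})
print(ok, "-", reason)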
