Multi Armed Bandits (MAB) - The Theory
Written by James Harber
Updated over a year ago

How traditional experimentation works

Traditional experimentation usually operates on a two-phase approach. You run a "fair experiment", e.g. an even 33% split across 3 variations, until your test reaches a conclusion, and only then is the winner shown to all users.

There are some benefits to this. Fair testing, for one, relies on an even distribution of traffic amongst your variations, with Sample Ratio Mismatch (SRM) a well-understood concern to monitor while your tests run. Randomisation is a cornerstone of good and fair experimentation.

Another is the impact of seasonality on your variations. Let's say Variation 1 converts well in the morning and poorly in the evening, while Variation 2 does the opposite. Given a whole day of data, each gets an even chance of exposure and their impact evens out. This is critical when the goal post-test is to serve a single variation to all targeted users.

There are drawbacks, though. Time to exploitation - how long it takes for you to reap the rewards of a winning variation - is challenging with a traditional A/B test.

In a traditional A/B test, you might have to wait until the test is concluded before implementing the winning variation for all users. This waiting period can be a significant bottleneck, especially when a clear winner has emerged early in the testing phase. It means that you're missing out on potential gains that could be achieved by quickly adopting the successful variation.

This is what the multi-armed bandit seeks to improve.

Multi Armed Bandits

The multi-armed bandit approach addresses the time to exploitation issue by dynamically adjusting the traffic allocation based on the real-time performance of each variation.


Instead of waiting for the test to conclude, the multi-armed bandit algorithm continuously optimizes the traffic distribution, favoring variations that show promising results. This adaptive approach allows for quicker exploitation of successful variations, maximizing the benefits of the experiment in a more agile manner. In essence, the multi-armed bandit is designed to strike a balance between exploration (testing less-proven variations) and exploitation (allocating more traffic to the current best performer) throughout the testing process.
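
To make the exploration/exploitation trade-off concrete, here is a minimal sketch of one widely used bandit algorithm, Thompson sampling, applied to conversion data. The variation names and "true" conversion rates below are hypothetical, and this is not necessarily how Webtrends Optimize allocates traffic - its approach is described later in this article.

```python
import random

# Hypothetical "true" conversion rates, purely for illustration.
TRUE_RATES = {"Control": 0.10, "Variation 1": 0.095, "Variation 2": 0.11}

# Beta(1, 1) prior per variation, tracked as [alpha, beta].
posteriors = {name: [1, 1] for name in TRUE_RATES}

for _ in range(10_000):
    # Exploration and exploitation in one step: sample a plausible conversion
    # rate from each posterior and serve the variation with the highest sample.
    sampled = {name: random.betavariate(a, b) for name, (a, b) in posteriors.items()}
    chosen = max(sampled, key=sampled.get)

    # Simulate whether this visitor converts, then update the chosen arm only.
    converted = random.random() < TRUE_RATES[chosen]
    posteriors[chosen][0 if converted else 1] += 1

for name, (a, b) in posteriors.items():
    served = a + b - 2
    rate = (a - 1) / served if served else 0.0
    print(f"{name}: served {served} visitors, observed conversion rate {rate:.3f}")
```

Running this, most of the 10,000 visitors end up on the best-performing variation, while the weaker arms still receive occasional traffic so the algorithm can notice if their performance changes.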

The primary time we suggest employing these is when you have content to test and only a short window for exploitation. During a Black Friday sale, for example, users behave in unique ways. If running a test, you'd want to find the best-performing content as quickly as possible, and should appreciate that past learnings may not apply during the sale period.

The key benefit of the Multi-Armed Bandit is time to value. Instead of running an experiment for 2 weeks, or to 100k users, before seeing any return, you can start shifting traffic in favour of winning variations as soon as there is data to support a skew.

The key limitation of the Multi-Armed Bandit approach is that at small volumes, tiny changes in behaviour can look drastic. 1 conversion vs. 0, or even 19 vs. 20, can look like a substantial difference in conversion rate. As always, lower-traffic websites need to be far more cautious with this approach, and with the thresholds at which skews are applied.
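
As a quick illustration of how little separates such results (the 500 views per variation below is an assumption of ours, not a figure from this article), the snippet estimates how likely it is that the variation showing 19 conversions is actually the better one. At these volumes the data barely separates the two, even though the headline conversion rates look different.

```python
import random

# Assumed sample sizes for illustration: 500 views per variation.
views_a, conv_a = 500, 20
views_b, conv_b = 500, 19

draws = 100_000
b_wins = sum(
    random.betavariate(1 + conv_b, 1 + views_b - conv_b)
    > random.betavariate(1 + conv_a, 1 + views_a - conv_a)
    for _ in range(draws)
)
print(f"P(the 19/500 variation is actually better): {b_wins / draws:.0%}")  # roughly 40-45%
```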

Another consideration is the balance between exploration (safely gathering data to analyse) and exploitation (making the most of the insights you've gathered). The more data you gather, the more accurate a decision you can make about what others should see; but, at the same time, this limits your ability to exploit and benefit from the situation. Multi-armed bandits seek to strike a balance between these two levers.

Multi Armed Bandits statistics in Webtrends Optimize

Information accurate as of Nov 2023.

In Webtrends Optimize, you begin by selecting a key metric.

Once done, the Multi-Armed Bandit approach optimizes towards this metric, calculating each variant's Chance to Beat All on a scheduled basis.

This approach, which focuses more on exploitation than exploration, skews traffic to each variant based directly on its comparative performance.

Consider the following views and conversion counts:

  • Control: 1000 views, 100 conversions

  • Variation 1: 1000 views, 95 conversions

  • Variation 2: 1000 views, 101 conversions

This would produce the following Chance to Beat All values (a sketch of how such figures can be estimated follows the list):

  • Control: 38%

  • Variation 1: 19%

  • Variation 2: 43%
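
As a hedged illustration, figures like these can be approximated with a Beta-Bernoulli Monte Carlo simulation: sample a plausible conversion rate for each variant from its posterior many times, and count how often each variant comes out on top. The function below is our sketch of that idea, not Webtrends Optimize's implementation, and its output will land close to, rather than exactly on, the figures quoted above.

```python
import random

def chance_to_beat_all(counts, draws=100_000):
    """Estimate P(variant has the highest conversion rate) for each variant.

    counts: dict of variant name -> (views, conversions).
    Each variant gets a Beta(1 + conversions, 1 + views - conversions) posterior.
    """
    wins = {name: 0 for name in counts}
    for _ in range(draws):
        sampled = {
            name: random.betavariate(1 + conv, 1 + views - conv)
            for name, (views, conv) in counts.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: wins[name] / draws for name in counts}

ctba = chance_to_beat_all({
    "Control": (1000, 100),
    "Variation 1": (1000, 95),
    "Variation 2": (1000, 101),
})
for name, p in ctba.items():
    print(f"{name}: {p:.0%}")  # approximately 38% / 19% / 43%
```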

The multi-armed bandit approach would apply these percentages as the traffic allocation to the respective variations, moving away from 33.3% each to:

  • Removing a lot of traffic from the underperforming Variation 1

  • Adding some of this traffic to the Control group

  • Adding more of this traffic to the better-performing Variation 2

The hope with this approach is that, as trends continue in this direction, the skews push further and further towards an absolute winner. With the underperforming variation getting less traffic, the better-performing variation has more traffic with which to prove its worth and its improvement over the control group.
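
The sketch below simulates that feedback loop under hypothetical true conversion rates; the rates, visitors per period and recalculation schedule are assumptions for illustration, not platform settings. Each period serves traffic according to the current split, re-estimates Chance to Beat All with the same Beta-Bernoulli approach as in the earlier sketch, and uses the result as the next period's allocation.

```python
import random

# Hypothetical true conversion rates, chosen only to illustrate the mechanism.
TRUE_RATES = {"Control": 0.100, "Variation 1": 0.095, "Variation 2": 0.101}
VISITORS_PER_PERIOD = 3000
PERIODS = 10
DRAWS = 20_000

def chance_to_beat_all(counts, draws=DRAWS):
    # counts: variant -> [views, conversions]; Beta-Bernoulli Monte Carlo estimate.
    wins = {name: 0 for name in counts}
    for _ in range(draws):
        sampled = {
            name: random.betavariate(1 + conv, 1 + views - conv)
            for name, (views, conv) in counts.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: wins[name] / draws for name in counts}

counts = {name: [0, 0] for name in TRUE_RATES}
allocation = {name: 1 / len(TRUE_RATES) for name in TRUE_RATES}  # start at an even split

for period in range(1, PERIODS + 1):
    # Serve this period's traffic according to the current allocation.
    for name, share in allocation.items():
        views = int(VISITORS_PER_PERIOD * share)
        conversions = sum(random.random() < TRUE_RATES[name] for _ in range(views))
        counts[name][0] += views
        counts[name][1] += conversions

    # Scheduled recalculation: Chance to Beat All becomes the new allocation.
    allocation = chance_to_beat_all(counts)
    print(f"Period {period}: " + ", ".join(f"{n}: {p:.0%}" for n, p in allocation.items()))
```

Over successive periods, the allocation typically drifts towards the variation with the highest underlying rate, while the weakest variation's share shrinks, mirroring the behaviour described above.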

What happens if behaviour changes over time?

If trends do not remain constant, and behaviour flips, you'll find that the Conversion Rate and consequent Chance to Beat All values will change alongside this.

Changes in behaviour will show up more quickly in higher-traffic variations, and so these swings will be easier to detect and respond to.

So, even if a variation at some point holds 50%+ of traffic for itself, a change in the data and trends could cause it to drop back to a much smaller traffic share.
