Horror story 1: Variation and the case of the flipping coin
One travel website wanted to redesign its entire booking funnel. The plans were drawn up and the design was put live to 50% of its visitors in order to gauge its effectiveness.
Early results looked great and suggested a fairly significant revenue uplift. Within three days, the manager was asking why the new design was not being shown to the other 50% of visitors, since the data collected so far looked so good.
So the new funnel was thrown into the site code for everyone, despite the small sample size.
A couple of weeks later, that same manager was left wondering why revenues had fallen significantly. The new booking funnel was suspected, and finally the A/B test was run to full sample size.
The actual result of the full test was a 10% drop in conversion rate.
Moral of the story
Ever flipped a coin to find out who gets that last piece of cake? The first few times you flip it, it’s quite possible that you would get all heads or all tails. But if you flipped it 100 times, the split would come out much closer to 50/50. The more data a test collects, the more accurate its estimate becomes and the less random the results.
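If you want to see the coin settle down for yourself, here is a minimal Python sketch. It repeats a coin-flipping experiment many times and reports the lowest and highest share of heads it saw. The function name, the number of repeats and the seed are illustrative choices, not part of any real testing tool.

```python
import random

def head_share_range(flips, runs=500, seed=42):
    """Lowest and highest share of heads across `runs` repeated
    experiments of `flips` fair coin flips each (illustrative helper)."""
    rng = random.Random(seed)
    shares = []
    for _run in range(runs):
        heads = sum(rng.random() < 0.5 for _flip in range(flips))
        shares.append(heads / flips)
    return min(shares), max(shares)

for flips in (5, 100, 2000):
    low, high = head_share_range(flips)
    print(f"{flips:>5} flips: share of heads ranged from {low:.2f} to {high:.2f}")
```

The gap between the lowest and highest share should shrink as the number of flips per experiment grows, which is the same reason a test looks wilder in its first days than at full sample size.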
A test needs a big enough sample and a long enough run to shift the balance away from statistical randomness. Looking at test results too early, or every day, means you might see a strong uplift before a meaningful sample has been collected. Variation is especially pronounced at the start of a test.
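The danger of checking every day can be sketched with a simulated A/A test, where both versions are identical and any ‘win’ is pure noise. The sketch below assumes a 10% conversion rate, 14 daily looks, 100 visitors per version per day and a simple two-sided z-test at the usual 1.96 cut-off. All of those numbers and function names are made up for illustration.

```python
import math
import random

def significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided z-test on two conversion counts, n visitors per version."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False  # no conversions at all (or all converted): nothing to test
    return abs(conv_a - conv_b) / n / se > z_crit

def peeking_rates(runs=400, looks=14, per_look=100, rate=0.10, seed=1):
    """Share of A/A tests flagged at ANY look versus at the final look only."""
    rng = random.Random(seed)
    flagged_any = flagged_final = 0
    for _run in range(runs):
        a = b = n = 0
        hit = last = False
        for _look in range(looks):
            # Both versions convert at the same true rate, so there is no real effect.
            a += sum(rng.random() < rate for _v in range(per_look))
            b += sum(rng.random() < rate for _v in range(per_look))
            n += per_look
            last = significant(a, b, n)
            hit = hit or last
        flagged_any += hit
        flagged_final += last
    return flagged_any / runs, flagged_final / runs

any_look, final_only = peeking_rates()
print(f"false wins when peeking daily: {any_look:.0%}, at the end only: {final_only:.0%}")
```

Because the final look is one of the daily looks, the peeking figure can never be lower than the end-only figure, and in practice it tends to be several times higher.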
So, it’s back to the drawing board! But testing properly saved this company a significant amount of money.
Horror story 2: The tale of false positives and the pregnant man
Multivariate testing (or MVT) is a common method used alongside A/B testing. Various combinations of site elements (colour, positioning, copy and so on) are served to segments of visitors in an attempt to find the most effective combination.
One company undertook MVT with 100 variants and waited until some results reached statistical significance, (apparently) following testing best practice!
A few variants won, and the ecommerce director happily decided to implement one of them in the site code. Yet at the end of the year, they had trouble proving any revenue uplift despite the win.
When they came to us, we re-ran the tests as a validation phase, and the ‘winners’ either had zero impact or were genuinely damaging conversion. Result of the original test: a loss of time and money.
Moral of the story
Without validating your winning variations through a second testing phase, you don’t know whether they won because of a true effect or because of random variation.
The chance of a false positive is high when you test this many variations: something will ‘win’ just by chance. It’s the same reason that if a man took a pregnancy test 1,000 times, at least one of them would almost certainly come back as a false positive!
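The arithmetic behind ‘something will win by chance’ is short enough to sketch in Python. This assumes every variation is compared independently at the same significance threshold and that none of them has any real effect. The 5% threshold and the function names are assumptions for illustration.

```python
def chance_of_false_win(variations, alpha=0.05):
    """Probability that at least one of `variations` no-effect comparisons
    looks significant, assuming they are independent."""
    return 1 - (1 - alpha) ** variations

def bonferroni_threshold(variations, alpha=0.05):
    """A stricter per-variation threshold that keeps the overall
    false positive chance at or below `alpha`."""
    return alpha / variations

for variations in (1, 10, 100):
    print(
        f"{variations:>3} variations: "
        f"chance of a false win {chance_of_false_win(variations):.1%}, "
        f"corrected threshold {bonferroni_threshold(variations):.4f}"
    )
```

At a 5% threshold a single comparison has a 5% chance of a false win, while 100 unrelated comparisons have roughly a 99% chance of producing at least one. Tightening the threshold helps, but a second validation phase is still the more direct check.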
If you notice you’ve had lots of winning tests but no increase in your bottom line, ask your testing company: what is your false positive rate?
Horror story 3: Early data and the sad story of leaving at half-time
A different travel firm invested thousands in a new product page design and put it live as a 50/50 experiment.
After the first week, management were calling to halt the test: conversion rates had crashed! But if they stopped before the test had reached adequate statistical power, they would have no way of knowing the true effect of the new design.
A disgruntled marketer did the calculations and, just before the plug was pulled, showed that the probability of the negative result being a true one was less than 30%.
They showed their working to management and were given the opportunity to let the test run to completion.
The new product page design was shown to be conclusively more successful, and the company decided to tie its marketing KPIs to the revenue delivered rather than to the uplift reported by individual tests.
Moral of the story
Early data collected before significance is likely to be the result of random variation. Stopping an experiment before it has reached its planned sample size reduces your power (the chance of detecting a true effect), so you risk missing a winning variation.
Also, a new site design that interrupts visitors in the middle of their research phase may have a short-term negative impact. Your test must have a strong hypothesis grounded in the context of the experiment. Ensure stakeholders know this.
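To put a rough number on power, here is a sketch using the normal approximation for comparing two conversion rates. The baseline rate of 5%, the variant rate of 5.5% and the visitor counts are hypothetical, and the formula ignores the tiny chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(base_rate, variant_rate, visitors_per_version, alpha=0.05):
    """Approximate chance that a two-sided z-test detects the difference
    between two true conversion rates at the given sample size."""
    norm = NormalDist()
    se = sqrt(
        base_rate * (1 - base_rate) / visitors_per_version
        + variant_rate * (1 - variant_rate) / visitors_per_version
    )
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(abs(variant_rate - base_rate) / se - z_crit)

# Hypothetical example: a true lift from 5.0% to 5.5% conversion.
for visitors in (1_000, 10_000, 50_000):
    print(f"{visitors:>6} visitors per version: power {approx_power(0.05, 0.055, visitors):.0%}")
```

With these made-up rates, 1,000 visitors per version gives well under a 10% chance of spotting the lift, while 50,000 gives over 90%. Halting after one week is the equivalent of choosing the first row.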
A/B testing can be a minefield. Make sure your testing process accounts for the statistical traps that can lose you time and money.
Don’t count your chickens before they’re hatched! You might be losing at half-time and tempted to leave the stadium…