Using data science with A/B tests: Sample sizing

Ever since Wired revealed that Amazon ran hundreds, if not thousands, of tests on customers to optimize its site, A/B tests have been hailed as a great way to improve user experience and marketing.

I mean, what could be better than running two different versions of your ads or your site and letting the clicks decide which one is better?

But in the midst of all the hoopla, people forgot that the tests were based on some well-established mathematical techniques – and many started using them incorrectly.

They failed to ensure that they had enough test results and did not let the test run to the end. As a result, many A/B tests were not only invalid, but actually led people to make changes which reduced clicks and conversions.

So, what can you do to avoid this?

Enter data science

Well, one thing you can do is bring the maths back into your testing by using techniques made popular via the recent interest in ‘data science’.

What is ‘data science’?

According to an accepted definition, data science is the intersection of statistics, hacking, and domain knowledge (digital marketing in our case). Do some of each on a problem and you are, arguably, using data science techniques to solve your issue.

The statistics help you frame your test so that before you even start, you know that the test will be valid.
The domain knowledge ensures that when you’re finished with the test, you have an actionable result.
And the ‘hacking’ is about doing what you have to do to make sure you get the right data – without ruining the statistics, of course!

And, yes, using data science can be complicated. So to help you get started, I’ve put together three ways in which you can use well-accepted data science techniques to improve the reliability of your A/B tests.

And, though complicated behind-the-scenes, they aren’t hard to use. You put in data, the tools handle the stats, and you get results which are statistically valid – and are far more likely to be correct than if you didn’t use data science.

Here are the techniques I’ll cover:

A/B Test Sizing Calculator
Chi-squared Testing
Bayesian A/B Testing

And as each of the techniques requires some detailed explanation, I’ll just cover one per post. Let’s start with the first one…

A/B Test Sizing Calculator

As most people in marketing now know, when you do an A/B test you make a single change to your website or ad and then split traffic evenly between the original and the changed version.

At the end of the test, you then see which performs better and make the change permanent. That is, you create an ‘A’ and ‘B’ version, test them against each other, and declare a winner.

What many marketers do not know, however, is that the results of A/B tests are often totally wrong. Why is that?

Well, the main reason for this is that many of the differences between the ‘A’ and ‘B’ versions are just random.

As with any test, the results vary by some amount each time you run it, even when nothing has changed – and you need to be able to tell the difference between a significant difference and one which just happens randomly.
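To see how big that random variation can be, here is a minimal sketch that simulates an ‘A/A’ test, where both branches are identical and any gap between them is pure chance. The 5% conversion rate, the 200 visitors per branch, and the function name are all made-up values for illustration, not figures from a real campaign.

```python
import random

def simulate_aa_test(true_rate, visitors_per_branch, seed):
    """Simulate one A/A test: both branches share the same true conversion rate."""
    rng = random.Random(seed)  # seeded so each run is reproducible
    conversions_a = sum(rng.random() < true_rate for _ in range(visitors_per_branch))
    conversions_b = sum(rng.random() < true_rate for _ in range(visitors_per_branch))
    return conversions_a / visitors_per_branch, conversions_b / visitors_per_branch

# Hypothetical numbers: a 5% conversion rate and 200 visitors per branch.
for seed in range(5):
    rate_a, rate_b = simulate_aa_test(0.05, 200, seed)
    print(f"run {seed}: A = {rate_a:.1%}, B = {rate_b:.1%}, gap = {rate_b - rate_a:+.1%}")
```

Any gap printed here comes from chance alone, because ‘A’ and ‘B’ are the same page. That is the noise a real test has to rise above.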

But how?

Enter statistics

Well, you can plan an A/B test using statistics and ensure that your test results will be valid before you start.

Though the maths for this is complicated, there are a number of A/B test sizing tools online which let you benefit from the data science techniques without ever touching Excel, much less an equation.

The A/B Sample Sizer

Have a look at the Sample Sizer on Evan Miller’s site. It’s simple, but offers very valuable feedback about the test you are about to do.

Basically, you just enter a couple of data points and it tells you how many tests you’re going to need. Simple, right?

What data points?

OK, so here’s where you need your domain knowledge. To use the calculator you need to make some good guesses about your marketing campaign.

You need to tell the calculator your:

Typical conversion percentage. How many out of 100 typically convert? Try to come up with the average figure.
The minimum detectable effect. That is, do you want to know if ‘B’ is 2% better than ‘A’? Or are you OK with just being able to detect a 10% change?

The reason that you need to know these things is that there are two trade-offs when conducting an A/B test.

The lower your conversion rate, the more test cases you will need to detect the same relative improvement.
The lower the detectable effect, the more test cases you will need.

And that makes sense, if you think about it. I mean, if you want to see whether a really slight change was real – or just random – then you need to look at a lot of examples, right?
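If you are curious what a calculator like this does under the hood, the two trade-offs above can be sketched with a standard textbook formula for comparing two proportions. This is an approximation under stated assumptions, not necessarily the exact calculation any particular online tool uses: the function name is my own, and the 5% significance level and 80% power are common defaults that I am assuming here.

```python
import math
from statistics import NormalDist

def sample_size_per_branch(baseline_rate, relative_effect, alpha=0.05, power=0.80):
    """Approximate subjects needed per branch for a two-sided test of two proportions.

    baseline_rate   -- typical conversion rate of 'A', e.g. 0.05 for 5%
    relative_effect -- minimum detectable effect as a relative lift, e.g. 0.10 for 10%
    alpha, power    -- assumed defaults (5% significance, 80% power)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_effect)  # rate 'B' must reach to count as a win
    delta = p2 - p1
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / delta ** 2)

# Hypothetical inputs, chosen only to show the two trade-offs:
print(sample_size_per_branch(0.20, 0.10))  # higher conversion rate, 10% lift
print(sample_size_per_branch(0.05, 0.10))  # lower conversion rate: more subjects
print(sample_size_per_branch(0.05, 0.02))  # smaller detectable effect: far more
```

Each step down, first in conversion rate and then in detectable effect, pushes the required number of subjects up, which is exactly the pattern the two rules describe.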

So, using the excellent Evan Miller tool, I input some typical values…

…and wow, I need over 1,000 subjects per branch. That is a lot of tests for most sites – and by the time that many people come through, the test may no longer be relevant.
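To put a number on "by the time that many people come through", here is a rough sketch that turns a required sample size into a test duration. The daily visitor figures are hypothetical, and the function assumes every visitor enters the test and traffic is split evenly between branches.

```python
import math

def days_to_finish(subjects_per_branch, daily_visitors, branches=2):
    """Estimate how many whole days a test needs to collect enough subjects.

    Assumes all daily visitors enter the test and are split evenly across branches.
    """
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    return math.ceil(subjects_per_branch * branches / daily_visitors)

# Hypothetical traffic levels for a test needing 1,000 subjects per branch:
for visitors in (50, 100, 500):
    print(f"{visitors} visitors/day -> {days_to_finish(1000, visitors)} days")
```

On a small site, a test of this size can easily stretch into weeks, which is why the visitor count matters so much.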

What can I do to get around that problem?

Hacking?

Now, here is where hacking may come in useful. To run an A/B test with that many examples, you may have to get creative with how you get visitors.

You could increase your ad spend for a few days or entice people to flow to the test from another part of the site. Anything really that will get you the visitors you need to validate your change.

And if you need justification for the change or the increased spend, you’re doing this in the name of science! If you want good results, then you have to be prepared to do what it takes to get them.

Keep in mind, though, that if you make a major change, you risk jeopardizing the results. Remember, we’re only supposed to be testing A and B. If you add anything new, then you are essentially creating more variations – and then you need even MORE tests.

So…

I wrote about the A/B Test Sizing Calculator first so that you can understand the magnitude of the issue you face when sizing A/B tests. That is, in order for your tests to be statistically valid, you need a LOT of samples – almost certainly more than you think initially.

But knowing how many samples you need doesn’t necessarily solve your problem. So for the next two techniques – Chi-Squared Testing and Bayesian A/B testing – I will cover how to deal with a situation where you just can’t get enough tests to satisfy the sample calculator.

For more on this topic, read our posts on how to avoid an A/B testing nightmare and software recommendations from four ecommerce experts.