Using data science with A/B tests: chi-squared testing

In my previous posts about A/B testing, I made the case that you need to consider the math behind A/B testing, or risk having invalid, or even wrong, results.

My first suggestion was to size the test beforehand, but that often requires a lot of tests.

Here’s how to do something similar without nearly as many.

By Jeff Rajeck November 5th 2014 05:45

With just a little bit of analysis you can check the validity of your A/B test before you even conduct it. One way you can achieve this is by sizing the test beforehand.

You tell an online tool your typical conversion percentage and what minimum detectable effect you’re looking for – and the tool tells you how many tests you need to conduct.

But what if the number of tests is just far too many for you? How else can you validate your A/B test statistically?

In order to answer that question, it’s necessary to have a quick look at the theory behind A/B testing – with some examples to make things clear. And, again, we’re going to use the data science model of using statistics, domain knowledge, and hacking to arrive at our answer.

First, the stats

So when you do anything repeatedly you will get different results, of course. But these ‘different’ results will typically follow a pattern.

One set of results will happen a lot more frequently than the others. And if you graph how often particular results occur, you will end up with a ‘distribution’ of results.

To make this point clearer, check out the graph below. It shows the ‘distribution’ of the number of heads when flipping a coin 10 times. The most common result is five heads, but sometimes you get extreme values – one or even 10. The distribution is like a hill with its peak in the middle.
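If you would like to reproduce that hill yourself, here is a minimal sketch in Python. It assumes a fair coin (heads half the time) and uses the standard binomial formula. The function name heads_distribution is made up for this illustration.

```python
from math import comb

def heads_distribution(n_flips, p_heads):
    """Probability of seeing k heads, for k = 0..n_flips (binomial formula)."""
    return [comb(n_flips, k) * p_heads**k * (1 - p_heads)**(n_flips - k)
            for k in range(n_flips + 1)]

# Assumption: a fair coin, so heads comes up with probability 0.5.
coin = heads_distribution(10, 0.5)

# Print a rough text version of the 'hill'.
for heads, prob in enumerate(coin):
    print(f"{heads:2d} heads: {prob:6.1%} {'#' * round(prob * 100)}")
```

Five heads is the single most likely outcome, but it still only happens about a quarter of the time, which is why the results spread out into a hill rather than a spike.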

Now say instead of flipping a coin you flipped a pan with a cast-iron bottom. It could come up face-down, but it would be far more likely to land with the heavy part on the ground.

And maybe the distribution for flipping it 10 times would look like this:

And if we compared the two on the same graph, it would be obvious that we were flipping two very different things – even though they overlapped a few times.
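To put a rough number on that comparison, here is a small sketch that builds both distributions and measures how much they overlap. The 90% chance of the pan landing heavy side down is purely an assumed figure for illustration, and flip_distribution is a hypothetical helper name.

```python
from math import comb

def flip_distribution(n_flips, p):
    """Binomial probabilities of k 'successes' in n_flips, for k = 0..n_flips."""
    return [comb(n_flips, k) * p**k * (1 - p)**(n_flips - k)
            for k in range(n_flips + 1)]

coin = flip_distribution(10, 0.5)  # fair coin: heads half the time
pan = flip_distribution(10, 0.9)   # ASSUMED: heavy side lands down 90% of the time

# Overlap = the probability mass the two distributions share.
# 0 means they never produce the same results, 1 means they are identical.
overlap = sum(min(c, p) for c, p in zip(coin, pan))
print(f"Shared probability mass: {overlap:.1%}")
```

Under these assumptions the two hills share only a small slice of their area, which is the numeric version of "obviously two very different things".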

And this is what we want to know about our A/B tests – was one significantly different than the other? Or were they so similar that the difference we observe could just be random?

This is what chi-squared testing tells you. It can tell you whether B was statistically different from A – or whether it’s too close to call. And it can do this with remarkably few samples.

Let’s have a look at a real example.

Chi-squared testing

So we’re going to do exactly the same thing we did above – except with conversions instead of coins and frying pans.

Think of the ‘A’ test as the coins, and the ‘B’ test as the frying pans. In the flipping tests above, there was a clear difference between the two. Now we want to know whether it is the same for your conversions.

Running the test

Though the math behind chi-squared testing is quite complicated, there is no need to worry about the details, as all of the calculations are handled by the tool.

And there’s nothing to interpret, either. The test examines your results and tells you immediately whether the results are different. That is, whether they have a significantly different ‘distribution’ – just like the coins and the frying pans.

Now, it won’t tell you the magnitude of the difference – just whether the difference is significant. But the trade-off is that you don’t need nearly as many samples as you did with the test sample sizer.

So how to use it

Here’s where your domain knowledge comes in useful – you need to run the tests and input the results.

First go to Evan Miller’s A/B testing tools, and select the Chi-Squared Test.
Then input your data – what are the two results you would like to compare?
And – boom – you get your results immediately.
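For the curious, the kind of calculation behind such a test can be sketched in a few lines of Python. This is a plain Pearson chi-squared test on a 2x2 table with no continuity correction. It is not necessarily the exact calculation the online tool performs, and the visitor and conversion counts are hypothetical.

```python
from math import erfc, sqrt

def chi_squared_ab(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test on a 2x2 table (converted / not converted).

    Returns (statistic, p_value). Assumes at least one conversion and one
    non-conversion overall, otherwise an expected count would be zero.
    """
    observed = [[conv_a, n_a - conv_a],
                [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect if A and B converted at the same rate.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With one degree of freedom, the p-value is erfc(sqrt(statistic / 2)).
    return statistic, erfc(sqrt(statistic / 2))

# Hypothetical results: 1,000 visitors per variant, 50 vs 75 conversions.
stat, p = chi_squared_ab(50, 1000, 75, 1000)
confidence = 0.95
print(f"chi-squared = {stat:.2f}, p-value = {p:.4f}")
print("significant" if p < 1 - confidence else "too close to call")
```

The smaller the p-value, the less likely it is that a gap this large would show up by chance alone if A and B really converted at the same rate.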

The hacking

If the tool reports no significant difference, it doesn’t mean that your ‘A’ and ‘B’ results aren’t different – they almost certainly are – but the differences you see could have happened randomly.

Now, you can hack the test – and reduce the confidence level and see whether that ‘helps’ find a difference. This works because a lower confidence level will declare a winner on a smaller difference than a higher one would.

Do heed the warning, though, that this level is the % of time the test is ‘right’. That is, if you set the confidence level to 90%, the test is only ‘right’ nine times out of 10*.
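Here is a small sketch of how the verdict can flip as the confidence level drops. It uses the shortcut chi-squared formula for a 2x2 table, and the counts are hypothetical numbers chosen to sit near the borderline.

```python
from math import erfc, sqrt

def p_value_2x2(conv_a, n_a, conv_b, n_b):
    """p-value of a Pearson chi-squared test (1 degree of freedom, no correction)."""
    a, b = conv_a, n_a - conv_a   # A: converted, not converted
    c, d = conv_b, n_b - conv_b   # B: converted, not converted
    n = n_a + n_b
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(statistic / 2))

# Hypothetical results: 1,000 visitors per variant, 50 vs 68 conversions.
p = p_value_2x2(50, 1000, 68, 1000)

verdicts = {}
for confidence in (0.99, 0.95, 0.90):
    # A result counts as 'significant' when the p-value is below 1 - confidence.
    verdicts[confidence] = p < 1 - confidence
    label = "significant" if verdicts[confidence] else "too close to call"
    print(f"{confidence:.0%} confidence: {label}")
```

The data never changes here, only the bar it has to clear, so a "win" that appears only at the lowest setting deserves some suspicion.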

Alternatively, you can head back to the sizing calculator and find out how many tests it would take to prove such a difference. Too often, though, the traffic required is far beyond our hacking capabilities.
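To get a feel for how much traffic that can mean, here is a rough sketch using a common textbook approximation for comparing two conversion rates. It is not necessarily the formula the sizing calculator uses, and the 80% power setting, the function name, and the conversion rates are all assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def visitors_per_variant(p_a, p_b, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant to detect p_a vs p_b (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    spread = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_power * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b)))
    return ceil(spread ** 2 / (p_a - p_b) ** 2)

# Hypothetical rates: 5.0% for 'A' versus 6.8% for 'B'.
needed = visitors_per_variant(0.05, 0.068)
print(f"Roughly {needed:,} visitors per variant")
```

Notice that the required traffic grows quickly as the gap between the two rates shrinks, because the difference appears squared in the denominator.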

So…

The best way to start with chi-squared testing is to fiddle with the numbers so you get a feel for what is – and what is not – significant. It’s very illuminating and will probably cast your mind back to previous tests which may have been way too close to call.

But what about times when you don’t have many samples – and chi-squared testing is inconclusive? Is there anything else you can do with low traffic and small(ish) differences in results?

There is. Stay tuned for Part III – which uses a technique developed by a famous English Presbyterian minister (who also happened to be a statistician!).

* Stats experts may cringe at that explanation, but it is close enough for our purposes!