A/B testing is now an integral part of digital marketing.  

But the tests can produce the wrong results if they are not conducted correctly. Here is part one of a three-part series about how you can use data science techniques to avoid making big mistakes with your A/B tests.

Ever since Wired revealed that Amazon ran hundreds, if not thousands, of tests on customers to optimize its site, A/B tests have been hailed as a great way to improve user experience and marketing.  

I mean, what could be better than running two different versions of your ads or your site and letting the clicks decide which one is better?

But in the midst of all the hoopla, people forgot that the tests were based on some well-established mathematical techniques - and many started using them incorrectly.  

They failed to ensure that they had enough test results and did not let the test run to the end. As a result, many A/B tests were not only invalid, but actually led people to make changes which reduced clicks and conversions. 

So, what can you do to avoid this?

Enter data science

Well, one thing you can do is bring the maths back into your testing by using techniques made popular via the recent interest in 'data science'.

What is 'data science'?

According to an accepted definition, data science is the intersection of statistics, hacking, and domain knowledge (digital marketing in our case). Do some of each on a problem and you are, arguably, using data science techniques to solve your issue.

  • The statistics help you frame your test so that before you even start, you know that the test will be valid.  
  • The domain knowledge ensures that when you're finished with the test, you have an actionable result.  
  • And the 'hacking' is about doing what you have to do to make sure you get the right data - without ruining the statistics, of course!

And, yes, using data science can be complicated. So to help you get started, I've put together three ways in which you can use well-accepted data science techniques to improve the reliability of your A/B tests. 

And, though complicated behind the scenes, they aren't hard to use. You put in data, the tools handle the stats, and you get results which are statistically valid - and are far more likely to be correct than if you didn't use data science.

Here are the techniques I'll cover:

  • A/B Test Sizing Calculator
  • Chi-squared Testing
  • Bayesian A/B Testing

And as each of the techniques requires some detailed explanation, I'll just cover one per post. Let's start with the first one...

A/B Test Sizing Calculator

As most people in marketing now know, when you do an A/B test you make a single change to your website or ad and then divert traffic evenly between the two.  

At the end of the test, you then see which performs better and make the change permanent. That is, you create an 'A' and 'B' version, test them against each other, and declare a winner.
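To make that concrete, here is a minimal sketch of one common way to split traffic evenly: assign each visitor to 'A' or 'B' by hashing a visitor ID, so repeat visits always land in the same branch. The function and test names below are purely illustrative, not taken from any particular tool.

```python
import hashlib

def assign_variant(visitor_id: str, test_name: str = "homepage-test") -> str:
    """Deterministically assign a visitor to branch 'A' or 'B'."""
    # Hash the visitor ID (salted with the test name) so each visitor
    # gets a stable, roughly uniform bucket between 0 and 99.
    digest = hashlib.md5(f"{test_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "A" if bucket < 50 else "B"   # 50/50 split

print(assign_variant("visitor-12345"))   # the same visitor always sees the same variant
```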

What many marketers do not know, however, is that the results of A/B tests are often totally wrong. Why is that?

Well, the main reason for this is that many of the differences between the 'A' and 'B' results are just random. 

As with any test, there is natural variation each time you run it - and you need to be able to tell a genuinely significant difference from one which just happens randomly.
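To see how easily random noise can masquerade as a winner, here is a quick simulation (with illustrative numbers, not real campaign data): two variants with exactly the same true conversion rate still produce different observed results on a small sample.

```python
import random

random.seed(1)

# Two variants that are IDENTICAL: both convert at a true rate of 5%.
true_rate = 0.05
visitors_per_branch = 200

conversions_a = sum(random.random() < true_rate for _ in range(visitors_per_branch))
conversions_b = sum(random.random() < true_rate for _ in range(visitors_per_branch))

print(f"A: {conversions_a}/{visitors_per_branch} = {conversions_a / visitors_per_branch:.1%}")
print(f"B: {conversions_b}/{visitors_per_branch} = {conversions_b / visitors_per_branch:.1%}")
# On just 200 visitors per branch, two identical variants routinely differ
# by a few conversions - pure chance that can look like a 'winner'.
```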

But how?

Enter statistics

Well, you can plan an A/B test using statistics and ensure that your test results will be valid before you start.  

Though the maths for this is complicated, there are a number of online A/B test sizing tools which let you benefit from the data science techniques without ever touching Excel, much less an equation.

The A/B Sample Sizer

Have a look at the Sample Sizer on Evan Miller's site. It's simple, but offers very valuable feedback about the test you are about to do.

Basically, you just enter a couple of data points and it tells you how many test subjects you're going to need. Simple, right?

What data points?

OK, so here's where you need your domain knowledge. To use the calculator you need to make some good guesses about your marketing campaign.

You need to tell the calculator your:

  • Typical conversion percentage. How many out of 100 typically convert?  Try to come up with the average figure.
  • The minimum detectable effect. That is, do you want to know if 'B' is 2% better than 'A'?  Or are you OK with just being able to detect a 10% change?

The reason that you need to know these things is that there are two trade-offs when conducting an A/B test.  

  1. The lower your conversion rate, the more test cases you will need.  
  2. The lower the detectable effect, the more test cases you will need.

And that makes sense, if you think about it. I mean, if you want to see whether a really slight change was real - or just random - then you need to look at a lot of examples, right?

So, using the excellent Evan Miller tool, I input some typical values...

...and wow, I need over 1,000 subjects per branch. That is a lot of traffic for most sites - and by the time that many people have come through, the test may no longer be relevant.
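For the curious, here is a minimal sketch of the kind of sample-size calculation such tools perform, using the standard two-proportion formula. The inputs below (a 20% baseline conversion rate, a five percentage-point minimum detectable effect, 95% confidence and 80% power) are illustrative only, and Evan Miller's calculator uses a slightly different method, so expect the numbers to differ a little.

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_branch(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per branch to detect an absolute lift
    of `mde_abs` over a `baseline` conversion rate (two-sided test)."""
    p1, p2 = baseline, baseline + mde_abs
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # e.g. 1.96 for 95% confidence
    z_power = norm.ppf(power)           # e.g. 0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Illustrative inputs: 20% baseline conversion, detect a lift to 25%.
print(sample_size_per_branch(0.20, 0.05))   # roughly 1,100 visitors per branch
```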

What can I do to get around that problem?

Hacking?

Now, here is where hacking may come in useful. To run an A/B test with that many subjects, you may have to get creative with how you get visitors.  

You could increase your ad spend for a few days or entice people to flow to the test from another part of the site. Anything really that will get you the visitors you need to validate your change. 

And if you need justification for the change or the increased spend, you're doing this in the name of science!  If you want good results, then you have to be prepared to do what it takes to get them.

Keep in mind, though, that if you make a major change, you risk jeopardizing the results. Remember, we're only supposed to be testing A and B.  If you add anything new, then you are essentially creating more variations - and then you need even MORE tests.

So...

I wrote about the A/B Test Sizing Calculator first so that you can understand the magnitude of the issue you face when sizing A/B tests. That is, in order for your tests to be statistically valid, you need a LOT of samples - almost certainly more than you think initially.

But knowing how many samples you need doesn’t necessarily solve your problem. So for the next two techniques – Chi-Squared Testing and Bayesian A/B testing – I will cover how to deal with a situation where you just can’t get enough tests to satisfy the sample calculator.

For more on this topic, read our posts on how to avoid an A/B testing nightmare and software recommendations from four ecommerce experts.


Published 6 November, 2014 by Jeff Rajeck

Jeff Rajeck is the APAC Research Analyst for Econsultancy. You can follow him on Twitter or connect via LinkedIn.  



Comments (4)


Pete Austin, CINO at Fresh Relevance

There are two fundamentally different testing use cases:

(1) Tests when designing short-term marketing, for example deciding which email subject line is better for this week's newsletter.

False positives don't matter much here, so you can use small sample sizes.

Suppose there's no real difference between the two subject lines, but you mistakenly decide "A" is better than "B". So what? There's little downside, because you need to send the email anyway and both subjects are similarly good, so you'll get similar results.

Here are some useful links (I've repeated one from the article because it's really good).

http://en.wikipedia.org/wiki/Student%27s_t-test
http://www.evanmiller.org/ab-testing/t-test.html
http://pareonline.net/getvn.asp?v=18&n=10

=========

(2) Tests when making long-term marketing decisions, for example suppose you are optimizing your ROI by varying website design - font, colors, spacing, order of items on menus etc.

False positives are a major problem here, so you must use big sample sizes and check results by repeating your experiments.

In this case, suppose you choose from 20 shades of blue for your buttons and decide #19 is better than the others. But that's a mistake: the truth is that all the shades are similarly good, and #19 was only the winner by chance. But you believe the test and standardize on this shade of blue.

There's a huge downside here, long term, because you've reduced your designers' freedom to produce a good-looking website, for no good reason. Do this enough times, on all the tiny attributes of your website, and it's going to look worse and perform worse.

In this use case, "false positives", where you think a decision had a beneficial effect on ROI when in fact it makes no difference, can be really bad, so you need to use great care and big samples.
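As a quick illustration of Pete's point (a simulation with illustrative numbers only): if you compare 20 variants that are all genuinely identical, one of them will still come out on top by chance.

```python
import random

random.seed(7)

# 20 button shades that ALL convert at the same true rate of 5%.
true_rate = 0.05
visitors_per_shade = 500

results = []
for shade in range(1, 21):
    conversions = sum(random.random() < true_rate for _ in range(visitors_per_shade))
    results.append((conversions / visitors_per_shade, shade))

best_rate, best_shade = max(results)
print(f"'Winner': shade #{best_shade} at {best_rate:.1%}, "
      f"even though every shade's true rate is {true_rate:.0%}")
# The more variants you compare, the more likely a purely random 'winner' -
# which is why this use case needs big samples and repeated experiments.
```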

The following link seems pretty good for use case 2 (PDF):
http://www.qubitproducts.com/sites/default/files/pdf/most_winning_ab_test_results_are_illusory.pdf

about 3 years ago


Jeff Rajeck, Research Analyst at Econsultancy

Thanks for that lengthy, and well-written addition!

I agree with your distinction. A lot of the off-the-cuff marketing falls into category 1), so you may not have to worry about the stats every time.

But I think mistakes with category 2) happen all of the time. We *think* we know when one test outperforms - and that belief is encouraged by analytics and ad platforms. (For this reason, I never let FB or LI optimize which ad to show).

So a 2 second review of your results using Evan Miller's calculators can at least help us to question 'obvious' results in a good way.

That Qubit link is great. Another useful talk is Jason Cohen's video about A/A testing - and why testing '41 shades of blue' is pretty much always a bad idea.

http://businessofsoftware.org/2013/06/jason-cohen-ceo-wp-engine-why-data-can-make-you-do-the-wrong-thing/

about 3 years ago


Daniel Lee, Web Analytics Manager at Evans Cycles

Great article - thanks for writing!

I personally believe that where most testers go wrong with A/B testing is:
1) Stopping the test far too quickly, as soon as they get the result they want to see
2) Not running A/B tests which solve issues visitors are experiencing
3) Not understanding that you should test big changes to reach the optimal result sooner, and test small changes to understand which content drives the desired effects

There are also other things users can do to get more traffic onto a page which doesn't receive much, such as linking to it from a busier page or, as you mentioned, investing in paid traffic to the desired page.

about 3 years ago


Jeff Rajeck, Research Analyst at Econsultancy

Totally agree Daniel.

I think the initial approach in A/B testing - well MY initial approach anyway - was to test and change things that were easy. Button color, copy, etc.

What I now realize is that a lot of the 'winners' in those tests were illusory - and the button color really didn't matter.

Big changes, however, produce much better results - and not always good ones. Which is why we really need A/B tests!

about 3 years ago
