On the “scientist” front, a shaky understanding of statistics can be an obstacle. CROs tend to be good systematic thinkers, but even “real” scientists can botch statistics. To help prevent this, we’ve put together six common ways CROs get their statistics wrong.

1. Not understanding statistical significance

Statistical significance is poorly understood by most marketers, and to their credit, CROs tend to be better informed than others. Despite that, most CROs still lack a formal education in statistics, so it’s not surprising that there are some misconceptions about what statistical significance is, why it’s important, and the practical implications that come with it.

Nowadays, most CROs are using split testing tools that either tell them when their test has reached statistical significance, or tell them what the statistical significance of their test currently is. CROs generally assume they are safe to make a decision if their tool says that they have reached statistical significance, or if their tool is giving them at least 95% statistical confidence.

Google Optimize, for example, offers a “Probability to be best” and a “Probability to beat baseline.”

[Image: Google Optimize statistical significance report]

And Optimizely reports a statistical significance status:

[Image: Optimizely statistical significance status]

But what you might not realize is that these tools are reporting entirely different kinds of statistical significance, and yes, there can be real-world consequences if you fail to understand this.

For example, most CROs would probably assume that the 91% in the Optimizely report for the loser above has essentially the same meaning as the Google “Probability to beat baseline.” But it doesn’t.

This is because Google Optimize is built on Bayesian statistics, while Optimizely, like most tools, is built on frequentist statistics.

So, what does that mean, and why does it matter?

Well, Bayesian statistics have a more intuitive meaning. The “probability to be best” and “probability to beat baseline” are exactly what they sound like.

But the frequentist model that most tools are based on is much more counterintuitive. To say that your test has reached 95% confidence isn’t to say that there’s a 95% chance your landing page will outperform the original.

Instead, 95% statistical confidence means that if your A and B pages actually performed identically in the long run, there would only be a 5% chance of seeing a difference at least this large in your results. You can’t invert this to mean there’s a 95% chance the better-performing page will keep performing better in the long run.

What this means, practically speaking, is that the Bayesian method is more conservative, and that what most tools report as statistical confidence is not the kind of confidence many CROs think it is.
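To make the contrast concrete, here is a minimal Python sketch, using made-up visitor and conversion counts, that computes both readings for the same data: a frequentist two-proportion z-test and a Bayesian “probability to beat baseline” from Beta posteriors. This is not how Google Optimize or Optimizely compute their numbers internally; it simply illustrates the two interpretations.

```python
# Illustrative numbers only, not from any real test.
import numpy as np
from scipy import stats

# Hypothetical results: visitors and conversions for control (A) and variant (B)
visitors_a, conversions_a = 10_000, 210
visitors_b, conversions_b = 10_000, 252

# Frequentist view: two-proportion z-test. The p-value is the chance of seeing
# a difference at least this large IF the two pages actually convert identically.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (conversions_b / visitors_b - conversions_a / visitors_a) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided
print(f"frequentist p-value: {p_value:.4f}")
print(f"roughly what many tools report as 'significance': {1 - p_value:.1%}")

# Bayesian view: sample from Beta posteriors (uniform priors) and ask how
# often B beats A -- a direct "probability to beat baseline".
rng = np.random.default_rng(42)
post_a = rng.beta(conversions_a + 1, visitors_a - conversions_a + 1, 100_000)
post_b = rng.beta(conversions_b + 1, visitors_b - conversions_b + 1, 100_000)
print(f"P(B beats A): {(post_b > post_a).mean():.1%}")
```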

Setting aside which form of statistical significance you are using, the most common error CROs make is assuming that statistical significance equals certainty. No matter what your decision rule is or how strong your statistical significance, there is always the possibility that your result was due to chance. If your long-term results aren’t living up to your expectations, odds are the test result was a fluke. CROs need to accept that this happens and adapt accordingly.

2. Getting fooled by bots

Bot traffic can skew both analytics data and split test data. When bots are sent to one version of a page but not the other, the test inevitably returns false results. When those results look good for the new page, CROs often run with them and end up publishing a poorly performing page. Bots can also skew results the other way, causing you to rule out high-performing landing pages.

This happened to BigTreeMarketing when bot traffic made it look like their new landing page was underperforming the control, even though the new page was only a very minor alteration to a previously high-performing page:

[Image: BigTreeMarketing analytics showing bot traffic]

If they hadn’t realized that a performance-monitoring bot their client was using was being sent to the new product page but never the old one, they would have ended up discarding the new product page. That would have been a mistake, because after filtering out the bot traffic, the new page performed better:

[Image: BigTreeMarketing test results after filtering out bot traffic]

Keep an eye out for the following signals to identify whether bot traffic is interfering with your tests (a simple detection sketch follows the list):

  • Unusual spikes in traffic, especially if they don’t correspond to any change in overall conversions or sales
  • Dramatic shifts in other behavioral metrics, such as bounce rate or time on site or page
  • Increases in direct traffic, or traffic from unusual sources
  • Traffic from locations or language settings that your audience doesn’t typically come from
  • Unusual shifts in the number of visits using specific browsers or devices
  • An increase in the number of pages visited per user, especially if the time on site is short
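If you can export daily traffic per variant, a simple anomaly check can surface suspicious spikes automatically. The sketch below assumes a pandas DataFrame with hypothetical date and sessions columns and flags days that jump well above the recent trend; your own column names and thresholds will differ.

```python
# A rough sketch for flagging suspicious traffic spikes, assuming you can
# export daily sessions per landing-page variant from your analytics tool.
import pandas as pd

def flag_traffic_spikes(daily: pd.DataFrame, window: int = 14, threshold: float = 3.0) -> pd.DataFrame:
    """Flag days where sessions jump well above the recent trend."""
    out = daily.copy()
    rolling = out["sessions"].rolling(window, min_periods=window)
    # Compare each day with the mean/std of the *previous* `window` days.
    out["zscore"] = (out["sessions"] - rolling.mean().shift(1)) / rolling.std(ddof=0).shift(1)
    out["suspected_bot_spike"] = out["zscore"] > threshold
    return out

# Example usage with fabricated data (note the jump on days 23-24):
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=30, freq="D"),
    "sessions": [480, 510, 495, 505, 520, 500, 490, 515, 505, 498,
                 512, 507, 489, 503, 510, 495, 508, 502, 517, 499,
                 505, 492, 2400, 2350, 515, 505, 498, 530, 525, 510],
})
print(flag_traffic_spikes(daily)[["date", "sessions", "suspected_bot_spike"]].tail(12))
```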

In addition to monitoring your site for bot activity, set up your split tests so that visitors are assigned to each page variant by exactly the same mechanism. That way, even if you are receiving bot traffic, it will at least be split evenly between your page variants.
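Most testing tools handle assignment for you, but if you control the bucketing yourself, a deterministic hash of the visitor ID is one way to guarantee that every visitor, bot or human, is split by exactly the same rule. A minimal sketch, with a hypothetical experiment name:

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str = "landing-page-test") -> str:
    """Hash the visitor ID so every visitor (bot or human) is bucketed
    by the same rule, in a stable 50/50 split."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

print(assign_variant("visitor-123"))  # the same visitor always gets the same page
print(assign_variant("visitor-123"))
```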

3. Thinking correlation is causation

Mistaking the fact that two things tend to happen together for evidence that one causes the other is an easy trap to fall into. The SEO industry infamously interpreted a high correlation between Google +1s and search engine rankings as a ranking factor, but Google’s own Matt Cutts quickly told them they were wrong:

[Image: Matt Cutts’ response on Hacker News]

In simple terms, correlation is a measure of how often two things seem to go together. For example, you would typically expect a relatively high correlation between the amount of traffic you receive and the amount of sales you earn. But a correlation between two things doesn’t necessarily mean that one causes the other.

One obvious example would be a backwards interpretation of the correlation between traffic and sales. The direction of causation seems obvious: more traffic tends to cause more sales. But correlation statistics can’t tell the difference between that and more sales causing more traffic. In fact, we shouldn’t be too quick to assume that none of the correlation runs in that direction. After all, some buyers may tell their friends about the purchase, which could increase traffic. Correlation on its own can’t tell us the difference.

Correlation can also reflect a shared cause. For example, changing the landing page might increase both conversions and time on site. If we weren’t aware of the landing page change, we could falsely conclude from the resulting correlation that increasing time on site would increase conversions, or vice versa, which isn’t necessarily true.

Finally, correlation can also be entirely spurious. For example, two well-known online trends are that mobile traffic is increasing and digital ad spending is increasing. Since they are both increasing, statistical analysis would find a correlation between them. That doesn’t mean they have any cause-effect relationship at all. They simply both happen to be increasing with time.
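You can see how easily this happens with a few lines of Python. The sketch below fabricates two independent upward trends, stand-ins for mobile traffic and ad spend, and shows that their correlation comes out high purely because both grow over time.

```python
# Fabricated numbers: two unrelated series that both trend upward.
import numpy as np

months = np.arange(24)
rng = np.random.default_rng(0)
mobile_sessions = 10_000 + 400 * months + rng.normal(0, 500, 24)
ad_spend = 2_000 + 90 * months + rng.normal(0, 150, 24)

r = np.corrcoef(mobile_sessions, ad_spend)[0, 1]
print(f"Pearson r: {r:.2f}")  # typically well above 0.9, despite no causal link
```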

Many of us who operate on the more technical side of online marketing, such as CRO and SEO, are familiar with the phrase “correlation is not causation.” Unfortunately, being familiar with the phrase doesn’t prevent us from falling into this trap if we don’t take deliberate steps to avoid it.

When we make this mistake, we end up ritualistically performing tasks that don’t help (or even hurt) results, or falsely attributing negative results to actions that may actually have had a positive impact.

To systematically avoid falling into the correlation=causation trap, do the following where possible:

When you identify a correlation, don’t decide on a cause; hypothesize one

The difference is that a hypothesis is testable. While split testing isn’t always a possibility, experimenting in some way always is. Never assume that a prior correlation will continue forever. Instead, periodically test your hypothesis against the data.

Example: If you find that increasing your blog posting schedule correlates with an increase in sales, and you hypothesize that this is the cause, you shouldn’t just increase your blog posting schedule indefinitely under this assumption. You should increase it and measure if there is a corresponding increase in sales. Likewise, if for some reason you reduce your posting schedule because you no longer believe it increases sales, you should measure whether there is a negative impact on sales.

Do run a split test wherever possible, at least for any correlation that seems to have a meaningful impact on your KPIs

Just as importantly, when you can’t run a split test, always remember that your hypothesis about the cause of a correlation hasn’t been verified experimentally.

Example: If you see a few case studies showing that testimonials featuring people’s faces boost sales, and you simply implement the change without split testing, you can’t be sure any change in sales is actually the result of including those face testimonials; the correlation could be spurious. An A/B or multivariate test is the only way to confirm it.

When you can’t run a split test, control for variables using a method like regression analysis, if you have access to the appropriate statistical tools

This way you can at least rule out the most obvious alternative explanations for the correlation.

Example: Let’s say you identify a correlation between time on site and sales. You might be tempted to design landing pages that maximize time on site in order to boost sales, but you don’t have any way of directly testing that this will work. You can split test pages for time on site and for sales, but there’s no experiment you can run specifically to determine if time on site directly causes an increase in sales. What you can do, however, is use regression analysis to control for at least some other variables such as bounce rate, traffic source, landing page, and so on. While you can’t entirely rule out spurious correlation, you can at least rule out other obvious causes.
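As a rough illustration, here is what that regression might look like with statsmodels, assuming you can export daily data segmented by traffic source with hypothetical columns for sales, average time on site, bounce rate, and traffic source. The coefficient on time on site is the association that remains after the controls; it still isn’t proof of causation.

```python
# A hedged sketch of controlling for other variables with ordinary least squares.
import pandas as pd
import statsmodels.formula.api as smf

# df would normally come from your analytics export; a tiny fabricated
# sample is shown here just so the snippet runs.
df = pd.DataFrame({
    "sales":          [12, 18, 9, 22, 15, 11, 25, 8, 19, 14, 21, 10],
    "time_on_site":   [95, 140, 80, 180, 120, 90, 200, 70, 150, 110, 170, 85],  # avg seconds
    "bounce_rate":    [0.62, 0.48, 0.70, 0.40, 0.55, 0.65, 0.35, 0.72, 0.45, 0.58, 0.42, 0.68],
    "traffic_source": ["ads", "organic", "ads", "organic", "email", "ads",
                       "organic", "email", "organic", "ads", "email", "ads"],
})

# Regress sales on time on site while holding bounce rate and traffic source
# constant; the time_on_site coefficient is the association left after controls.
model = smf.ols("sales ~ time_on_site + bounce_rate + C(traffic_source)", data=df).fit()
print(model.summary())
```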

None of this should be allowed to get so cumbersome that it inhibits your ability to act, but even being mindful of the limitations of your knowledge will make you a far better conversion rate optimizer.

4. Confusing statistical and practical significance

Returning to statistical significance for a moment, it’s important to recognize that just because an effect has strong statistical significance, this doesn’t mean it’s practically meaningful.

You can run a split test on two landing pages and achieve a result with 99.9999% statistical significance, but if the end result is that your conversion rate has increased from 2.1% to 2.2%, you may consider the test an important learning experience, but it’s a failure in terms of producing a meaningful business impact.
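A quick calculation with made-up numbers shows how this happens: at a large enough sample size, a lift from 2.1% to 2.2% becomes overwhelmingly “significant” even though the absolute gain is a tenth of a percentage point.

```python
# Hypothetical numbers: a practically tiny lift with an enormous sample.
import numpy as np
from scipy import stats

n = 2_000_000                      # visitors per variant
conv_a, conv_b = 42_000, 44_000    # 2.1% vs 2.2%

p_pool = (conv_a + conv_b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (conv_b / n - conv_a / n) / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"absolute lift: {conv_b / n - conv_a / n:.3%}")  # 0.100 percentage points
print(f"p-value: {p_value:.1e}")                        # far below any usual threshold
```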

According to research by Optimizely, most split tests result in a minimal change in outcome:

[Image: Optimizely research on the distribution of A/B test results]

Here we see that most changes hover very close to zero. The average change is +6%, statistically insignificant, and the median change is actually -1%.

While this conclusion is fairly obvious, CROs still often confuse the practical and statistical significance of their tests.

We’ve all caught on to the idea that we shouldn’t act on a split test unless we have a statistically significant result, but many CROs, especially marketers who are just starting to dabble in testing, confuse this with the idea that we need to run every test until it reaches statistical significance.

The unfortunate reality is that the practical difference in performance between two alternate pages is often so negligible that you will never be able to feasibly run a test long enough to determine a winner.

If your conversion rates are jockeying back and forth for over a month and no meaningful difference in performance is arising, odds are your time is better invested in a different test. If the difference in performance is meaningful, it shouldn’t take a massive sample size or an extended period of time to identify a winner.
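A rough power calculation makes the point. The sketch below uses statsmodels to estimate, under standard assumptions (95% confidence, 80% power, a hypothetical 2% baseline conversion rate), how many visitors per variant you would need to detect different relative lifts; the smaller the true difference, the more impractical the test becomes.

```python
# A rough sample-size sketch: smaller lifts require far more traffic.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.02  # assumed 2% conversion rate on the control
for lift in (0.02, 0.10, 0.50):  # relative lifts of 2%, 10%, 50%
    effect = proportion_effectsize(baseline * (1 + lift), baseline)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                     alternative="two-sided")
    print(f"{lift:>4.0%} relative lift -> ~{n:,.0f} visitors per variant")
```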

A closely related mistake is taking the old “test one thing at a time” adage too seriously.

While it’s true that you should test only one thing at a time, the one thing you should test is the thing you want to learn about and improve, which can be as small or as big as you are interested in.

If you want to find out if the color of a button is going to impact your conversion rate, then yes, you should only change the color of the button and nothing else.

But if you want to test the central message of a landing page, you should probably be testing two entirely different landing pages designed around that messaging, rather than creating a hodgepodge of messages by changing only one page element at a time.

The truth is that, with few exceptions, small things like changing the color of a button rarely have a practically significant impact, and you should rarely be testing things at a level that granular unless you have already tested far more impactful things like the central messaging of a page.

Do not run every test until it reaches statistical significance, and always test the largest practical change you are interested in learning about. Abandon “failed” experiments early and use the knowledge that the impact was minimal to inform your future experiments.

5. Disregarding traffic source

Visitors from different traffic sources behave differently, and if you don’t take this into consideration when you develop and evaluate tests, you can end up shooting yourself in the foot in the long term.

Consider the following scenario. You develop a high-pressure landing page to test against a landing page with a much softer sell. You run an A/B test using AdWords traffic and find that the high-pressure landing page performs better. After updating the page, you shut off your AdWords campaign because the ROI still isn’t justifiable. But after ending the campaign, you find that your conversion rates plummet compared to where they were before you ran the campaign, and your overall revenue goes through the floor.

What happened?

Your AdWords traffic was primed to buy. The hard sell worked on this audience because the people clicking your AdWords ads were searching for higher-converting keywords and clicking on ads with high-pressure copy. This was the traffic that was being converted by your new landing page.

When you canceled the AdWords campaign, all that was left was traffic from other sources. Those visitors responded worse to the new hard-sell landing page than to the soft-sell page, so your overall performance went down once the AdWords traffic was removed.

If you fail to incorporate traffic source into your split tests, you will not be able to predict when things like this will happen. While it isn’t always necessary to reach statistical significance for every traffic source, you should at least track conversion rates for each source, so that you can be reasonably sure which traffic sources show an improvement, and which don’t.
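If you have session-level test data, segmenting results by source is straightforward in pandas. The sketch below uses hypothetical column names; the point is simply to report conversion rate per variant within each traffic source rather than only in aggregate.

```python
# A minimal sketch for tracking conversion rate by variant AND traffic source.
import pandas as pd

sessions = pd.DataFrame({
    "variant":        ["A", "B", "A", "B", "B", "A", "B", "A", "B", "A"],
    "traffic_source": ["adwords", "adwords", "organic", "organic", "adwords",
                       "email", "organic", "adwords", "email", "organic"],
    "converted":      [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

by_source = (sessions
             .groupby(["traffic_source", "variant"])["converted"]
             .agg(conversions="sum", visitors="count"))
by_source["conv_rate"] = by_source["conversions"] / by_source["visitors"]
print(by_source)
```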

Ideally, a high-performing landing page will do well across all traffic sources, but this isn’t always the case, and sometimes it’s impossible due to the nature of the audiences. This information is important, especially if you know which traffic sources you plan to scale in the future and which may be temporary. Never optimize conversions for traffic sources that are temporary or that you expect to grow more slowly than others.

6. Ignoring micro conversions

Maximizing “final” conversions may be the primary goal of CRO, but ignoring the role that micro conversions play in picking up those final conversions is a mistake. So, to start, it’s imperative that you actually measure them in the first place.

Micro conversions include things like:

  • Clicking from the homepage to a product page
  • Adding an item to the shopping cart
  • Starting the checkout process

Micro conversions act as bottlenecks on the way to final conversions, and hoping to improve final conversions without widening those bottlenecks can be counterproductive.
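One simple way to find the bottleneck is to compute step-to-step conversion rates down the funnel rather than only the end-to-end rate. A small sketch with fabricated counts:

```python
# Fabricated funnel counts; replace with your own event data.
funnel = [
    ("homepage visit",    50_000),
    ("product page view", 12_500),
    ("add to cart",        2_000),
    ("checkout started",   1_600),
    ("purchase",             400),
]

# Print the conversion rate between each pair of adjacent steps.
for (prev_step, prev_n), (step, n) in zip(funnel, funnel[1:]):
    print(f"{prev_step} -> {step}: {n / prev_n:.1%}")
# The weakest step-to-step rate is usually the bottleneck worth testing first.
```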

Here are a few ways that micro conversions can botch your statistics if you aren’t careful:

If a micro conversion further down the funnel is preventing sales, then it’s possible no amount of changes to a page further up the funnel will improve sales. For that reason, if you ignore micro conversions, it’s possible that no split test will yield definitive results. But if a page is increasing micro conversions further up the funnel, it should often be considered a winner, and the other micro conversions should be dealt with separately.

On the flip side, improving micro conversions in one place can simply decrease micro conversions later on. For example, moving shipping costs to the end of the checkout process might improve micro conversions like entering credit card info, but ultimately do no good because it increases shopping cart abandonment. Merely looking at final conversions will give you an incomplete picture in which nothing seems to change, when in fact the lesson may be that shipping should be “free” and incorporated into the product price.

There may also be cases where decreasing micro conversions actually increases later micro conversions or final conversions. For example, addressing common user objections up front may reduce initial click-throughs by adding friction at the first button, but it may also reduce hesitation during every later micro conversion and ultimately result in increased sales.

It’s important to measure the practical and statistical significance of your micro conversions, as well as the way that they play off of each other, in order to optimize your conversion funnel.

Conclusion

While understanding these six ways that statistics and CRO play off of each other may not compare with a PhD in stats, it will give you an edge over many of your competitors in the industry.

Understand statistical significance and how it’s different from practical significance, factor in traffic source, don’t neglect micro conversions, look out for bots, and check yourself on causation versus correlation. Systematize those values into your process and you will go far.

Subscribers can download Econsultancy’s annual Conversion Rate Optimisation report, in association with RedEye, for more on CRO trends.