Advert testing is critical to the continuous improvement of an AdWords campaign. It’s a reasonable bet that many of your competitors are testing new adverts, and hence improving their click-through rates and conversion volumes over time.

Where do you think those clicks and sales are coming from?

If your competitors are steadily increasing their click through rate, and you aren’t, your own CTRs are likely to drop steadily over time, even if your adverts remain the same.

And with that is likely to come a gradually falling Quality Score, leading to a higher cost per click or lower positions. It’s like standing on a down-escalator: the only way to stay in the same place is to keep trying to move up.

Clearly, you need to improve your adverts over time, and that means A/B testing. The principle is simple: you write a new advert, display it half the time (with the established advert appearing the rest of the time), and once the difference in performance is significant, you keep the better advert.

There are many complications you can add to this: looking at conversions or revenue, displaying the established advert more often, and running more than two adverts at a time are just a few.

Ultimately though, if you want to be confident which advert is better, sooner or later you’ll have to run a significance test to determine whether the difference in performance is ‘genuine’ or simply the result of random variation. But it’s in the testing that things start to become a bit more complicated.

The test that most people use is Student’s t-test, which compares the average performance of each advert against the typical variation seen from day to day, and estimates how likely it is that the gap between the two is due to random variation, as opposed to a genuine difference in performance.
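As a rough illustration of what such a test computes, here is a minimal sketch of Welch's version of the t-test applied to daily click-through rates. The daily figures and the function name `welch_t` are made up for the example, and it stops at the t statistic and degrees of freedom rather than a full p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance (n - 1 denominator) of each advert, scaled by sample size.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical daily click-through rates (clicks / impressions) for one week.
ctr_established = [0.041, 0.038, 0.044, 0.040, 0.037, 0.042, 0.039]
ctr_new = [0.046, 0.043, 0.049, 0.044, 0.041, 0.048, 0.045]

t_stat, dof = welch_t(ctr_new, ctr_established)
print(f"t = {t_stat:.2f}, degrees of freedom about {dof:.1f}")
```

The larger the t statistic relative to the critical value for those degrees of freedom, the less plausible it is that the gap between the adverts is down to chance.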

But this test assumes that the average doesn’t change. In general, if you’re looking at conversion rates, this isn’t a problem, but the click through rate of adverts does change over time, and it’s not random.

Even if you don’t change your bids for the duration of the test, you’re still going to see some variation. Different keywords in the same Ad Group may get more or fewer searches each day, so if they have different click-through rates, the overall click-through rate will vary.
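To make the keyword-mix effect concrete, here is a small sketch with invented numbers: two keywords whose individual click-through rates never change, but whose share of searches differs from day to day.

```python
# Hypothetical ad group with two keywords whose click-through rates never change.
KEYWORD_CTR = {"brand term": 0.08, "generic term": 0.02}


def blended_ctr(impressions):
    """Overall CTR for a day, given the impressions each keyword received."""
    clicks = sum(KEYWORD_CTR[kw] * n for kw, n in impressions.items())
    return clicks / sum(impressions.values())


# Same total search volume, different mix of keywords (assumed figures).
monday = {"brand term": 500, "generic term": 1500}
saturday = {"brand term": 200, "generic term": 1800}

print(f"Monday:   {blended_ctr(monday):.2%}")
print(f"Saturday: {blended_ctr(saturday):.2%}")
```

With these assumed figures the overall click-through rate falls from 3.5% to 2.6%, even though neither keyword nor the advert itself has changed.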

The point of all of this is that if a significance test throws up a 95% significance level (i.e. the probability that the difference isn’t due to random variation is 95%), the true significance is probably greater, because the test treats these non-random swings as if they were random noise. This being the case, you can probably accept a lower significance level than you normally would.

Interestingly, this is probably a good idea anyway: waiting for a significant difference on a test may not yield the best results in the long run.

There is a trade-off when you decide how long to wait before declaring a winner in the test. On the one hand, you don’t want to choose the wrong winner. On the other, the sooner you pick a winner, the more tests you can get through.

The longer a test takes to give a result, the closer the two adverts are likely to be in performance. So the longer a test goes on, the lower the cost of choosing the wrong advert is likely to be.

So perhaps the answer is this: once an advert has a healthy lead over another, don’t wait for a significant result. End the test, take the (probable) improvement, and move on to the next test.

Even if you choose the wrong winner, the cost is likely to be low, and more than offset by being able to run more tests. Likewise, if there is no clear leader after a reasonable period of time, move on to the next test. It probably doesn’t matter which you pick.

As a general rule, if two adverts ‘draw’, I’d go with the established version, as it has proven a good level of performance over a longer period, so it’s arguably the safer option, but that’s just personal preference.

Shane Quigley

Published 25 October, 2011 by Shane Quigley

Shane Quigley is Co-founder at Epiphany Solutions and a contributor to Econsultancy.

Comments (6)

Richard Fergie

When you start to think about the cost of continuing to run a test, rather than just whether the result is statistically significant, the problem you are trying to solve changes.

The new problem is called the Bandit Problem and (as ever) Wikipedia has a pretty good introduction.

Essentially, it boils down to balancing exploration with exploitation. Should we exploit what we think is the best solution based on the information we have so far? Or should we explore further so that we can be more sure about which option is best?

An A/B test runs entirely in explore mode until statistical significance is reached, after which it switches 100% to exploit mode. As you point out, this is suboptimal if there is an obvious large difference in performance between A and B.
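For anyone who wants to see the explore/exploit idea in action, here is a minimal sketch of Thompson sampling, one well-known approach to the bandit problem (not necessarily what AdWords does). The true click-through rates, the uniform prior and the function names are all assumptions for the illustration.

```python
import random


def choose_advert(stats, rng=random):
    """Pick an advert by Thompson sampling.

    stats maps advert name -> (clicks, impressions). Each advert's CTR gets a
    Beta(1 + clicks, 1 + non-clicks) posterior, i.e. a uniform prior (an
    assumption), and the advert with the highest sampled CTR is shown.
    """
    draws = {
        name: rng.betavariate(1 + clicks, 1 + impressions - clicks)
        for name, (clicks, impressions) in stats.items()
    }
    return max(draws, key=draws.get)


def simulate(true_ctr, impressions, seed=0):
    """Serve `impressions` searches and return (clicks, impressions) per advert."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_ctr}
    for _ in range(impressions):
        name = choose_advert(stats, rng)
        clicks, shown = stats[name]
        clicked = rng.random() < true_ctr[name]
        stats[name] = (clicks + clicked, shown + 1)
    return stats


# Hypothetical true click-through rates, unknown to the algorithm.
result = simulate({"A": 0.03, "B": 0.06}, impressions=20_000)
for name, (clicks, shown) in result.items():
    print(f"{name}: shown {shown} times, {clicks} clicks")
```

Rather than a hard switch from explore to exploit, the split of impressions drifts towards the better advert as evidence builds up, while the weaker one still gets an occasional showing.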

This is the sort of problem that AdWords optimised delivery should be solving for us. But I'm not sure how well it does it :-)

I have a few ideas about how to do this for A/B testing of landing pages. Discuss further on here, find me on Twitter or use the contact form on my website if you want to talk about this more.

almost 7 years ago

Simon Williams

Simon Williams, Group Search Manager at Carat Media

With the use of proprietary tools, the significance test becomes a great deal easier, with revenue and margin feeds being pulled in dynamically against each advert. But at the SME level of AdWords, testing against these metrics becomes almost impossible, on a sliding scale: as an account grows, the creativity becomes weaker unless it is fuelled with resource, ending in a templated cycle of seasonal creative. Unfortunately, we see a decline in testing and a push on CPC.

Google does indeed try to simulate a variance test @RichardFergie - but again, CTR doesn't always result in ROI. This blog more than anything really illustrates the necessity to look beyond search and begin to understand a client's business objectives. At the end of the day, the board doesn't care about CTR - it's all about the profit.

almost 7 years ago


Richard Fergie

@Simon can you explain what a variance test is? I agree with you 100% that we really need to optimise for profit.

The point I was trying to make is that we need to move beyond significance testing in order to get the best return. I think Shane was moving in this direction in his post, but he maybe didn't realise that a lot of mathematical work has already been done on solving this problem in an effective way.

almost 7 years ago



Testing is essential as a means to work out how best to market what it is you do, or what it is you sell (whether it is a product or service). It's then essential to make sure you monitor and make use of the data generated.

To help your SEO campaign go in the correct direction, record what changes you make and label the results with them clearly.

almost 7 years ago

Shane Quigley

Shane Quigley, Co-Founder at Epiphany

Thanks Richard,

It's clear that unless you have a very high click through rate (and conversion rate, if you are taking this into account when choosing the best advert), the standard normal approximation for a confidence interval isn't ideal.
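As an illustration of that point, here is a minimal sketch comparing the normal-approximation interval with the Wilson score interval, one commonly used alternative for small proportions. The click and impression counts are invented, and `z=1.96` assumes a 95% interval.

```python
from math import sqrt


def normal_interval(clicks, impressions, z=1.96):
    """Normal-approximation (Wald) interval for a click-through rate."""
    p = clicks / impressions
    margin = z * sqrt(p * (1 - p) / impressions)
    return p - margin, p + margin


def wilson_interval(clicks, impressions, z=1.96):
    """Wilson score interval, which behaves better for small proportions."""
    p = clicks / impressions
    denom = 1 + z ** 2 / impressions
    centre = (p + z ** 2 / (2 * impressions)) / denom
    margin = z * sqrt(p * (1 - p) / impressions
                      + z ** 2 / (4 * impressions ** 2)) / denom
    return centre - margin, centre + margin


# Hypothetical advert: 3 clicks from 200 impressions (1.5% CTR).
for label, interval in [("normal", normal_interval(3, 200)),
                        ("Wilson", wilson_interval(3, 200))]:
    low, high = interval
    print(f"{label:>6}: {low:.4f} to {high:.4f}")
```

With so few clicks, the normal approximation produces a lower bound below zero, which is impossible for a click-through rate, while the Wilson interval stays between 0 and 1.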

Even using a more accurate test, I'm not convinced that waiting for a statistically significant result is in the best interests of the account in the long run - it seems slightly contrary that the longer an advert test takes to run, the less value the result will carry.

But in terms of developing a solution to identify the stopping point for a test, I'm not sure how straightforward this would be in practice.

Any model that assumed that the click through (or conversion) rate was constant for a given advert could potentially give misleading results, particularly if your bids change over the test period (so your advert position moves).

We don't tend to use Adwords Optimized Delivery as it seems to make very aggressive decisions alarmingly quickly sometimes, and there's little or no explanation of how it works.

@Simon Understanding the impact on the conversion rate and average order value of your advert test is a whole different kettle of fish.

For most advert tests, waiting to see whether your conversion rate or order values have been impacted by a new advert could extend the length of a test by 20x or more - and if the change is a small one, it's unlikely to make a dramatic change to these values.

For a new advert to generate a different conversion rate, it must (intentionally or otherwise) drive a different 'type' of visitor to the site. This is quite possible if you are testing a 'cheap' message against a 'quality' message, but unlikely if you are testing the impact of capitalisation.

When we start a new advert test, we specifically ask whether the variations are likely to drive different types of visitor to the website, and on the basis of this judgement, we decide on what metric to choose the winning advert.

almost 7 years ago

Shane Quigley

Shane Quigley, Co-Founder at Epiphany

Having considered the problem further, it's becoming clear that the decision of when to stop an advert test is strongly related to the larger picture of your advert testing history.

If your new advert versions are based on the current winner, then it may be the case that every test version has a 50% chance of being more effective than the established version.

However, it's likely that over time, as your established advert is improved, new adverts are less and less likely to be better.

Up to a point, this isn't a problem in itself - even if a new version of the advert only has a 20% chance of beating the original, the advert test is probably still worthwhile, as the possible long-term benefits outweigh the probable short-term costs.

But of greater concern is the increased likelihood that a test version winning could be a false positive (the advert is worse, but random variation makes it look better).

So, the less likely the test is to win, the longer you need to wait to determine which advert is more effective. Potentially, making decisions too quickly could yield more 'false' improvements than 'real' ones, resulting in long-term reductions in your objective value.
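A quick sketch using Bayes' rule shows how this plays out. The 60% and 30% figures below are assumptions chosen purely to illustrate a hurried test, not measurements from any account.

```python
def chance_win_is_real(prior_better, power, false_positive_rate):
    """Probability that a declared winner really is better (Bayes' rule).

    prior_better:        chance the new advert is genuinely better
    power:               chance a genuinely better advert wins the test
    false_positive_rate: chance a genuinely worse advert wins by luck
    """
    true_wins = prior_better * power
    false_wins = (1 - prior_better) * false_positive_rate
    return true_wins / (true_wins + false_wins)


# Hypothetical quick test: a better advert is spotted 60% of the time,
# and a worse advert wins by luck 30% of the time.
for prior in (0.5, 0.2):
    posterior = chance_win_is_real(prior, power=0.6, false_positive_rate=0.3)
    print(f"prior {prior:.0%} -> {posterior:.0%} of wins are real")
```

Under these assumed numbers, two thirds of wins are genuine when a new advert has a 50% chance of being better, but only one third when that chance drops to 20%, so the same quick test starts producing more false improvements than real ones.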

Since running adverts with low probability of success for longer increases the average cost of the test (in terms of performance), this raises the question of when to stop testing altogether - another multi-armed bandit problem...

almost 7 years ago
