Advert testing is critical to the continuous improvement of an AdWords campaign. It’s a reasonable bet that many of your competitors are testing new adverts, and hence improving their click through rates and conversion volumes over time.

Where do you think those clicks and sales are coming from?

If your competitors are steadily increasing their click through rate, and you aren’t, your own CTRs are likely to drop steadily over time, even if your adverts remain the same.

And with that is likely to come a gradually declining Quality Score, leading to higher costs per click or lower positions. It’s like standing on a down-escalator: the only way to stay in the same place is to keep trying to move up.

Clearly, you need to improve your adverts over time, and that means A/B testing. The principle is simple: you write a new advert, display it half the time (with the established advert appearing the rest of the time), and once the difference in performance is significant, you keep the better advert.

There are many refinements you can add to this: looking at conversions or revenue, displaying the established advert more often, and running more than two adverts at a time are just a few.

Ultimately, though, if you want to be confident about which advert is better, sooner or later you’ll have to run a significance test to determine whether the difference in performance is ‘genuine’ or simply the result of random variation. And it’s in the testing that things start to become a bit more complicated.

The test that most people use is Student’s t-test, which compares the average performance of each advert and estimates how likely it is that the variation you see from day to day is due to random chance, as opposed to a genuine difference in performance.
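
For what it’s worth, here’s a minimal sketch of that comparison in Python, using Welch’s variant of the t-test from SciPy on a week of invented daily CTR figures (both the numbers and the 0.05 threshold are assumptions for illustration, not real data):

```python
# A rough sketch: comparing daily CTRs for two adverts with a t-test.
# All figures below are invented purely for illustration.
from scipy import stats

# Daily click through rates (clicks / impressions) for each advert
ctr_established = [0.031, 0.028, 0.034, 0.029, 0.033, 0.030, 0.027]
ctr_challenger  = [0.035, 0.031, 0.038, 0.033, 0.036, 0.034, 0.032]

# Welch's t-test: doesn't assume the two adverts have equal variance
t_stat, p_value = stats.ttest_ind(ctr_challenger, ctr_established,
                                  equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A p-value below 0.05 corresponds to the usual 95% significance level.
# Given the argument below about this test being conservative for CTRs,
# you might choose to accept a somewhat looser threshold.
if p_value < 0.05 and t_stat > 0:
    print("Challenger wins")
else:
    print("No significant difference yet")
```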

But this test assumes that the underlying average doesn’t change over the course of the test. If you’re looking at conversion rates, this generally isn’t a problem, but the click through rate of an advert does change over time, and not just randomly.

Even if you don’t change your bids for the duration of the test, you’re still going to see some variation. And different keywords in the same Ad Group may get more or fewer searches each day, so if they have different click through rates, the overall click through rate will vary.
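
A quick made-up example shows the keyword-mix effect: two keywords keep exactly the same individual click through rates, yet the overall figure moves simply because the split of searches between them changes.

```python
# Illustration (made-up figures): the same keywords with unchanged CTRs
# give a different overall CTR when the mix of searches shifts.

def blended_ctr(keywords):
    """Overall CTR for a list of (impressions, ctr) pairs."""
    total_clicks = sum(imps * ctr for imps, ctr in keywords)
    total_imps = sum(imps for imps, _ in keywords)
    return total_clicks / total_imps

# Day 1: the high-CTR keyword gets most of the traffic
day1 = [(800, 0.05), (200, 0.01)]   # (impressions, CTR) per keyword
# Day 2: the mix shifts towards the low-CTR keyword
day2 = [(300, 0.05), (700, 0.01)]

print(f"Day 1 overall CTR: {blended_ctr(day1):.2%}")  # 4.20%
print(f"Day 2 overall CTR: {blended_ctr(day2):.2%}")  # 2.20%
```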

The point of all of this is that if a significance test reports a 95% significance level (i.e. the probability that the difference isn’t due to random variation is 95%), the true significance is probably greater, because the test is treating systematic day-to-day variation, which affects both adverts, as if it were random noise. That being the case, you can probably settle for a lower reported significance level than you normally would.

Interestingly, this is probably a good idea anyway: waiting for a significant difference may not yield the best results in the long run.

There is a trade-off when you decide how long to wait before declaring a winner. On one hand, you don’t want to choose the wrong winner; on the other, the sooner you pick a winner, the more tests you can get through.

Given that the longer a test takes to give a result, the closer the two adverts are likely to be in performance, the cost of choosing the wrong advert is probably lower the longer a test goes on.

So perhaps the answer is this: once an advert has a healthy lead over the other, don’t wait for a significant result; end the test, take the (probable) improvement, and move on to the next test.

Even if you choose the wrong winner, the cost is likely to be low, and more than offset by being able to run more tests. Likewise, if there is no clear leader after a reasonable period of time, move on to the next test. It probably doesn’t matter which you pick.

As a general rule, if two adverts ‘draw’, I’d go with the established version: it has proven a good level of performance over a longer period, so it’s arguably the safer option. But that’s just personal preference.
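
Pulling those rules of thumb together, a rough sketch might look like the following (the lead threshold, minimum click count and maximum duration are arbitrary assumptions for illustration, not recommendations):

```python
# A rough sketch of the 'healthy lead' rule of thumb described above.
# All thresholds are arbitrary assumptions chosen for illustration.

MIN_CLICKS_PER_ADVERT = 100   # don't judge on a handful of clicks
HEALTHY_LEAD = 0.15           # one CTR at least 15% better than the other
MAX_TEST_DAYS = 30            # stop waiting for a clear result after this

def pick_winner(est_clicks, est_imps, chal_clicks, chal_imps, days_running):
    est_ctr = est_clicks / est_imps
    chal_ctr = chal_clicks / chal_imps

    # Not enough data yet: keep the test running
    if min(est_clicks, chal_clicks) < MIN_CLICKS_PER_ADVERT:
        return "keep testing"

    # Healthy lead in either direction: take the probable improvement
    if chal_ctr >= est_ctr * (1 + HEALTHY_LEAD):
        return "challenger"
    if est_ctr >= chal_ctr * (1 + HEALTHY_LEAD):
        return "established"

    # No clear leader after a reasonable period: call it a draw and,
    # as a matter of preference, stick with the established advert
    if days_running >= MAX_TEST_DAYS:
        return "established"

    return "keep testing"

print(pick_winner(est_clicks=120, est_imps=4000,
                  chal_clicks=150, chal_imps=4100, days_running=14))
# -> 'challenger' (roughly 3.66% vs 3.00%, about a 22% lead)
```

The exact numbers matter far less than the principle: act on a healthy lead, and don’t let an inconclusive test drag on when you could be running the next one.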