You know what one of my favourite feelings in the world is? 

Just to clarify, I mean at work. More specifically, one of the best feelings you can get when doing email marketing.

I love the feeling I get when one of my subject line tests teaches me something about my audience. What can I say? I’m a super cool dude who gets excited when a subject line delivers amazing response. 

That moment when the opens, clicks and conversions start showing up and you’re like, “I’m the king/queen of email!” 

Yeah, I know you know that feeling too.

But that feeling is rare and fleeting, because most marketers completely screw up their email subject line split tests. 

In this post, you’ll learn how to feel pleasure, or if you’d rather, how to avoid the pain of crappy split tests.

1. Not knowing what you’re testing until the last second

How often does this happen? You spend hours constructing a beautiful email campaign.  It looks awesome, it’s responsive, and you’re convinced it’s the best looking email of all time.

You then spend more time with your data team figuring out the perfect segment to send it to. You’ve done your propensity modelling, your demographic selections, and whatnot. It’s the perfect group.

Then you upload the creative into your ESP, fight with its HTML editor for a while (standard), test it out in a few email clients, and pat yourself on the back for a job well done.

And now, you think, “Oh hey, I should really split test the subject line. That’s what you’re supposed to do, right?”

So you think of one line, loosely based upon what you think probably worked last time. And then think of a second one. And you click launch.

And you make a bunch of money from the email because, well, email works. 

But, here’s the thing: split tests, be they A/B or A/B…Z, shouldn’t be viewed as a quick way to make a few more bucks.

They should be viewed as controlled experiments. Because controlled experiments are how we learn about the world around us.

So check out this example of two subject lines tested in a recent campaign for a well-known publisher (I won’t name the publisher as they’re a client of mine… and I intend to keep them as a client, so naming and shaming is not a great idea).

A: “Subscribe now to <brand name> to save up to $1.50 per issue!”

B: “For the latest <industry> trends subscribe now and get the best product reviews around”

Any guesses which one of the above won? The answer is A.

Why did it win? Here are just a few of the reasons version A could have won:

  • 'Subscribe' works better earlier than later in the subject line.
  • Exclamation points incite action.
  • Mentioning the price is good.
  • Including the brand name is good.
  • Including the industry name is bad.
  • People don’t care about the content in the publication.
  • 'Save' is a better word than 'get'.
  • 'Latest' is a bad word.
  • Using 15 more characters is bad.
  • Leading with a second-person verb conjugation is good.
  • Leading with a prepositional clause is bad.
  • Using the word 'to' three times in a subject line gets awesome results.
  • Using the word 'and' is bad.

I could go on and list off a few hundred more potential variables. How many more can you come up with?

This is the thing. By doing this split test, they learned nothing. 

Most people conduct split tests without a robust experimental design methodology. By doing this, you’re ignoring the whole point of split testing: to learn about the world around you, or, more specifically, to learn what drives your audience to respond to your messaging.

The subject line is one of the few causal variables you can control at the point of launch. If you follow a poor testing methodology, you run the risk of either learning nothing, or thinking you’ve learned things that aren’t true.

2. Focusing on one-offs, not longitudinal gains

OK, so let’s go back to the example above. Subject line A got about 1% more opens than B. Fantastic!

Most people will, at this stage, produce a confidence metric to determine statistical significance, and then say something like, “We are 95% confident that A is better than B”.

So first of all, this is an incorrect interpretation. To be accurate, a 95% confidence level means something like this: if there were genuinely no difference between A and B, a result at least this extreme would show up in only around 5% of repeated experiments. It does not mean there is a 95% probability that A really is better than B.

Perhaps a slight semantic difference, but an important one.
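For the curious, here’s roughly how that headline confidence figure tends to be produced in the first place. This is a minimal sketch in Python with made-up open counts, not the publisher’s actual numbers, and your ESP may well do something different under the hood:

    # Two-proportion z-test on open counts (all numbers are hypothetical)
    from statsmodels.stats.proportion import proportions_ztest

    opens = [2100, 2010]    # opens for subject lines A and B
    sends = [50000, 50000]  # recipients per variant

    z_stat, p_value = proportions_ztest(opens, sends)
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
    # A p-value under 0.05 is what gets reported as "95% confidence";
    # it says nothing about how big, or how repeatable, the difference is.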

The interpretation isn’t the main issue, however. The main issue is the variance of variance.

Whoah. That’s a mouthful. 

To illustrate, try this little experiment. Run an A/B test where everything in both A and B is exactly the same, sent to random samples from your list. Same creative, same subject line, same everything.

Now, of course, these should give pretty much identical results… but sometimes they won’t. You’ll be surprised how different the results can be.

In large binomial distributions with high natural variance, the important thing to look at is the variance of the variance, not the confidence of one hypothesis being proven or disproven.

(Note: a binomial distribution describes data where each trial has only two possible outcomes – for example, heads or tails. Or, in an email context, opens or doesn’t open, converts or doesn’t convert).
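If you want to see that natural variance for yourself without risking a real send, here’s a tiny simulation of the A/A test described above. The 20% true open rate and 10,000-recipient cells are assumptions for illustration, not data from any real campaign:

    # Simulate repeated A/A tests: both cells share the SAME true open rate
    import numpy as np

    rng = np.random.default_rng(42)
    true_open_rate = 0.20   # assumed identical for both cells
    cell_size = 10_000      # recipients per cell

    for run in range(5):
        a = rng.binomial(cell_size, true_open_rate) / cell_size
        b = rng.binomial(cell_size, true_open_rate) / cell_size
        print(f"run {run + 1}: A = {a:.2%}, B = {b:.2%}, 'lift' = {a - b:+.2%}")
    # Identical treatments, yet 'lifts' of around half a point appear purely
    # by chance. That chance-driven spread is the variance to worry about.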

What most people care about is how well A did vs. B, and the statistical significance of this result.  But this isn’t what you should care about if you want to learn about your audience.

Without considering and comparing the amplitude of variance across a series of tests, you run the risk of thinking something is more important than it actually is. 

Looking for one-off wins (A vs B) is great if you’re a meth addict looking for your next hit. But the goal should be learning over time, so you can apply the results in a robust and profitable manner.

What you should do is run a series of controlled experiments over time, and then learn from the longitudinal results, not just individual data points.

This requires a lot of planning, a lot of number crunching, and a lot of patience.  But it’s the only sound way to provide durable, predictable revenue uplift.  
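As a sketch of what that longitudinal bookkeeping might look like, imagine logging every test in which a subject line mentioned a price, then tracking the lift that factor delivers across campaigns rather than crowning a winner each time. Every figure below is invented:

    # Pool one factor ("mentions a price") across several hypothetical tests
    import numpy as np

    # (opens with factor, sends with factor, opens without, sends without)
    tests = [
        (2150, 50000, 2050, 50000),
        (1980, 45000, 1900, 45000),
        (2300, 52000, 2310, 52000),
        (1750, 40000, 1640, 40000),
    ]

    lifts = []
    for w_opens, w_sends, wo_opens, wo_sends in tests:
        lift = w_opens / w_sends - wo_opens / wo_sends
        lifts.append(lift)
        print(f"single-test lift: {lift:+.2%}")

    print(f"mean lift: {np.mean(lifts):+.2%}, spread: {np.std(lifts, ddof=1):.2%}")
    # One test can mislead; the mean and spread across many tests tell you
    # whether the factor is a durable driver or just noise.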

3. Confusing correlation for causation (aka the eighth deadly sin)

Have you ever been to Israel? Well, here’s an interesting fact.

Now, there are many different viewpoints on whether or not Israeli hummus is better than that of neighbouring nations. I’m not getting involved in that debate. 

Anyways, the Israeli diet can broadly be defined as Mediterranean. Lots of olive oil, fresh fruits and veg, and healthy fish. This diet has been widely connected with lower incidence of coronary heart disease.

And yet, Israeli Jews have a higher than average incidence of heart troubles.

So, for years, we’ve been told that the Mediterranean diet is good for a healthy heart. We've been told that the link is obviously causal.

Yet, an outlier like this shows that the link is not necessarily causal at all. 

It is certainly a strong correlation. The Mediterranean diet may well reduce the odds of getting heart disease. But there are clearly other variables at work here: smoking rates, genetic factors, exercise frequency and the like. In fact, it could even be that the diet contributes to heart disease, and the assumed benefit is misleading!

For those of you who skipped statistics classes in college, let me refresh your memory:

A correlation occurs when variable X is related to variable Y. For example, when you see puddles in the street, it is often raining.

Causality occurs when variable X causes Y. For example, when it is raining, it causes puddles in the street.

See the difference? Puddles are related to rain, and rain causes puddles. 
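To see how easily this trips people up in email, here’s a toy simulation in which one subject line appears to “cause” more opens, but the real driver is a hidden variable: send time. The whole set-up is invented for illustration:

    # A hidden confounder (send time) creates a fake subject line "effect"
    import numpy as np

    rng = np.random.default_rng(1)
    n = 20_000

    variant = rng.integers(0, 2, n)           # 0 = subject line A, 1 = B
    # A happens to go out mostly in the morning, B mostly in the evening
    morning = np.where(variant == 0,
                       rng.random(n) < 0.8,   # 80% of A sent in the morning
                       rng.random(n) < 0.3)   # 30% of B sent in the morning

    # Opens depend ONLY on send time, not on the subject line at all
    opened = rng.random(n) < np.where(morning, 0.24, 0.18)

    print(f"A open rate: {opened[variant == 0].mean():.1%}")
    print(f"B open rate: {opened[variant == 1].mean():.1%}")
    # A 'wins' convincingly, yet the subject line had zero causal effect.
    # The apparent lift is pure correlation via the confounded send time.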


So, why does this matter? Because when you run controlled experiments it’s vital that you can identify which variables are causal, and which are correlative.

Taking our subject line example from point one, what if you thought the causal factor was “people don’t care about product features in the email”? Fine, fair enough.

So then you send out another email tomorrow without any features – but you have no context with which to interpret the results from that campaign. 

You’re effectively testing because you think you should, not because you’re learning about what makes your audience tick.

Am I right or am I wrong?

Who knows. But one thing I’m curious about is common practice across the industry: where people are with their email subject line split testing strategies.

So, do me a favour. If you do email marketing, take a couple of minutes and fill out this survey: The State of Split Testing

It’s the industry’s first ever look at how people run controlled subject line experiments in their business. I’ll be analysing and sharing the results in a future Econsultancy blog post – you’ll learn how you stack up against your peers, and where there are areas of opportunity. 

Don’t screw up your split tests!

With a bit of rigour and some methodological planning, you can be your business’s subject line superhero. Or you can be a subject line meth head, chasing that fleeting gain week on week.

It's up to you. All I know is that thousands of people much smarter than me have learned through experience how to use experimental design to learn about human behaviour. 

Why should we be any different?

PS – That survey link is here.  Fill it out! 


Published 10 October, 2014 by Parry Malm

Parry Malm is the CEO of Phrasee and a contributor to Econsultancy. Connect with him on LinkedIn, Twitter or Google+.



Comments (14)


Pete Austin, CINO at Fresh Relevance

Here are two reasons to not trust "95% confidence" values:

(1) It is very difficult to be sure you're testing what you think you're testing, i.e. you've less control over "all else being equal" than you thought. For example, maybe your A and B samples were sent at slightly different times and what you actually tested was time of day.

(2) Data dredging returns a lot of false positives, so only trust significance values if you decided what you were going to check for in advance of the test (which is usually not the case for A:B tests).
http://en.wikipedia.org/wiki/Data_dredging

These are major issues for science and huge numbers of experimental results are actually bogus. For example:
http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble

So what should you do? Simple: reproduce the result.

If you are going to depend on the result of your research long term, you *MUST* reproduce it. Run a different experiment to check for the same effect, at a later date, and compare the significance of the result from this experiment.
http://en.wikipedia.org/wiki/Reproducibility

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@Pete you're absolutely spot on dude. I quite often see case studies showing 99.7% confidence that A is better than B, and assuming that it's inherently true that A rules and B sucks.

But that's not what the point of frequentist statistics is. The point is to make a hypothesis, and yes, reproduce it.

The problem here is the entire space of potential subject lines is huge, so reproducing a subject line test on the same audience over and over has functional limitations.

If we were talking about approving use of a pharmaceutical, yes, it's important, but we're talking about email marketing. So, some shortcuts are acceptable... what's important is marketers understand the risks involved with said shortcuts, and don't assume that "95% confidence" means that what they think to be true is... well... true.

about 3 years ago


Jacques Corby-Tuech, Marketing at ETX Capital

Testing a sales oriented email for a subject line uplift seems fairly pointless in general. If the objective of the email is to sell, you should be testing the overall conversion rate.

I've carried out tests previously where the email with the lower open rate performed better in terms of the overall conversion rate, with the hypothesis being that the subject line primed people to buy, rather than just piqued their curiosity to see what the email was about.

It's always important to remember what you're actually trying to achieve with your email campaigns, and use tests to try to achieve that objective.

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@Jacques There’s no reason why you can’t use conversions as your dependent variable. The main problem is one of sample size (if conversion rates are low relative to universe size, it limits your experiment boundaries), but that’s a whole other blog post :)

I simply referred to open rates as an illustrative metric, but I agree, what matters is the money. If one subject line causes more conversions then for sure - that's the winner!

about 3 years ago


Iain Russell, App Marketing Project Manager at Immediate Media

I'll be interested to see which survey link gets the most clicks and which one has a higher percentage of survey completion ;)

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@iain you can put a wager on it at William Hill

about 3 years ago


Kate Gowers, Principal Digital Consultant at Ogilvy & Mather UK Enterprise

RE causation vs correlation - you'll like this. http://www.tylervigen.com/

The mention of that particular mix-up reminds me of a logic flaw (related but slightly different - this is about drawing spurious conclusions without appropriate evidence): "All fish swim. Fred swims. Therefore Fred's a fish."

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@kate omg I wish I had that before I posted this! Way better examples!

about 3 years ago


Jordie van Rijn, email marketing specialist at emailmonday

Actually in the #1 example, you didn't learn anything, but..... if they deployed a winning version they could have still gotten a conversion bump. Nothing learned, dollars earned. Not that bad. A big problem with the example you gave is that they only used 2 versions.

But there are still enough reasons not to test:
http://www.emailmonday.com/email-testing-reasons

PS: Parry I did like your article, it was not an exception to the rule ;)

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@jordie yep, version A did make a little bit more money. But, it's a short-lived success.

Best case, one should make more money, but ALSO learn something about their audience. Otherwise, you're constantly guessing and never learning.

I understand the reasons for not testing, and that’s certainly one strategy - if it takes too much time to design experiments, or if you lack the internal skill set to design a robust series of experiments, fair enough. But it’s more a matter of over-arching marketing objectives: do you want to learn, or do you want to constantly swim upstream against the unpredictability of consumers?

Anyways cheers for commenting Jordie! Always interested in hearing differing points of views :)

about 3 years ago


Elite Marriott, Consultant at Digital Heart

Brilliant post and great discussion. I would LOVE to run tests with smaller clients and smaller datasets - is it even worth running tests on a database of 2000?

about 3 years ago


Pete Austin, CINO at Fresh Relevance

@Elite: is it even worth running tests on a database of 2000?

Yes. I've had good results with samples of 1000. Marketing decisions don't need a very high likelihood of success and even 80% confidence is fine when the alternative is a guess.

As a rule-of-thumb, with small sample sizes, you can be somewhat confident that A is better than B if it gets 8 additional responses.

But the results from such small samples are unlikely to be reliable enough to combine together into a knowledge bank, along the lines suggested by OP.

about 3 years ago


Parry Malm, CEO at Phrasee Ltd.

@ellie like Pete says, it is absolutely worth doing. The required sample size for a low confidence binomial hypothesis test isn't huge.

But I would disagree with Pete on the longitudinal analysis point. There is a lot you can learn from building a model to predict things for your audience. Having less data will make your predictions more volatile, but still less so than guessing and praying every time you click launch :)
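To make the sample size point concrete, here’s a rough back-of-the-envelope power calculation. The baseline open rate, the lift worth detecting, and the relaxed confidence level are all assumptions for illustration, not anyone’s real numbers:

    # Recipients per cell needed to detect an assumed lift at ~80% confidence
    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline = 0.20    # assumed baseline open rate
    lift = 0.05        # absolute lift worth detecting (20% -> 25%)
    effect = proportion_effectsize(baseline + lift, baseline)

    n_per_cell = NormalIndPower().solve_power(effect_size=effect,
                                              alpha=0.20,   # ~80% confidence
                                              power=0.8,
                                              alternative='larger')
    print(f"recipients needed per cell: {n_per_cell:.0f}")
    # A chunky lift at a relaxed confidence level is detectable on a small
    # list; a tiny lift at 95% confidence generally is not.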

about 3 years ago


Elite Marriott, Consultant at Digital Heart

Cool, thanks guys. I like the idea of trying to repeat things to confirm assumptions. This goes on my to-do list!

about 3 years ago
