Multivariate testing, or MVT, is synonymous with the testing and conversion optimisation industry - not forgetting that other inherently flawed three-letter acronym, CRO.

MVT is often used by businesses as a catch-all term to describe the fact that they have a testing tool and are running tests on their website.

MVT. It sounds exciting. It sounds intelligent. It certainly sounds like there is much more to it than plain old A/B testing.

We are a conversion optimisation agency and we have never run an MVT. Why? Let me explain.

MVT: Testing with no hypothesis

What is MVT?

In short, MVT is where you test more than one variation of more than one element of a given web page simultaneously (e.g. three different headlines and three different button colours), and you let your MVT tool create a test variation for every possible combination of headline and button colour.

It is often the case that businesses have 16, 32 or even 76 versions of a page being served to visitors for any one MVT test.
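To see why the numbers climb so fast: a full-factorial MVT multiplies the variant counts of every element together. A quick sketch (the element values are invented for illustration):

```python
from itertools import product

# Hypothetical elements for illustration: three headlines and three button colours.
headlines = ["Control", "Benefit-led", "Urgency"]
button_colours = ["green", "red", "orange"]

# A full-factorial MVT serves every (headline, colour) combination
# as its own page version.
combinations = list(product(headlines, button_colours))
print(len(combinations))  # 3 x 3 = 9 page versions
```

Add a third element with four variants and you are already at 36 versions, each needing its own share of traffic.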

The main alternative to running multivariate tests is running straight A/B or A/B/n tests.

Why is MVT popular?

Mainstream promotion

Google was one of the first providers of a tool to allow website owners to run these types of tests back in 2008.

Since then, one of the industry’s biggest and most well-known testing tools has built a business on being an “enterprise MVT tool”. MVT also sticks in the mind easily, as all three-letter acronyms do - partly because the human mind likes ‘The Rule of Three’.

The term is used to describe testing in general

Often when we are speaking with senior decision-makers, they use MVT as the catch-all term for their optimisation strategy, even if they are mainly running A/B tests.

It sounds intelligent and complex

On the surface MVT sounds like there is some intelligence and science behind it. The prevailing thought is: ‘This isn’t just basic A/B testing, we are testing multiple variations at the same time. It must be good.’

What is the biggest problem with MVT?

MVT lacks a crucial ingredient when it comes to running a test - a reason why. Why are we doing this test? What is our hypothesis? What are we aiming to learn from running this test? Why have we chosen to make these changes?

Running multivariate tests ignores the skill and experience of the people planning the test hypothesis and creative execution, and instead relies on the tool to serve any number of combinations to visitors and, eventually, to tell us which of the many variations has performed best.

MVT: No why behind your tests

Many companies invest a significant amount of budget each year in enterprise tools, while investing far less in people and skills. This is sad.

It indicates that testing is seen as being about the technology, rather than what it truly should be: driven by a multi-disciplinary team creating insight-driven test hypotheses across the full spectrum of testing.

What do I mean by full spectrum testing? This means everything from simple, quick iterative testing, all the way through to testing business models and value propositions.

Why MVT should be renamed NHT

Anyone who is testing should have a hypothesis behind each test. Why are we running this test? What behaviour do we expect to change, and what impact do we expect it to have?

In its very basic form, this is how a hypothesis should be structured:

  • By changing [something] to [something else] we expect to see [this behaviour change] which will result in [the impact on our primary/secondary metric].

As businesses mature within conversion optimisation, they recognise that this basic hypothesis structure is lacking one critical element: the observations and insights which have led to creating the hypotheses in the first place.

This is a more intelligent structure for your hypothesis:

  • Based on [making these qualitative/quantitative observations and based on prior experience/test learnings], by changing [something] to [something else] we expect to see [this behaviour change] which will result in [the impact on our primary/secondary metric].

So there we have it: the intelligent, insight-driven hypothesis structure you should be using.
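To make this concrete, a hypothesis can be captured as a structured record so no test goes live without its ‘why’. A minimal sketch (the field names and example values are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """An insight-driven test hypothesis: every test carries its 'why'."""
    observation: str         # the qualitative/quantitative insight or prior learning
    change: str              # what is being changed, and to what
    expected_behaviour: str  # the behaviour change we expect to see
    expected_impact: str     # the impact on the primary/secondary metric

# Example values are invented for illustration.
h = Hypothesis(
    observation="Exit surveys show visitors can't find delivery costs",
    change="Add delivery cost to the product page price block",
    expected_behaviour="Fewer mid-checkout exits to the delivery info page",
    expected_impact="Higher checkout completion rate (primary metric)",
)
print(h.expected_impact)
```

The point is less the code than the discipline: if the `observation` field is empty, the test has no business running.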

Let’s go back to MVT and evaluate how this compares. A hypothesis structure for MVT could read something like this:

  • By creating [lots of variations to our control page] changing [a wide range of page elements such as our headline, image, copy and call to action colour] to [other random variations of headlines, images, copy and call to action copy] we expect that our testing tool [will create variations of each permutation and serve these over weeks or probably months] which will result in [at least one of the variations out-performing the original, in which case we have a success and can then produce a detailed analysis report]. 

It doesn’t quite follow. It doesn’t have intelligence. It lacks any real form of data and customer insight. Plus it will probably take three months to get anywhere near statistical significance.

This is why MVT should be renamed NHT. No Hypothesis Testing.

Four reasons we do A/B testing rather than MVT

Since I first started my business back in 2004, we have never run an MVT. We almost exclusively run A/B tests, and here are four reasons why:

1. When each of your test hypotheses is driven by intelligent user research and prior testing, and has a clear purpose of positively altering user behaviour, you can confidently create one test variation against a control with the expectation that it will deliver an increase in the primary performance metric.

2. Tests reach statistical significance far quicker than if you were running five or more variations at one time. Time is money.

Each day is an opportunity to learn something meaningful about your business's visitors and customers. Each day is an opportunity to create new ways of increasing the revenue and profit those visitors deliver for your business.

A/B testing allows you to run back-to-back tests covering the full spectrum of testing to build and maintain testing momentum, rather than relying on one big MVT running for weeks or months – with the often faint hope that one of the multiple variations out-performs your control.
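To put rough numbers on the time cost, here is a sketch using one common sample-size approximation for comparing two conversion rates (~95% confidence, ~80% power). The baseline rate, uplift and traffic figures are illustrative assumptions, not data from this article:

```python
from math import ceil

def sample_size_per_variant(p_control, relative_uplift, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per variant to detect the uplift
    (two-proportion test, ~95% confidence, ~80% power)."""
    p_variant = p_control * (1 + relative_uplift)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2)

# Illustrative assumptions: 3% baseline conversion, +10% relative uplift,
# 10,000 visitors per day to the page under test.
n = sample_size_per_variant(0.03, 0.10)
daily_visitors = 10_000

# An A/B test splits traffic two ways; a 16-combination MVT splits it sixteen ways.
for variants in (2, 16):
    days = n * variants / daily_visitors
    print(f"{variants} variants: ~{days:.0f} days to reach the sample size")
```

Under these assumptions the 16-way MVT takes roughly eight times as long as the A/B test to collect the same evidence per variant, which is exactly the "weeks or months" problem.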

3. A/B tests allow you to draw meaningful insights from the test outcomes themselves, whereas MVT doesn’t allow you to draw conclusions on which elements impacted your customers and which were just extra noise that had no impact.

Don’t underestimate the value of the learnings and customer understanding you can gain from “simple” A/B testing; they will allow you to make your testing programme more efficient, more progressive and can have big positive implications on the wider business.

4. With A/B testing everyone involved knows the reason they are doing what they are doing:

  • They have the hypothesis.
  • They may have seen the research and data.
  • They know what the goal of the test is.
  • They know that they are not just testing on a whim.
  • There is a ‘why’ behind the work they are doing.

MVT turns this process into a sometimes complex technical exercise to get all the elements and variations set up, QA'd and ready to go live.

MVT isn’t and has never been synonymous with agility. In fact, the technical complexity of setting up and QA’ing MVT can often be one of the major bottlenecks in a company’s testing strategy.

So what next for MVT?

MVT needs to go into a quiet room with its big brother CRO and take a long hard look at itself. MVT needs to realise that its time has come and gone.

Now is the time to get back to what testing and optimisation should be all about – developing intelligent hypotheses and running clean A/B tests which conclude quickly, deliver insights and learnings, and help grow businesses.

MVT should look at its bigger brother CRO



Published 25 November, 2015 by Paul Rouke

Paul Rouke is Founder & CEO at PRWD, author, creator of the CRO Maturity Audit, and a contributor to Econsultancy. You can follow him on Twitter or hook up with him on LinkedIn.



Comments (19)


Paul Randall, Senior UX Architect at Evosite

Great post Paul, and I completely agree on testing a hypothesis.

If a test wins and you don't know the reason why, it turns into luck or best judgement, neither of which are successful optimisation techniques.

over 2 years ago


Paul Rouke, Founder & CEO at PRWD

Thanks for your feedback Paul.

There are two crucial "whys" here - the why behind the test, and the why behind the test result. When we move deeper into isolation and batch testing, the why behind the test result isn't always as clear (for batch testing), but there should always be a clear, insight-driven, intelligent why behind the test, irrespective of the changes being introduced.

And yes absolutely, luck isn't a successful, growth driving optimisation technique!

over 2 years ago


Jennie Blythe, Head of Ecommerce at Whistles

I would always expect to have some kind of hypothesis behind an MVT also. I'd select the variants with this in mind.

over 2 years ago


Tim Schwarz, Head of Digital Marketing at University of Surrey

Hi Paul, a very interesting perspective. I must say coming from client side that I do still see value in MVT.

Mainly that you can run a higher number of tests over a shorter time period so change 'should' come faster. It can also certainly help manage internal stakeholders who all think their idea is best (e.g. the HIPPOs you often speak about!).

I do agree though that the process for A/B testing is more enjoyable, with more time to think through the ideas and hypotheses, and it is more clearly able to pinpoint why a conversion rate change has occurred. Which I can see delivering better improvements in the end.

over 2 years ago


Richard Game, worker ant at Cressive

Very good. Binary simplicity. Benefits all.

over 2 years ago


Deri Jones, CEO at SciVisum Ltd


> It can also certainly help manage internal stakeholders who all think their idea is best!

That's a valid point. Does it raise other questions though - strategic ones such as:
* do we have too many stake-holders!
* would it help if our process asked stakeholders to coordinate more, and agree on fewer common ideas, rather than pass all ideas through?

Or whether our process needs to be more evidence-based, asking stakeholders to do more prep. Paul described that prep as:
> ... making these qualitative/quantitative observations and based on prior experience/test learnings...

over 2 years ago


Paul Rouke, Founder & CEO at PRWD

@Jennie - thanks for your feedback. I'm interested to know what type of structure you use for your MVT hypotheses? Also, approx what % of your tests at Whistles are MVT versus straight A/B/n?

@Tim - thank you very much for your feedback on your testing strategy. I expect that with your traffic volumes compared to most businesses you at least have the levels which allow for more variations to be running in a test. What type of measures do you have in place from a statistics perspective for when you conclude your MVT tests? Also, approx how many page variations do your MVTs typically have?

over 2 years ago


Paul Rouke, Founder & CEO at PRWD

@Richard - thanks for commenting.

@Deri - thank you for your responses to Tim, you've made some very valid points and pulled out what is probably the most undervalued and under-utilised part of an intelligent hypothesis - "..making these qualitative/quantitative observations and based on prior experience/test learnings.."

over 2 years ago


Deri Jones, CEO at SciVisum Ltd

There is also the user experience angle on A/B and MVT - in terms of potential to slow down pages, or even cause subtle errors that prevent a % of conversions, perhaps only on some devices, or only in some product areas:

Basic performance monitoring can become useless: e.g. if the 'product filter' button changes name or moves around, then a simplistic monitoring journey will throw errors whenever the un-expected version is served!

Now that many eCommerce folks are highly focused on the experience on the Customer Journey: it's a shame if that focus drops away because of the extra challenge of A/B / MVT.

over 2 years ago


Darren Ward, Director of Product Marketing at User Replay Ltd

Good article and all makes great sense. I think it is interesting how much investment companies put into this type of technology when they haven't even got the basics right of making sure their customers have a decent experience on their site.

Surely more sensible to ensure you first have the tools in place (such as UserReplay) to monitor and measure customer experience before even thinking about MVT. Get the basics of a good customer experience right first and the rest will follow.

over 2 years ago


Greg Randall, Director

Hi Paul,

Great article, I 100% agree with your comments. I too have been testing since the "early days" for my clients. No disrespect to any of the comments made above, but in my experience, MVT is conducted by those who do not understand the consumer journeys being undertaken, and the journeys retailers want to deliver.

This lack of understanding leads to the shotgun approach hoping something will stick.

For the comments made around how MVT placates stakeholders, this actually does the opposite. Due to the lack of controls in place and the greater propensity for "pollution" creeping in, there will never be a definitive winner that will satisfy stakeholders and more importantly add value and build on consumer experiences.

The way I have successfully managed stakeholders with significant positional power is to conduct a series of A/B tests combined with the right hypothesis gathering. The stakeholder feels they are adequately heard, and their idea is clearly proven or disproven (the majority of the time they are wrong, because their hypothesis is driven by personal opinion :) ).

over 2 years ago


Jennie Blythe, Head of Ecommerce at Whistles

Probably only 10% are MVT. We do run a good number of AB(n) also. Hypotheses come from various sources: internal stakeholders, the exit survey, Google Analytics, etc. The best results are usually driven by experience of customer friction points.

over 2 years ago


Tim Schwarz, Head of Digital Marketing at University of Surrey

@Paul / Deri - I couldn't agree more with your comments. Currently there isn't MVT running here - that was in a previous role with even larger traffic volumes. We are focusing A/B testing in a few specific areas for many of the reasons you mentioned and thankfully the 'doers' are empowered with decision making so we can really focus on what works from a conversion and a user experience perspective.

I also find that a not too lengthy process that combines A/B testing with a user research driven program has led to strong uplifts in conversion (and NPS), as well as ensuring there is confidence from senior stakeholders that the right decisions are being made.

over 2 years ago


William Dixon, Analytics and Optimisation Consultant at Freelance

Tend to agree with this Paul. Never really found an application for MVT that wasn't better run as an ABn. MVT was part of the original testing sales pitch and stuck because it sounds cool. The trouble is data and time are usually limited so if you want to achieve success you need to use your skills to fast track to the best combinations rather than rely on a machine to work it out for you.

over 2 years ago


Tim Stewart, Optimisation Consultant at trsdigital Ltd

@Paul Can open. Worms everywhere!

Not sure there is a quick answer to this and I am a little limited in what I can share publicly on the tests I have done. I've been split testing since 05 and doing it pretty much exclusively since 09. I'd say I have run a fair mix of MVT and AB testing, probably a 40/60 split, easily well over 500 tests of all sizes.

So some general feedback and a very anonymised example from recent history but with a bunch of experience using both to good effect.

I have run very successful and clear sets of ABn tests, and very clear and successful MVTs.

It might be because I used to work at Maxymiser and now work with SiteSpect, which are both enterprise solutions. This offers greater technical ability from the tool (and greater detail on metrics). It also means I usually have had plenty of traffic to work with.

It also means that we aren't waiting months for useful data on an MVT and typically will have multiple tests running at once. I have one client who has the traffic and resource to run 30+ at any one time. With SiteSpect you can layer multiple tests independently on a page or user journey; with Maxymiser it's a little harder as you need to avoid script conflicts, particularly on the metric tracking, but it is possible. Both tools provide developer resource to enable that sort of work, but setup and QA on an MVT over an AB with lots of variants is not particularly slower or more difficult.

I have used Optimizely and GWO (when it still supported MVT) and I would say they are less accurate/clear on their MVT reporting which may be a factor in your experience. I have also used some of the wave based, Taguchi and full-factorial tools and their approach is a little more like you have described - load all the ideas and see what the tool spits out.

So this might be a factor in your recommendation to use AB over MVT

That said, even on the biggest sites an MVT with too many combinations can still be prohibitive, depending on the decision metric you are using. And regardless of the technology, a poorly planned MVT or AB will produce unclear results.

I really don't think it is as simple as one is better than the other.
It depends on the context and using the most suitable solution for the situation.

They each have their purposes and I would agree that a clear hypothesis, tracking the right metrics and variants that are structured to answer the question posed by the hypothesis makes the most difference - regardless of the testing methodology or technology.

Again maybe because I mostly work with Enterprise I do encounter the opinion you are describing - AB is simple testing, MVT is grown up testing. I often have to argue that a well-structured ABn should not be considered the poor cousin of MVT.

I've been able to deliver some spectacularly clear, repeatable results with "just" ABn on SME and Enterprise clients. I've also presided over some complete duds using ABn. So it's not a question of clarity or traffic, but good test structure.

I've also talked down Enterprise clients who had the traffic for an over-ambitious MVT into using sets of iterative ABn tests to answer their questions. I have run well structured ABn tests and got an undeniably clear result, when the client had tried a poorly conceived MVT and failed to get an answer.

So in no way am I of the opinion that AB is the poor cousin of MVT.

But I strongly disagree that MVT is useless and needs to get back in the corner

I do however agree that MVT is frequently used ineffectively, and is often an ego choice rather than the appropriate solution. So the hype seems inappropriate, because it's a little more effort to think it through and, used incorrectly, it delivers no greater return.

The way this post is written suggests that your opinion has come from people running poorly structured MVTs, which I would agree are almost always less useful than a well-planned AB.

MVT, and the perception that the technology takes care of it all, lends itself to "throw every idea at the page", so people often do just that. Some vendors are happy to let them do so because they charge more for MVT, and this very “test all the things” approach is one of the differentiators they use to justify an upsell.

But that is the fault of the sales pitch and the vendors, not the method.

Every idea needs to justify its existence in the test and have a clear relationship to the hypothesis. Often they don't. Often the only justification is "we like this new design" and there is no structure or consistency in the changes being made.

But I see that in ABn all the time too: Variant 1 is diametrically different to Variant 2, and both are vastly different to Variant 3. And all are so different to Control that if one is shown to be statistically different (up or down) you don't know why. It could be one of several things that you changed.

ABn should change only one thing, one element with multiple ideas on how to change that one element. Even when you do structure it so that each change is isolated it can be hard to see how much contribution each part made.

And for me that is where the main difference comes
It is where MVT can deliver an advantage

Done well, AB is good for a broad stroke, better/worse comparison:
- We have an idea of directions from our analytics, heatmaps and user feedback.
- We have come up with some potential solutions with hypotheses
- We will try a couple that agree and a couple that counter that hypothesis
- If one approach/theme works we double down on that line of enquiry.

It allows you to understand quickly
1. IF there is opportunity
2. A general direction where it might lie.
Right or Wrong, Black or White, Yes or No. Test, get result, move on.
Iterate quickly, and the effect on the business and the direction you follow may reveal things you could not have uncovered any other way.

MVT in its basic form does the same: you have several ideas about what could make a difference, several elements that might be a factor on the page, and a hypothesis of what "better" looks like. On some pages one change requires something else to change, so you introduce another element, whether you use MVT technology or not.

MVT allows you to test multiple Elements at once
But each Element is its own test, a test within the wider test
As such each Element needs its own hypothesis, each variant its own reason for inclusion.

With MVT the end result is still that the user sees one permutation, albeit made up of several hypotheses. But in that case it is still true that each factor/area needs a hypothesis.

Doesn't matter whether it is MVT or AB testing, if you aren't challenging a hypothesis then what you measure and report will not be clear.

So MVT is not (or shouldn't be) No Hypothesis Testing
If that is how it is done then you are right, that is not testing correctly.
It should be Multiple Hypothesis Testing.

Each Factor/Area/Element you are exploring needs a hypothesis on what effect it has on the user, each variant within that must be a way to explore how to change that Element to influence behaviour. The metrics used must relate to the things you are changing

And an MVT *can* be run as if it were a set of distinct AB tests running concurrently.
If used (abused) in that way then it can still be useful
- multiple changes in one test gives you three answers
As long as you have:

The metrics setup to measure the most appropriate thing for each Element
The traffic
A tool that allows you to separate these.

If your tool limits you to only one metric, or to Next Click and Sale only, then results can be unclear. If the tool can't report on the different parts of the page - the factors/elements you are changing - independently as well as in combination, then you will struggle. You can work it out offline but it's painful (I'm looking at you, Test & Target).

You still want the main decision metric, after all the Elements *should* be related to each other and should be part of the decision process to take Action.

But the ability to measure secondary metrics, substitution clicks if you are changing parts of the page that may promote alternatives to your main metric… these are useful ways to check how and why the main metric changed.

So it can be a time saver, several things tested at once and with different ways to look at the data plus the advantage that it is in the same sample period/buying cycle.

However, MVT should really be used when those two, three, four plus areas of the page have their own hypotheses but there is also an overarching hypothesis that links them.

It is a study on how they (and if they) interact and, possibly most importantly how much each part contributes to the whole.

The Enterprise level tools report MVT in two ways
The combined result of the permutations that come out of the experiences/combinations
The individual performance of the variants when averaged across all combinations
- the sum effect for a variant in an element when combined with good, bad, ugly combination of the other areas.

This provides a level of detail that *is* possible on an ABn.
But harder to achieve without some clear thinking and some additional maths offline
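That offline maths amounts to pooling each variant's results across every combination it appears in. A rough sketch with invented numbers:

```python
from collections import defaultdict

# Made-up MVT results: (headline, colour) -> (conversions, visitors).
results = {
    ("H1", "red"):   (120, 5000),
    ("H1", "green"): (150, 5000),
    ("H2", "red"):   (110, 5000),
    ("H2", "green"): (170, 5000),
}

def main_effects(results, element_index):
    """Pool each variant's conversions/visitors across every
    combination it appears in, returning its averaged conversion rate."""
    pooled = defaultdict(lambda: [0, 0])
    for combo, (conversions, visitors) in results.items():
        pooled[combo[element_index]][0] += conversions
        pooled[combo[element_index]][1] += visitors
    return {variant: c / v for variant, (c, v) in pooled.items()}

print(main_effects(results, element_index=1))
# green pools (150+170)/10000, red pools (120+110)/10000
```

This is the "individual performance averaged across all combinations" view; the combined-permutation view is just the raw `results` table read directly.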

In short, the MVT combinations the user sees are basically just ABn variants, but the way the test is set up it would be a much larger ABn, and the ability to report on both the individual parts and the sum of the whole can be really useful.

At the combined reporting level you will see good, bad, ugly performance of the full experience the user saw, just as you would if you run an ABn of what those combinations look like

But you also see how much contribution each part makes to the difference you record
You can also see if there is a positive catalyst effect or a negative counteraction effect when "good" and "bad" results are combined, when different ideas in different parts of the test have greater or lesser impact on the user.

MVT is most useful when you have related factors
"Improve this page" is not one change in isolation. Anyone presenting me an MVT hypothesis like you suggest would be sent back to the drawing board. What you need is your AB hypothesis structure but for every element and every variant in those elements. MVT should have this too, you just need more of them.

Ideally you have a Test level hypothesis but it joins these together: if hypothesis for section 1 is proven, section 2 is proven and section 3 is proven then we see this effect on this metric. If 1 is proven, 2 is disproven and 3 is unclear we see this effect on this metric. Just like AB you structure your test, your variants, your metrics in a way that you can answer each scenario.

It could be both the form fields AND the validation used AND the way that is messaged.
You might see more effect from validation if you have more form fields
You might see better effect from validation messaging if your validation is particularly painful, less impact if it is not.
You might not see any effect on Total Forms complete because people want/need to buy anyway
But you might see a drop in Attempts to Complete that failed or a count of Validation errors per user - so no proof that "better validation and shorter form increases sales" but clear proof that "better validation and shorter form reduces errors, time on page, number of retries".

So it comes back to what you want to achieve and how you measure what you are changing – in both AB and in MVT.

With MVT you can see which of these things work together, which counter-act each other and which are big contributors to the end result; which are not worth bothering with, which are so big an effect they drown out the signal on more subtle changes.

Tricky to explain so I will try to give a generic example where MVT can be more useful than AB. Or where MVT will at least reach greater understanding of WHY an uplift happened faster than a set of iterative tests.

It allows you to stack hypotheses for multiple parts of the page to answer a wider hypothesis for the whole page. Conversion is often the sum product of multiple factors or elements. MVT allows you to play with the variables to find a balance for what is optimal
AB gives you a direction, planned well can hint at Why
Running sequential tests and adding them up to get intelligence and a body of evidence
MVT gives you intelligence on which bits are most important and quantifies by How Much.
Running separate but related tests on the component parts at the same time
But allowing you to drill into key areas to get the optimal mix

MVT is poor for direction and big wins.
But it is great for nuance and getting the most possible

AB is great for direction and big wins
But when you test and find no direction or you have tested before and the increments are smaller then it becomes harder to see the signal and refinement is a much slower process.

A classic area to test is on the Product Detail Page
There is a fairly standard set of components: Image, Description, Price, CTA, with a collection of distractions such as Recommendations, Social Media, Tabbed information, specifications, reviews etc which are typically lower on the page

Now a well-structured set of ABn might try to test the Price in the Action Area (the price, description, size selector, CTA etc) with the following logic:
Control, New Price Block 1, New Price Block 2, New Price Block etc.
Price Block (meaning the section where the price, saving and previous price are displayed)

This is a simplified version of a set of tests I have run on this sort of area where an MVT was more useful than a single AB or a set of ABn tests.

Our research shows Price is a factor in decision to Add to Bag
We have data and user feedback to say our prices are competitive but people don't realise how much they can save; a lot of exits on this page are people searching for deals, when we offer better discounts than they would get elsewhere.
We should make the Total to pay and how much they can save clearer
This will result in more adds from our majority price-sensitive audience
So we want to test against
Size of the Price area
Presentation of the Savings Message
Presentation of the Was Previous Price

The total Action Area hypothesis is
- clearer Price and Savings will increase volume of Add to Basket
The hypotheses for each area within that were different ways to achieve "clearer"

They'd picked their new design
Same size price, Red for the Savings and using Percentage and crossed out Previous Price

They wanted to run an AB to prove their new design based on user feedback was a winner.

Ultimately an MVT delivered some nuance into how that was best achieved, and gave them more than double the uplift they would have seen had they simply tested their best guess. But more importantly it gave us useful intelligence on what had and hadn't worked within that, so we could refine it further.

Test ideas Included:
Savings Message Presentation
- Savings are a key Selling Point but they are smaller than the main Price and greater clarity would help

Red - Stands out, High Contrast, some might perceive it as negative to distract from main price
Black - Control

Savings Message Format
- Hypothesis that % saving is what people look for not the £ value

Percent Saved - Show relative saving vs previous, Users react better to 25% off than making them do the maths
£ value Saved - Control

Price Block Size
This presents a range of sizes, so we are not just testing Bigger but How Much Bigger (and whether Smaller is negative); if so, we reconfirm that clarity of the Price information is important

Biggest Possible Price Block - Most Impact, clearest to see on page load
Big Price Block - Increased Impact, clear to see on page load
Medium Price Block - Control, no change to current impact. This is what we measure against
Smaller Price Block - Counter hypothesis, if smaller is better, then Bigger is not better

Previous Price Presentation
Previous Price - Showing the Was price helps frame the saving, but this higher price near the main price might be confusing some users. Testing different ways to treat this should show whether the hypothesis that it is confusing/distracting is correct

Not Crossed out - Control (this was Black and bold, smaller than the main Price but quite high contrast on a white page)
Crossed out - Makes it obvious this higher price is not to be paid, draws less attention to this section, reduced distraction (but no other change in size or colour)
Hidden - Least possible distraction, may lose impact for Savings message if the From is hidden and only the Now shown

There was some debate over size, whether to also make this grey/low contrast, and whether to include the "Was" text. The recommendation was again to try a range, so this went from visible to less visible to gone completely, to see whether the hypothesis that making this look less like the main Price had merit to explore further. Follow-up A/B/n tests were planned with those ideas, so in this section we added a couple more variants to explore options for those future tests:
Grey Not Crossed Out – Crossed out works; lower contrast reduces distraction (hypothesis: distraction is negative)
Grey Crossed Out – The cross-out is not required; lower contrast is sufficient to reduce distraction enough to be positive
These acted as an A/B within the A/B within the MVT.
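To give a feel for the scale this creates, here's a minimal sketch of how a full-factorial MVT multiplies these four elements out. The variant labels are my own shorthand for the lists above; the real MVT tool would generate the combinations itself.

```python
from itertools import product

# Shorthand labels for the variants described above (naming is mine).
factors = {
    "savings_colour": ["Black (control)", "Red"],
    "savings_format": ["£ saved (control)", "% saved"],
    "price_block_size": ["Smaller", "Medium (control)", "Big", "Biggest"],
    "previous_price": [
        "Not crossed out (control)",
        "Crossed out",
        "Grey not crossed out",
        "Grey crossed out",
        "Hidden",
    ],
}

# Full factorial: every variant of every element against every other.
combinations = list(product(*factors.values()))
print(len(combinations))  # 2 * 2 * 4 * 5 = 80 page variations
```

Which is why combination counts balloon so quickly: adding one more variant to any single element multiplies the whole test, not just that element.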

But had these been run in the proposed single A/B/n, they wouldn't have tested these alternative approaches.
They would have found their preferred new design (same-size price, Red, Percentage, crossed-out Previous Price) was a bit better than Control (about 2%) and been pretty happy with that.

This could also all have been done as a set of A/B/n tests, and even run sequentially they would have learned something more. They could have found that:
- Red did improve Add to Basket on the Savings message
- Percent Saved (no colour change) was a small negative vs Control - not highly significant either way on whether % or £ saved worked better
- Biggest and Big beat Medium by some margin; Small was clearly negative, further confirming the Bigger hypothesis by disproving the counter-hypothesis that smaller was better
- Previous Price showed a small but measurable difference: Hidden = Grey Crossed Out > Grey > Control Crossed Out > Control

Now, they had assumed % Saved was positive. Had they tested their original idea as an A/B/n, they would have been happy with the 2% uplift but never known that Red was what made the most difference - and that the uplift could have been higher had they tried Red with the £ rather than the % saved.

There had been a lot of discussion about the best way to handle the Previous Price, but ultimately this element of the MVT was shown to have the least standalone benefit.

Because the biggest differences for all the elements came when they were combined with the Size change.

Increase in Prominence had an accelerating effect
– what was Good was Better, what was Bad was Worse.

When Red was combined with either of the Bigger variants it reported an even higher impact.
Measured against control alone it showed some uplift; combined with the size change, that uplift nearly doubled.

Similarly, we could expose combinations that were negative and had reduced our potential win - something that would not have been evident in the A/B:
Big Red % was worse than Big Control-colour %.
Big Control-colour % was worse than Medium (control size) %, which was roughly equal to £ (control).
Big Red £ was better than Medium Red £, and much better than Medium Control-colour £.

Size was a catalyst that exaggerated the effect (in either direction).
Contrast/colour triggered a change, but it was not always positive.
It depended WHAT was being highlighted.
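One toy way to picture that catalyst behaviour is to model each change as a signed effect that prominence multiplies. All effect sizes and multipliers below are hypothetical, purely for illustration; they are not the test's real numbers.

```python
# Hypothetical effect sizes, purely for illustration - not real results.
# Each element change carries a signed base effect; extra prominence
# (bigger size, brighter colour) multiplies whatever is already there.
def combined_effect(base: float, size_mult: float, colour_mult: float) -> float:
    """Prominence amplifies the underlying effect, good or bad."""
    return base * size_mult * colour_mult

POUND_SAVED = 0.010     # assumed small positive change
PERCENT_SAVED = -0.005  # assumed small negative change
BIG, MEDIUM = 1.8, 1.0  # hypothetical size multipliers
RED, BLACK = 1.5, 1.0   # hypothetical colour multipliers

print(combined_effect(POUND_SAVED, BIG, RED))    # good gets better
print(combined_effect(PERCENT_SAVED, BIG, RED))  # bad gets worse
```

The point of the sketch: a multiplier never changes the sign of the base effect, only its magnitude, which is exactly the "what was Good was Better, what was Bad was Worse" pattern described here.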

We were able to compare % in the control colour, in the control colour but bigger, in Red, and in Red and bigger.
At Medium size there was no clear difference between the control colour and Red.
% in Red at Medium size was a small negative, but had it been an unclear result in a standalone A/B it could easily have been assumed to be a neutral change.

Biggest and Big combined with the Control crossed-out treatment were worse than Control.
Not by much - other factors were more important - but it was measurable.

With MVT we were able to understand how the parts of the changes related to each other: which bits compounded effects and which conflicted.

Depending on the order in which you ran these as A/B/n tests, you may or may not have detected this nuance.

Simply running their original Best New Design would have gained the positive of the Red Savings message, offset by an unquantified negative from it being a % saving, plus a small positive from crossing out the control-colour Previous Price.

If they had tested Bigger first and seen that Bigger was better, that would have become the new Control.
But Bigger Crossed Out and Bigger Not Crossed Out were negative – making these more obvious turned a neutral effect into a negative one.

Bigger % (Red or Control) was negative; with Red being higher contrast than the control colour, more obvious was more negative when the change to % was itself a negative one.

Now, if they had tested Bigger as better and Red as better in the first two tests, they'd have assumed Big Red was better.

So % Saved in Big Red would have been assumed to be neutral or positive.
But it compounded the impact of the other changes:
what was a barely noticeable negative at the default size was clearly negative when made Bigger or Red, and very negative when both.

So they could have tested % as the third test and gone backwards.
Or they could have tested % as the first test, seen no clear negative, and implemented it -
then assumed Big and Red were not positive changes when they ran those follow-up tests.

Ultimately the best combination was Big, Red, £, with the previous price Grey Crossed out or Hidden completely

Grey Crossed Out and Hidden did OK when looked at in isolation as a factor - their A/B within the A/B within the MVT:
Hidden >= Grey Crossed Out > Grey Not Crossed Out > Crossed Out > Not Crossed Out
Crossed out > Not crossed out
Grey > Not grey
Hidden > all of them
But compared with the shifts seen in other combinations, none of the Previous Price variants was particularly impactful.
In fact the only clearly negative combination was the high-contrast control (not crossed out) combined with the bigger sizes. So our Control actually dropped in performance if we made it bigger - something that would never have come out unless the test order happened to be precisely right to show it.

So this Previous Price area - the subject of most debate within the business, with more ideas than would have fitted into the A/B/n test, and one that might have attracted lots of testing - showed almost zero impact when compared with the other, bigger changes.

Looked at in isolation, several tests could have been run here to move the needle by a fraction, because we'd only have been judging it against other Previous Price tests and variants.
In comparison with the other areas it was clear this was not an important one.

And this is most clearly shown by the "least damaging" variant being Hide it completely.
If you can remove something completely and the effect is neutral or positive...
the users didn't care as much as you did.

Grey Crossed Out was ultimately what they went with, due to legal concerns about not showing a previous price at all. Of the variants that kept the Previous Price visible, this low-contrast version was the least impactful - which fits the overall hypothesis: if important information is made clearer, we see positive results. In this case we made it less clear, so this is not important information.

The eventual uplift, keeping the good stuff (bigger Price section, Red for the Savings message, £ for the amount saved, the Previous Price hidden or reduced to as little noise as possible), was closer to 5%.

We had the combination they wanted to launch and simply "run as the winner" included as one cell in the test, and we were able to watch it stabilise around 2%.

So by being able to look at which elements worked together, which conflicted, and how the whole was made up, we got far more intelligence on the ingredients and their influence.
And we were able to spot the areas of their preferred design that dragged down the good elements, and advise against those changes.

Now, this test and its follow-ups ran for a fairly extended period, so this wasn't a one-off result. I picked this as an example because we spent a lot of time testing this page, repeatedly, and never once saw a result that came even close to contradicting the data from previous tests. The shifts were small, but the confidence levels were definitive and the error intervals were tiny.

We were able to repeat this and try alternative approaches, but the overall pattern remained the same: size and bright contrast for the thing they wanted attention on had the most impact; other noise worked best when hidden or given as little focus as possible.

You can argue that a set of A/B/n tests would eventually have reached the same conclusion.
But the MVT quantified what was important and what was noise to be hidden.

However, done as a bundle of changes in one A/B, the small effects (positive and negative) of some of these areas would have been drowned out when tested against earlier winners. Tested independently, some of these low-importance areas would have appeared more important than they really were in terms of their effect on the page.

If Red % had been tested first (as they wanted) and the 2% uplift was the win, testing Bigger next would have shown lower, possibly negative, performance.
So they could have discounted a change they'd considered that actually proved to make the most difference - because they would have added the accelerator and highlighted the negative aspect of their new design more than the positive. They might even have thought, "Well, Big and Red does look scary; let's test making it smaller so it stands out less."

When in reality Big and Red were both consistently positive - as long as they highlighted the right elements.

Because they hadn't identified that Red was positive when combined with £, and that % was negative when highlighted, they would have assumed that Bigger was not the direction to investigate.

The direction indicated by the A/B would have been incorrect.

They might have written it off as an area to avoid, missing out on a further possible ~3% uplift that was there on the page - all from not understanding which parts worked together in which combinations, and which exaggerated negatives.

It's possible they could have subdivided all of this into its component parts so that every permutation appeared in some A/B test, AND run them in an order that would not have led them to dismiss the ultimate winners. But it's unlikely they would have been that lucky, or would have persisted in adding "negative" variants back in to try in a new setting.

The other interesting follow-up was the mixed performance of the % saving.
We had this segmented, so whilst on aggregate it was negative - and increasingly so the more it was highlighted - we could see it worked in some areas of the site and not in others.

We tested around this a fair bit further in the end.
This first test had aggregated performance on all PDP templates.

Ultimately some related testing discovered that users didn't particularly care if it was £ or % saving. What mattered was the value of the number.

Bigger Value = Better.
Users didn't appear to read whether it was % or £.
So if £20 represented 2.5% saving... then Percentage saved wasn't positive.
2.5 looks small compared to 20
But if £3 saving represented 30% off previous price then Percentage was positive.
30 looks much bigger than 3

In our initial rounds of testing, making the savings percentage bigger had simply highlighted the cases where the number being shown wasn't impressive.

In this case the £ saving was the bigger number for the user to read on more parts of the site. Make that bigger: better effect.
Make "Save 1.3%" bigger and redder: you actually create a negative effect.

So in follow-up testing we were able to define which of £ or % should be shown depending on how big that number was: highlighting it when it was over a "good" threshold, and not highlighting it when it wasn't.

That way departments could be consistent in whether they showed % or £ saved, while we kept the positive and mitigated the negative by highlighting the weaker number less.
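A minimal sketch of that display rule, assuming a hypothetical "good" threshold (the real threshold came from the follow-up testing): show whichever number reads bigger, and only highlight it when it clears the bar.

```python
# Hypothetical threshold - the real "good" value came from follow-up tests.
HIGHLIGHT_THRESHOLD = 20

def savings_display(was: float, now: float) -> tuple:
    """Show whichever number reads bigger; only highlight it when it
    clears the threshold. Returns (label, highlight?)."""
    pounds = was - now
    percent = 100 * pounds / was
    if percent > pounds:
        label, value = f"Save {percent:.0f}%", percent
    else:
        label, value = f"Save £{pounds:.0f}", pounds
    return label, value >= HIGHLIGHT_THRESHOLD

print(savings_display(800, 780))   # £20 is only 2.5%: show £, highlight
print(savings_display(10, 7))      # £3 is 30% off: show %, highlight
print(savings_display(1000, 990))  # £10 is only 1%: show £, don't highlight
```

The design point is the one made above: users react to the magnitude of the number, not its unit, so the rule optimises for the bigger-reading figure and withholds emphasis when neither figure is impressive.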

And yes, we did that with MVT too.

over 2 years ago


Luke Hardwick, Web Optimisation Consultant at SiteSpect Ltd

I'm inclined to agree with Tim on this one. Done well, MVT can deliver great results and give you more detail on which of the changes within your design was most or least impactful. But it needs to be properly researched and structured, with a solid hypothesis for each element. A/B/n testing can suffer just as badly from poor research, poor structuring and a lack of hypothesis. Ultimately, with both approaches you only get out what you invest at the beginning.

over 2 years ago

Paul Rouke

Paul Rouke, Founder & CEO at PRWD

@Darren - thanks for that. I agree, there are many businesses who should be getting the right foundations in place for measuring and understanding their current user experience before getting drawn in to investing in expensive testing tools and jumping in to MVT.

@Greg - thank you very much for your comments and insights on your work, much appreciated. I like your summary "This lack of understanding leads to the shotgun approach hoping something will stick." You have also highlighted a hugely important area around stakeholder engagement and strategic understanding which can't be underestimated at all.

@Jennie - thank you very much for your feedback on the approximate split of tests you are running. Out of interest, roughly how many tests do you run per month? It sounds like it will be quite high across the 2 types.

@Tim - thank you very much for your feedback from TSB's perspective and what is working for you, and also for how you are managing stakeholder expectations and confidence in optimisation for growing the business.

@William - thanks for sharing your thoughts. What you say here can't be underestimated at all, this is so crucial: "The trouble is data and time are usually limited so if you want to achieve success you need to use your skills to fast track to the best combinations rather than rely on a machine to work it out for you." Data and time being usually limited as the 2 key drivers here.

@Tim - wow! I'm not sure what to say to this - possibly the biggest ever comment on a blog post?! I'm really glad I have piqued your interest and prompted you to reply; it's much appreciated, not least given the time it must have taken you.

I'm interested to know your thoughts on the minimum monthly traffic/conversion level you feel a business needs before adopting some of the layered and detailed MVT approaches you have explained here. I appreciate it sounds like you have worked with a lot of enterprise clients with millions of visitors per month. Also, what do you typically do when you want to segment your MVT to look at the behaviour of different audiences, compared with segmenting traffic for an A/B/n test?

over 2 years ago

Tim Stewart

Tim Stewart, Optimisation Consultant at trsdigital Ltd


Obviously, on the bigger clients something like I described is entirely possible - from memory that original run took only three or so weeks. In terms of size, I'd say you need at least 1,000 conversions a month on your decision metric, and you'd need to plan your test size to suit.

MVT allows you to scale test sizes very quickly. Even the biggest sites may have the traffic for 512+ combination tests, but I would rarely see much above 256, because you could usually chain a set of MVT and A/B/n tests together and learn more. Generally, if your combinations are that numerous, either you haven't planned your hypotheses well enough and have included every unrelated idea, or your ideas are so subtly different that the traffic needed to split the difference is likely to be prohibitive. Less is more; even though MVT will let you test a huge amount in one go, I'd rarely see a use case to justify it.
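As a back-of-envelope illustration of why combination counts matter: conversions on the decision metric get divided across every cell. The per-cell target here is a hypothetical planning number for illustration, not a statistical rule.

```python
# Rough planning arithmetic: conversions on the decision metric are
# divided across every combination, so cell sizes shrink fast.
MONTHLY_CONVERSIONS = 1000  # the suggested minimum on the decision metric
TARGET_PER_CELL = 100       # hypothetical conversions wanted per combination

def months_to_fill(combinations: int) -> float:
    """Months until every cell has roughly TARGET_PER_CELL conversions."""
    return combinations * TARGET_PER_CELL / MONTHLY_CONVERSIONS

for combos in (4, 16, 80, 256):
    print(combos, "combinations ->", months_to_fill(combos), "months")
```

At the 1,000-conversions-a-month floor, even a modest factorial quickly pushes run times out to months, which is why trimming hypotheses (or chaining smaller tests) beats one enormous grid.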

I think the largest I have seen run was in the 500-combination range, but you can cull a whole element if it's shown not to be adding value, and that can drastically reduce the combinations being tested. So typically tests will be adjusted and the report data time-segmented, so you can look at both the whole and the iterations. Don't forget you can look at the separate factors in reports - the summed data of all combinations containing a given variant - and see useful data much sooner than you will for the final individual combinations. If each section has a hypothesis, and each variant within it has a proven/disproven case for its inclusion in that section, your preliminary reporting is often at this factor/element level.
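That factor-level reporting can be sketched as pooling every combination that contains a given variant. The data shape and numbers below are hypothetical, purely to show the aggregation.

```python
from collections import defaultdict

# Hypothetical per-combination results: (colour, size) -> (visitors, conversions)
results = {
    ("Red", "Big"): (1000, 60),
    ("Red", "Medium"): (1000, 52),
    ("Black", "Big"): (1000, 55),
    ("Black", "Medium"): (1000, 50),
}

def factor_report(results: dict, factor_index: int) -> dict:
    """Pool every combination containing each variant of one factor;
    these marginal totals stabilise far sooner than any single cell."""
    totals = defaultdict(lambda: [0, 0])
    for combo, (visitors, conversions) in results.items():
        totals[combo[factor_index]][0] += visitors
        totals[combo[factor_index]][1] += conversions
    return {variant: conv / vis for variant, (vis, conv) in totals.items()}

print(factor_report(results, 0))  # conversion rate by colour
print(factor_report(results, 1))  # conversion rate by size
```

Because each variant's marginal total pools traffic from every combination it appears in, a four-factor test gives each factor several times the sample of any single cell, which is why this level reads out first.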

And typically this is where you see the strongest effect early on for segments: because each factor's result is the sum of its variants combined with any variants in the other factors, you tend to get a strong indication of whether it is appealing or unappealing to a segment, or (more often) has a greater or lesser impact on a segment. So whilst the test is still running you can start to look at why a segment reacted strongly (or not at all) to a particular part of the page and begin planning a follow-up investigation.

The client in the example is a large retailer; every fraction of a per cent was worth a lot of money, so every fraction of risk from untested ideas was potentially very expensive too. People often talk about the value of iterating fast to grab wins; I often spend my time quantifying the potential range of risk of ideas that would otherwise have been released without testing. In this case we spent much more time looking into that page and made some significant incremental gains. That particular test was, I felt, a decent example of an MVT leading to a discovery, disproving some assumptions, and leading directly to follow-up tests that would have been hard or impossible with "just" A/B/n - which was what you were looking for when you posed the question.

But I guess that comes down to your point on opportunity cost. The risk-reward for investigating and implementing the learnings was a very positive balance: a few months of related testing around the Action area of the PDP resulted in a fundamental increase in the benchmark of Product Page to Basket conversion - nearly tripled step conversion, IIRC, after we finished the first pass. I don't recall/can't share the exact numbers, but it was at least in the order of a six-figure difference in order value per month: more than 20x the cost of the tool and the time.

That said, I have run MVT successfully on sites with much lower traffic, in the order of 35k visits a month (vs 500k a day). As long as you are sensible with the test size and work with metrics that have enough volume (like clicks into Basket rather than completed orders), you can still run and conclude insightful tests within a useful timeframe.

Regarding segmentation: yes, if you are subdividing your audience then again you need to think about the scale of the test (A/B/n or MVT). It also depends on the tool and what you are doing with segmentation.

Some tools let you target and run a test against a single segment (say Mobile vs Desktop, or Organic vs Paid), excluding the other segments. That will affect your sample size (and, unless your site can serve different experiences to different segments, can be frustrating, as implementing conflicting winners is never easy). Other tools let you run the test across the full audience but apply segment filters to the results, so you might have small confidence intervals and high significance for the full dataset but need to wait a little longer until you are confident about the smallest segment. That's the more typical use case on big tests; follow-up tests can then address segment-specific conversion issues identified in the first test.

I would often see that split on New vs Returning user segments. One of those segments would be much smaller than the other - e.g. Returning users being a much larger proportion of buyers - and the aggregate of their positive reaction would mask a less positive, or even negative, response from New users. In those cases it depended on the business objective: maximise sales (appeal to Returning users and accept higher-than-ideal losses on New) or grow the customer base (appeal to New users and focus on acquisition, knowing the design was less optimal for Returning).

I often see this on the Login/Register/Guest page. Which design "worked" was more often a symptom of the proportion of each user type reaching the page - something to be very cautious of when interpreting aggregated results.

Some testing systems will actually layer segments into the testing, automating both the segment discovery and the which-variant-to-show decision, so the optimisation adjusts itself and the best case for each segment is what gets tested. But again, that multiplies the traffic you need, and can mean you reach significance on high-volume segments much sooner than on smaller ones.

over 2 years ago


Jon Taylor, Head of Digital at YBS

I'm not sure your attack on MVT is warranted. Some of the key foundations of your argument - 1) MVT lacks a reason why, and 2) it's difficult to unpick what MVT is telling you - are more to do with poor testing practices than with the particular technology.

On 1): as far as I'm concerned the phrase "test without a hypothesis" is an oxymoron. If you have no hypothesis then you are doing what I'd call messing about.

On 2): if you set the parameters of your testing too widely then you will not get clear results.

MVT is useful and I've seen it deliver significant improvements in conversion; the same goes for A/B testing. As the mega-poster suggested, each has its own place, but each also requires planning - MVT typically more so than A/B.

Also, don't forget that A/B is a subset of MVT with only two variants. This in itself implies the challenge with poorly executed MVT: narrower is better, but not necessarily "best".

over 2 years ago

