When is the best time to call time on a split test?
One of the biggest issues when running a split test is knowing how long to run it for. Confusingly, depending on the split testing tool you use, it may tell you when to stop, or it may even stop the test for you based on its own statistical calculations. For the uninitiated that might sound like job done, but in fact it’s misleading and possibly injurious to the final objective. You need to consider a few other things:
Business cycle
For the purpose of split testing, the business cycle, i.e. aggregate trade activity over a given period of time, is often thought of in terms of weeks or months. A year is rather too long to wait before calling a result, even if that result is likely to be statistically more robust. Days are far too short a time frame, especially when you consider that conversion rates and traffic volumes often fluctuate from day to day over the course of a week; for a large portion of the population, two out of every seven days are still non-working days.
Because of that, when running a split testing program the tester needs to make sure that each test runs over one complete business cycle or several complete business cycles.
Purists will argue that the run time of the test should also not favour any one part of the business cycle, so that if your cycle is a week and the test is started at 9am on a Monday morning then it should also end at 08:59 (or 9am) on a Monday morning. Doing this you will be less likely to skew the results by including one part of a business cycle (which may favour one of the variants) but not another.
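As a rough illustration of the purist approach, the end of a test can be worked out by adding a whole number of cycles to the start time. This is only a sketch: the function name `aligned_end`, the one-week cycle and the example date are assumptions made for illustration, not something your testing tool provides.

```python
from datetime import datetime, timedelta


def aligned_end(start: datetime, cycle: timedelta, cycles: int) -> datetime:
    """Return the stop time: the start plus a whole number of business cycles."""
    if cycles < 1:
        raise ValueError("run the test for at least one full cycle")
    return start + cycle * cycles


# Illustrative values: a test started at 9am on a Monday, run for two weekly cycles.
start = datetime(2024, 1, 8, 9, 0)
end = aligned_end(start, timedelta(weeks=1), 2)
print(start.strftime("%A %H:%M"), "->", end.strftime("%A %H:%M"))
```

Because the end is always a whole number of cycles after the start, it lands on the same weekday and time of day, so no part of the week is counted more often than another.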
Purchasing cycle (or decision-making process)
This is perhaps a footnote to the point above. On most e-commerce sites the decision-making process involved in buying a product or service may take longer than a day.
To get a better picture of how this looks, go into your web analytics package and look for a report that shows you time lag and/or path length.
In Google Analytics you can find this in the Multi-Channel Funnels reporting area. These reports will show you the number of interactions (Path Length) or the number of days (Time Lag) it takes for customers to buy something. Based on this you will then have an idea of your purchasing cycle.
If you have an especially long purchasing cycle then your test time frame will need to take this into account.
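One way to turn a Time Lag report into a working figure is to ask how many days it takes for most conversions to complete. The sketch below assumes you have exported the report as a simple mapping of days-to-conversion to number of conversions; the figures, the 95% threshold and the function name `days_to_cover` are all made up for illustration.

```python
def days_to_cover(time_lag_counts: dict[int, int], share: float = 0.95) -> int:
    """Smallest number of days within which `share` of conversions have completed."""
    total = sum(time_lag_counts.values())
    if total == 0:
        raise ValueError("the report contains no conversions")
    running = 0
    for days in sorted(time_lag_counts):
        running += time_lag_counts[days]
        if running >= share * total:
            return days
    return max(time_lag_counts)


# Hypothetical Time Lag figures: days to conversion -> number of conversions.
example_report = {0: 520, 1: 140, 2: 90, 3: 70, 5: 60, 8: 50, 12: 40, 20: 30}
purchase_cycle_days = days_to_cover(example_report)
print(purchase_cycle_days)
```

If the figure that comes back is longer than your business cycle, that is a sign the test time frame needs stretching to match.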
Volume of data
The short answer here is ‘the more the better’. That said, it’s impractical to wait for months and months to call a result, and for many e-commerce sites the purchasing cycle is not likely to be that long.
In most cases, 300 good outcomes (sales) per variant is considered the minimum.
BUT, if you have your testing tool hooked up to your web analytics tool specifically so you can segment your test results, then 300 good outcomes might not be enough. The obvious reason is that segmenting a dataset with 300 good outcomes inevitably leaves each segment with fewer good outcomes, making it smaller, less robust and less reliable. In these instances you will need to increase the minimum sample size to accommodate your segmentation.
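To put a number on that, one simple approach is to scale the minimum up so that even your smallest segment still reaches it. The sketch below assumes you know roughly what share of good outcomes the smallest segment accounts for; the function name `outcomes_needed` and the 25% share are illustrative only.

```python
import math

MIN_OUTCOMES = 300  # the rule-of-thumb minimum per variant quoted above


def outcomes_needed(smallest_segment_share: float, minimum: int = MIN_OUTCOMES) -> int:
    """Good outcomes per variant needed so the smallest segment still reaches `minimum`."""
    if not 0 < smallest_segment_share <= 1:
        raise ValueError("share must be greater than 0 and at most 1")
    return math.ceil(minimum / smallest_segment_share)


# Illustrative case: the smallest segment produces a quarter of all good outcomes.
print(outcomes_needed(0.25))
```

With no segmentation the share is 1 and the answer stays at 300; the smaller the segment you want to read, the more the overall minimum grows.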
Regression towards the mean
This is the phenomenon where a variable that was extreme on its first measurement tends to be closer to the average on its second measurement or where a variable that is extreme on its second measurement tends to have been closer to the average on its first. In split testing this often reveals itself in circumstances where both variants converge.
This process doesn’t happen overnight. So often at the start of a test the data can look fantastic with the new variant set to bring home the bacon in style, but of course:
1. The sample will most likely be far too small unless you have a high traffic site or you are adopting very broad targeting for the experiment.
2. Partly as a result of (1), as the sample grows the phenomenon of regression to the mean will reveal itself.
So by ensuring that you collect a robust sample of data AND run for a minimum of two business cycles (preferably more), you should have accounted for this phenomenon.
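If you want to see regression to the mean for yourself, a small simulation helps. The sketch below is not a model of any real site: it assumes both variants share the same true conversion rate (an A/A test), and the traffic figures, the 5% rate and the function name `simulate_aa_tests` are invented for illustration. It picks out the tests where variant B looked best early on and compares that early gap with the gap once the full sample is in.

```python
import random


def simulate_aa_tests(n_tests=500, early_n=100, final_n=2000, rate=0.05, seed=1):
    """Simulate A/A tests, i.e. both variants have the same true conversion rate.

    Returns a list of (early_gap, final_gap) pairs, where a gap is the observed
    conversion rate of B minus that of A.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_tests):
        a = [rng.random() < rate for _ in range(final_n)]
        b = [rng.random() < rate for _ in range(final_n)]
        early_gap = sum(b[:early_n]) / early_n - sum(a[:early_n]) / early_n
        final_gap = sum(b) / final_n - sum(a) / final_n
        results.append((early_gap, final_gap))
    return results


results = simulate_aa_tests()
# Keep the tenth of tests in which B looked strongest after the first few visitors.
results.sort(key=lambda pair: pair[0], reverse=True)
early_winners = results[: len(results) // 10]
mean_early = sum(early for early, _ in early_winners) / len(early_winners)
mean_final = sum(final for _, final in early_winners) / len(early_winners)
print(f"average early gap: {mean_early:.4f}, average final gap: {mean_final:.4f}")
```

Even though neither variant is actually better, the early "winners" show a healthy lead on a small sample, and that lead shrinks towards zero as the sample grows.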
It’s important to remember that the criteria above are not a pick and mix set of options. If high traffic volumes let you collect 300 good outcomes per variant before you have completed one business cycle, you still do not have a reliable result, even if your testing tool calls it. You should wait until you have a minimum sample size PLUS at least one full cycle (preferably two). This underscores the point that, generally speaking, the more data you have the more robust your results will be.
There is another, possibly rather more obscure, reason why tests need to run for longer. It is harder to measure, but it is potentially linked to the final point about regression to the mean.
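The "all of the above, not pick and mix" rule can be written down as a simple check. This is a sketch under assumptions: the function name `ready_to_call`, the weekly cycle, the dates and the outcome counts are hypothetical, and the defaults simply mirror the 300 outcomes and one full cycle mentioned above.

```python
from datetime import datetime, timedelta


def ready_to_call(outcomes_per_variant, started, now,
                  cycle=timedelta(weeks=1), min_outcomes=300, min_cycles=1):
    """True only if every variant has enough good outcomes AND enough full cycles have passed."""
    cycles_done = (now - started) // cycle  # number of whole cycles elapsed
    enough_data = all(n >= min_outcomes for n in outcomes_per_variant.values())
    return enough_data and cycles_done >= min_cycles


started = datetime(2024, 1, 8, 9, 0)  # illustrative start: a Monday at 9am
outcomes = {"control": 450, "variant": 480}  # made-up good outcomes per variant

# Three days in: plenty of sales on a busy site, but no full cycle yet.
print(ready_to_call(outcomes, started, datetime(2024, 1, 11, 9, 0)))
# One full week in: both conditions are now met.
print(ready_to_call(outcomes, started, datetime(2024, 1, 15, 9, 0)))
```

Setting `min_cycles=2` applies the stricter "preferably two" version of the same rule.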
When running a split test you are, by definition, introducing your customers and returning visitors to a new page layout and possibly even a new process on your site. Some people will take to the newness quicker than others, but once it has been in the wild for long enough and your sample has become used to it, they may only then start to find it easier to use and only then might it start to yield a positive return when compared to the control. This is basically the inflexion point of the test.
Key takeaways on planning your time to test
Of course it may be that the inflexion point is never reached but all this is to say that:
1. There are some clear guidelines and principles as to how long you should run a test for.
2. If you see your test variant performing atrociously to begin with do not think that it stands no chance of ever recovering and pull the test in order to avoid losing revenue.
3. Remember that things take time to bed in.