Hypothesis Testing in A/B Testing
In my last post, I went through hypothesis testing with a simple coin flipping example.
We established that hypothesis testing is a way of systematically quantifying how certain you are about the result of a statistical experiment. It starts by forming a null hypothesis such as "Design A performs better", converting it into a mathematical statement, and then using a probability distribution to test that statement at a specific confidence level. In today's article, we will apply this to a real-world analytics application: A/B testing.
What is A/B Testing?
A/B testing is one of the primary tools in any data-driven environment. It's a way of conducting experiments in which you compare a baseline control sample against one or more test samples, where each test sample differs from the control by a single variable.
For example, you might have a landing page that shows a list of the latest products. You'll want to test various layouts to try to maximize the sales made through this page. Normally, we use the "conversion rate" to measure the page's performance. By assigning the control sample and each test sample a similar amount of traffic, we can make a decision by observing the conversion rates.
The Fake Data
You might ask: "Why do we need hypothesis testing if we already have the conversion rate to measure performance?"
Assume you are running an email campaign to show the latest offers. You have 3 versions with different layouts: a control sample, test sample A, and test sample B. You run an A/B test before formally launching the campaign. Here are the results you might get:
| Version | Visitors treated | Orders | Conversion Rate |
|---|---|---|---|
| Test Sample A | 180 | 45 | 25% |
| Test Sample B | 189 | 28 | 14.81% |
Based on these results, could we conclude that A is the best? When the sample size is large, the results might turn into this:
| Version | Visitors treated | Orders | Conversion Rate |
|---|---|---|---|
| Test Sample A | 10000 | 2000 | 20.00% |
| Test Sample B | 10000 | 1800 | 18.00% |
Don't be surprised if you get these results, because sample size matters. The problem is that when you run an email campaign, you usually can't get a large test sample. If you make a judgement based only on a comparison of the conversion rates, you might make the wrong decision.
To avoid that, you can use hypothesis testing to justify the results you get, especially with small sample sizes or very similar conversion rates.
Remember the statistical principle we mentioned in the last post: if a small-probability event happens in your test sample, you can reject the null hypothesis. In this case, the null hypothesis is that the conversion rate of the control treatment is no less than the conversion rate of our experimental treatment. Mathematically:
H0: P − Pc ≤ 0
Where Pc is the conversion rate of the control and P is the conversion rate of one of our experiments.
Therefore, if the probability of seeing results at least this extreme under H0 is low enough, we can reject it and go with the alternative hypothesis: "the experimental email campaign has a higher conversion rate than the control." That is what we want to see and quantify.
To measure that probability, we need to know the relevant probability distribution. The sampled conversion rates are approximately normally distributed random variables, just like the coin-flip proportion. Instead of checking whether a rate deviates too far from a fixed probability, we want to check whether it deviates too far from the control treatment's rate. Another statistical rule helps here: the sum or difference of two normally distributed variables is itself normally distributed. With this rule, we can run a Z-test and use the 95% confidence level, just as we did in the coin-flipping example.
Mathematically, the Z-score for testing H0 is calculated as:

Z = (P − Pc) / √( P(1 − P)/N + Pc(1 − Pc)/Nc )

where N is the size of the experimental sample and Nc is the size of the control sample.
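If it helps to see that arithmetic in code, here is a minimal Python sketch of the formula. The function name is mine, and the control figures in the example call are purely hypothetical (the tables above only show the test samples):

```python
from math import sqrt

def conversion_z_score(orders, visitors, orders_ctrl, visitors_ctrl):
    """Z-score for H0: P - Pc <= 0, following the formula above."""
    p = orders / visitors               # experimental conversion rate P
    pc = orders_ctrl / visitors_ctrl    # control conversion rate Pc
    se = sqrt(p * (1 - p) / visitors + pc * (1 - pc) / visitors_ctrl)
    return (p - pc) / se

# Hypothetical control figures (35 orders out of 190 visitors), for illustration only:
print(conversion_z_score(45, 180, 35, 190))   # ≈ 1.54
```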
Do you remember that we used a z-score of 1.96 to correspond to the 95% confidence interval? This time it's a little different: we will use 1.65 instead of 1.96. Why? In the coin-flip example, the null hypothesis was P = 0.5, so we could reject it if the observed proportion was either too high or too low. This time we only care about one direction.
If the z-score falls in the blue region, we treat the result as a small-probability event and reject the null hypothesis. In the coin-flip example, the blue region sits on both tails of the normal distribution; this is called a two-tailed test.
In this A/B testing example, we'll only reject the null hypothesis if the experimental conversion rate is significantly higher than the control conversion rate, so the blue region sits only on the right tail of the normal distribution. This is called a one-tailed test.
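If you want to see where 1.96 and 1.65 come from, you can ask the standard normal distribution directly. This sketch uses scipy, purely as an illustration:

```python
from scipy.stats import norm

# Two-tailed test at 95% confidence: the 5% is split between the two tails,
# so the cutoff is the 97.5th percentile of the standard normal.
print(norm.ppf(1 - 0.05 / 2))   # ≈ 1.96

# One-tailed test at 95% confidence: all 5% sits in the right tail,
# so the cutoff is the 95th percentile.
print(norm.ppf(1 - 0.05))       # ≈ 1.65
```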
Using the formula above, we could get the following results:
| Version | Visitors treated | Orders | Conversion Rate | Z-Score |
|---|---|---|---|---|
| Test Sample A | 180 | 45 | 25% | 1.33 |
| Test Sample B | 189 | 28 | 14.81% | -1.13 |
Neither z-score is larger than 1.65, so we cannot reject the null hypothesis, and the results might well change as the sample size grows. In this case, we can't decide which version is best, and we'd need a larger sample to make the right decision.
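To make that final check concrete, here is a small sketch that turns the z-scores from the table into one-tailed p-values and applies the 1.65 cutoff. Again it leans on scipy, which the post doesn't otherwise require:

```python
from scipy.stats import norm

cutoff = norm.ppf(0.95)  # one-tailed 95% cutoff, ≈ 1.65

# Z-scores taken from the table above.
for name, z in [("Test Sample A", 1.33), ("Test Sample B", -1.13)]:
    p_value = 1 - norm.cdf(z)  # chance of a result at least this extreme under H0
    verdict = "reject H0" if z > cutoff else "cannot reject H0"
    print(f"{name}: z = {z:.2f}, one-tailed p = {p_value:.3f} -> {verdict}")
```

Both samples land in the "cannot reject" zone, which is exactly the conclusion above.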