Comparing the empirical distribution of a variable across different groups is a common problem in data science. In particular, in causal inference, the problem often arises when we have to assess the quality of randomization.
When we want to assess the causal effect of a policy (or UX feature, ad campaign, drug, …), the gold standard in causal inference is the randomized controlled trial, also known as an A/B test. In practice, we select a sample for the study, randomly split it into a control and a treatment group, and compare the outcomes between the two groups. Randomization ensures that, on average, the only difference between the two groups is the treatment, so that we can attribute outcome differences to the treatment effect.

The problem is that, despite randomization, the two groups are never identical. Sometimes, they are not even "similar". For example, we might have more males in one group, or older people (we usually call these characteristics covariates or control variables). When that happens, we can no longer be certain that the difference in the outcome is due only to the treatment rather than to the imbalanced covariates. Therefore, after randomization, it is always important to check whether all observed variables are balanced across groups and whether there are no systematic differences. Another option, to be certain ex ante that certain covariates are balanced, is stratified sampling.

In this blog post, we are going to see different ways to compare two (or more) distributions and assess the magnitude and significance of their difference. We are going to consider two different approaches, visual and statistical. The two approaches generally trade off intuition with rigor: from plots, we can quickly assess and explore differences, but it's hard to tell whether these differences are systematic or due to noise.

Example

Let's assume we need to perform an experiment on a group of individuals and we have randomized them into a treatment and a control group. We would like them to be as comparable as possible, in order to attribute any difference between the two groups to the treatment effect alone.
We have also divided the treatment group into different arms for testing different treatments (e.g. slight variations of the same drug). For this example, I have simulated a dataset of 1000 individuals, for whom we observe a set of characteristics. I import the data generating process together with some plotting functions and libraries:

from src.utils import *

We have information on 1000 individuals, for whom we observe gender, age and weekly income. Each individual is assigned either to the treatment or the control group, and treated individuals are distributed across four treatment arms.

Two Groups — Plots

Let's start with the simplest setting: we want to compare the distribution of income across the treatment and control groups. We first explore visual approaches and then statistical approaches. The advantage of the first is intuition, while the advantage of the second is rigor. For most visualizations, I am going to use Python's seaborn library.

Boxplot

A first visual approach is the boxplot. The boxplot is a good trade-off between summary statistics and data visualization. The center of the box represents the median, while the borders represent the first (Q1) and third quartile (Q3), respectively.
The whiskers instead extend to the first data points that are more than 1.5 times the interquartile range (Q3 − Q1) outside the box. The points that fall outside of the whiskers are plotted individually and are usually considered outliers. Therefore, the boxplot provides both summary statistics (the box and the whiskers) and direct data visualization (the outliers).

sns.boxplot(data=df, x='Group', y='Income');

It seems that the income distribution in the treatment group is slightly more dispersed: the orange box is larger and its whiskers cover a wider range. However, the issue with the boxplot is that it hides the shape of the data, telling us some summary statistics but not showing us the actual data distribution.

Histogram

The most intuitive way to plot a distribution is the histogram. The histogram groups the data into equally wide bins and plots the number of observations within each bin.

sns.histplot(data=df, x='Income', hue='Group', bins=50);

There are multiple issues with this plot: since the two groups have a different number of observations, the raw counts are not directly comparable, and the number of bins is arbitrary.
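The comparability issue can be checked numerically: np.histogram with density=True rescales each group's histogram so that it integrates to one, which makes groups of different sizes comparable. A sketch on simulated incomes (the bin range and distribution parameters are my own assumptions, not the post's data):

```python
import numpy as np

rng = np.random.default_rng(0)
income_c = rng.lognormal(6.5, 0.4, size=600)  # control: larger group
income_t = rng.lognormal(6.5, 0.5, size=400)  # treatment: smaller group

bins = np.linspace(0, 5000, 51)  # 50 equally wide bins

# Raw counts are not comparable: the control group simply has more observations
counts_c, _ = np.histogram(income_c, bins=bins)
counts_t, _ = np.histogram(income_t, bins=bins)

# Densities are comparable: each histogram integrates to one on its own
dens_c, _ = np.histogram(income_c, bins=bins, density=True)
dens_t, _ = np.histogram(income_t, bins=bins, density=True)

width = bins[1] - bins[0]
print(dens_c.sum() * width, dens_t.sum() * width)  # both ≈ 1
```

This is exactly what the stat='density' and common_norm=False options do in seaborn below.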
We can solve the first issue using the stat option to plot the density instead of the count, and setting the common_norm option to False to normalize each histogram separately.

sns.histplot(data=df, x='Income', hue='Group', bins=50, stat='density', common_norm=False);

Now the two histograms are comparable! However, an important issue remains: the size of the bins is arbitrary. In the extreme, if we bunch the data less, we end up with bins holding at most one observation; if we bunch the data more, we end up with a single bin. In both cases, if we exaggerate, the plot loses informativeness. This is a classical bias-variance trade-off.

Kernel Density

One possible solution is to approximate the histogram with a continuous function, using kernel density estimation (KDE).

sns.kdeplot(x='Income', data=df, hue='Group', common_norm=False);

From the plot, it seems that the estimated kernel density of income has "fatter tails" (i.e. higher variance) in the treatment group, while the average seems similar across groups. The issue with kernel density estimation is that it is a bit of a black box and might mask relevant features of the data.

Cumulative Distribution

A more transparent representation of the two distributions is their cumulative distribution function. At each point of the x-axis (income) we plot the percentage of data points that have an equal or lower value.
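The empirical CDF behind this plot is easy to compute by hand. A sketch (my own helper, not from the post's utilities):

```python
import numpy as np

def ecdf(sample, x):
    """Fraction of observations in `sample` with a value <= x."""
    sample = np.sort(np.asarray(sample))
    return np.searchsorted(sample, x, side='right') / len(sample)

values = np.array([10, 20, 30, 40, 50])
print(ecdf(values, 30))  # 0.6: three of the five values are <= 30
```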
The main advantages of the cumulative distribution function are that we do not need to make any arbitrary choice (e.g. the number of bins) and we do not need to perform any approximation (e.g. with KDE): all data points are represented.

sns.histplot(x='Income', data=df, hue='Group', bins=len(df), stat='density', element='step', fill=False, cumulative=True, common_norm=False);

How should we interpret the graph? The two lines cross around the median, suggesting similar centers, but the treatment line lies above the control line on the left end and below it on the right end, suggesting fatter tails in the treatment group.
Q-Q Plot

A related method is the Q-Q plot, where Q stands for quantile. The Q-Q plot plots the quantiles of the two distributions against each other. If the distributions are the same, we should get a 45-degree line. There is no native Q-Q plot function in Python and, while the statsmodels package provides a qqplot function, it is quite cumbersome. Therefore, we will do it by hand. First, we compute the quantiles of the two groups, using the np.percentile function.

income = df['Income'].values

Now we can plot the two quantile distributions against each other, plus the 45-degree line, representing the benchmark perfect fit.

plt.figure(figsize=(8, 8))

The Q-Q plot delivers a very similar insight to the cumulative distribution plot: income in the treatment group has the same median (the lines cross in the center) but wider tails (the dots are below the line on the left end and above on the right end).

Two Groups — Tests

So far, we have seen different ways to visualize differences between distributions. The main advantage of visualization is intuition: we can eyeball the differences and intuitively assess them. However, we might want to be more rigorous and try to assess the statistical significance of the difference between the distributions, i.e. answer the question "is the observed difference systematic or due to sampling noise?". We are now going to analyze different tests to discern two distributions from each other.

T-test

The first and most common test is the Student t-test. T-tests are generally used to compare means. In this case, we want to test whether the mean of the income distribution is the same across the two groups.
The test statistic for the two-means comparison test is given by:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄ is the sample mean, s is the sample standard deviation and n is the sample size of each group. Under mild conditions, the test statistic is asymptotically distributed as a Student t distribution. We use the ttest_ind function from scipy to perform the t-test. The function returns both the test statistic and the implied p-value.

from scipy.stats import ttest_ind

The p-value of the test is 0.12; therefore, we do not reject the null hypothesis of no difference in means across treatment and control groups.
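A self-contained sketch of the full call, on simulated incomes rather than the post's df (the equal_var=False option gives Welch's version, which does not assume equal variances):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
income_t = rng.lognormal(6.5, 0.5, size=500)  # treatment
income_c = rng.lognormal(6.5, 0.4, size=500)  # control

# Welch's t-test: does not assume equal variances across groups
stat, p_value = ttest_ind(income_t, income_c, equal_var=False)

# The statistic matches the formula above
manual = (income_t.mean() - income_c.mean()) / np.sqrt(
    income_t.var(ddof=1) / len(income_t) + income_c.var(ddof=1) / len(income_c))
print(f"t = {stat:.3f}, p = {p_value:.3f}")
```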
Standardized Mean Difference (SMD)

In general, it is good practice to always perform a test for differences in means on all variables across the treatment and control group when we are running a randomized controlled trial or A/B test. However, since the denominator of the t-test statistic depends on the sample size, the t-test has been criticized for making p-values hard to compare across studies. In fact, we may obtain a significant result in an experiment with a very small magnitude of difference but a large sample size, while we may obtain a non-significant result in an experiment with a large magnitude of difference but a small sample size. One solution that has been proposed is the standardized mean difference (SMD). As the name suggests, this is not a proper test statistic, but just a standardized difference, which can be computed as:

SMD = (x̄₁ − x̄₂) / √((s₁² + s₂²) / 2)

Usually, a value below 0.1 is considered a "small" difference.

It is good practice to collect the average values of all variables across treatment and control groups, together with a measure of distance between the two — either the t-test or the SMD — into a table that is called a balance table. We can use the create_table_one function from the causalml library to generate it. As the name of the function suggests, the balance table should always be the first table you present when performing an A/B test.

from causalml.match import create_table_one

In the first two columns, we can see the averages of the different variables across the treatment and control groups, with standard errors in parentheses. In the last column, the values of the SMD indicate a standardized difference of more than 0.1 for all variables, suggesting that the two groups are probably different.

Mann–Whitney U Test

An alternative test is the Mann–Whitney U test.
The null hypothesis for this test is that the two groups have the same distribution, while the alternative hypothesis is that one group has larger (or smaller) values than the other. Differently from the other tests we have seen so far, the Mann–Whitney U test is robust to outliers and concentrates on the center of the distribution. The test procedure is the following: pool all observations and rank them; compute the sums of the ranks in each sample, R₁ and R₂; then compute U₁ = R₁ − n₁(n₁ + 1)/2 and U₂ = R₂ − n₂(n₂ + 1)/2, where n₁ and n₂ are the two sample sizes.
Under the null hypothesis of no systematic rank differences between the two distributions (i.e. same median), the test statistic is asymptotically normally distributed with known mean and variance. The intuition behind the computation of R and U is the following: if the values in the first sample were all smaller than the values in the second sample, then R₁ = n₁(n₁ + 1)/2 and, as a consequence, U₁ would be zero (the minimum attainable value). Otherwise, if the two samples were similar, U₁ and U₂ would both be very close to n₁n₂/2 (their expected value under the null). We perform the test using the mannwhitneyu function from scipy.

We get a p-value of 0.6, which implies that we do not reject the null hypothesis that the distribution of income is the same in the treatment and control groups.
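The rank computation can be checked against scipy on simulated data (a sketch; depending on the scipy version, mannwhitneyu reports either U₁ or min(U₁, U₂)):

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(2)
x = rng.normal(500, 80, size=30)    # e.g. control incomes
y = rng.normal(500, 120, size=40)   # e.g. treatment incomes: same center, wider spread

# Manual computation: rank the pooled sample, sum the ranks of the first group
ranks = rankdata(np.concatenate([x, y]))
r1 = ranks[:len(x)].sum()
u1 = r1 - len(x) * (len(x) + 1) / 2

u_scipy, p_value = mannwhitneyu(x, y)
print(u1, u_scipy, p_value)
```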
Permutation Tests

A non-parametric alternative is permutation testing. The idea is that, under the null hypothesis, the two distributions should be the same, therefore shuffling the group labels should not significantly alter any statistic. We can choose any statistic and check how its value in the original sample compares with its distribution across group-label permutations. For example, let's use as a test statistic the difference in sample means between the treatment and control groups.

The permutation test gives us a p-value of 0.056, implying a marginal non-rejection of the null hypothesis at the 5% level. How do we interpret the p-value? It means that the difference in means in the data is larger than 1 − 0.056 = 94.4% of the differences in means across the permuted samples. We can visualize the test by plotting the distribution of the test statistic across permutations against its sample value.

As we can see, the sample statistic is quite extreme with respect to the values in the permuted samples, but not excessively so.

Chi-Squared Test

The chi-squared test is a very powerful test that is mostly used to test differences in frequencies. One of the least known applications of the chi-squared test is testing the similarity between two distributions. The idea is to bin the observations of the two groups. If the two distributions were the same, we would expect the same frequency of observations in each bin. Importantly, we need enough observations in each bin in order for the test to be valid.
I generate bins corresponding to deciles of the distribution of income in the control group and then compute the expected number of observations in each bin in the treatment group, if the two distributions were the same.

We can now perform the test by comparing the expected (E) and observed (O) number of observations in the treatment group across bins. The test statistic is given by:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ

where the bins are indexed by i, Oᵢ is the observed number of data points in bin i and Eᵢ is the expected number of data points in bin i. Since we generated the bins using deciles of the distribution of income in the control group, we expect the number of observations per bin in the treatment group to be the same across bins. The test statistic is asymptotically distributed as a chi-squared distribution. To compute the test statistic and the p-value of the test, we use the chisquare function from scipy.

Differently from all the other tests so far, the chi-squared test strongly rejects the null hypothesis that the two distributions are the same. Why? The reason lies in the fact that the two distributions have a similar center but different tails, and the chi-squared test assesses similarity along the whole distribution, not only in the center, as we were doing with the previous tests. This result tells a cautionary tale: it is very important to understand what you are actually testing before drawing blind conclusions from a p-value!

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is probably the most popular non-parametric test to compare distributions. The idea of the Kolmogorov-Smirnov test is to compare the cumulative distributions of the two groups.
In particular, the Kolmogorov-Smirnov test statistic is the maximum absolute difference between the two cumulative distributions:

D = supₓ |F₁(x) − F₂(x)|

where F₁ and F₂ are the two cumulative distribution functions and x ranges over the values of the underlying variable. The asymptotic distribution of the Kolmogorov-Smirnov test statistic is the Kolmogorov distribution.

To better understand the test, let's plot the cumulative distribution functions and the test statistic. First, we compute the cumulative distribution functions. We then need to find the point where the absolute distance between the two cumulative distribution functions is largest, and we can visualize the value of the test statistic by plotting the two cumulative distribution functions together with that distance.

From the plot, we can see that the value of the test statistic corresponds to the distance between the two cumulative distributions at income ≈ 650. For that value of income, we have the largest imbalance between the two groups. We can now perform the actual test using the kstest function from scipy.

The p-value is below 5%: we reject the null hypothesis that the two distributions are the same, with 95% confidence.
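The statistic can be computed by hand and checked against scipy's two-sample routine, ks_2samp. A sketch on simulated data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
income_c = rng.normal(500, 80, size=400)   # control
income_t = rng.normal(500, 120, size=400)  # treatment: same mean, fatter tails

# Manual KS statistic: largest gap between the two empirical CDFs,
# evaluated at every observed data point
grid = np.sort(np.concatenate([income_c, income_t]))
cdf_c = np.searchsorted(np.sort(income_c), grid, side='right') / len(income_c)
cdf_t = np.searchsorted(np.sort(income_t), grid, side='right') / len(income_t)
d_manual = np.abs(cdf_c - cdf_t).max()

d_scipy, p_value = ks_2samp(income_c, income_t)
print(d_manual, d_scipy, p_value)
```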
Multiple Groups — Plots

So far we have only considered the case of two groups: treatment and control. But what if we had multiple groups? Some of the methods we have seen above scale well, while others don't. As a working example, we are now going to check whether the distribution of income is the same across treatment arms.

Boxplot

The boxplot scales very well when the number of groups is in the single digits, since we can put the different boxes side by side. From the plot, it looks like the distribution of income is different across treatment arms, with higher-numbered arms having a higher average income.

Violin Plot

A very nice extension of the boxplot that combines summary statistics and kernel density estimation is the violin plot. The violin plot displays separate densities along the y-axis so that they don't overlap. By default, it also adds a miniature boxplot inside. As for the boxplot, the violin plot suggests that income is different across treatment arms.

Ridgeline Plot

Lastly, the ridgeline plot plots multiple kernel density distributions along the x-axis, making them more intuitive than the violin plot, but partially overlapping them. Unfortunately, there is no default ridgeline plot in either matplotlib or seaborn, so we need to import it from the joypy package. Again, the ridgeline plot suggests that higher-numbered treatment arms have higher income. From this plot, it is also easier to appreciate the different shapes of the distributions.

Multiple Groups — Tests

Lastly, let's consider hypothesis tests to compare multiple groups.
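The workhorse here is one-way ANOVA, available as f_oneway in scipy. A sketch with four simulated arms (parameters are my own assumptions), checking scipy's statistic against the by-hand variance decomposition:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(6)
# Four simulated treatment arms with increasing mean income
arms = [rng.normal(500 + 20 * k, 100, size=250) for k in range(4)]

f_stat, p_value = f_oneway(*arms)

# By-hand check: between-group variance over within-group variance
grand_mean = np.concatenate(arms).mean()
G, N = len(arms), sum(len(a) for a in arms)
between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arms) / (G - 1)
within = sum(((a - a.mean()) ** 2).sum() for a in arms) / (N - G)
print(f"F = {f_stat:.2f} (manual {between / within:.2f}), p = {p_value:.5f}")
```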
For simplicity, we will concentrate on the most popular one: the F-test.

F-test

With multiple groups, the most popular test is the F-test. The F-test compares the variance of a variable across different groups. This analysis is also called analysis of variance, or ANOVA. In practice, the F-test statistic is given by:

F = [Σ_g n_g (x̄_g − x̄)² / (G − 1)] / [Σ_g Σ_i (x_ig − x̄_g)² / (N − G)]

where G is the number of groups, N is the number of observations, x̄ is the overall mean and x̄_g is the mean within group g. Under the null hypothesis of group independence, the F-statistic is F-distributed.

The test p-value is basically zero, implying a strong rejection of the null hypothesis of no differences in the income distribution across treatment arms.

Conclusion

In this post, we have seen a ton of different ways to compare two or more distributions, both visually and statistically. This is a primary concern in many applications, but especially in causal inference, where we use randomization to make treatment and control groups as comparable as possible. We have also seen how different methods might be better suited for different situations. Visual methods are great for building intuition, but statistical methods are essential for decision-making, since we need to be able to assess the magnitude and statistical significance of the differences.

References

[1] Student, The Probable Error of a Mean (1908), Biometrika.
[2] F. Wilcoxon, Individual Comparisons by Ranking Methods (1945), Biometrics Bulletin.
[3] B. L. Welch, The generalization of "Student's" problem when several different population variances are involved (1947), Biometrika.
[4] H. B. Mann, D. R. Whitney, On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other (1947), The Annals of Mathematical Statistics.
[5] E. Brunner, U. Munzel, The Nonparametric Behrens-Fisher Problem: Asymptotic Theory and a Small-Sample Approximation (2000), Biometrical Journal.
[6] A. N. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione [On the empirical determination of a distribution law] (1933), Giornale dell'Istituto Italiano degli Attuari.
[7] H. Cramér, On the composition of elementary errors (1928), Scandinavian Actuarial Journal.
[8] R. von Mises, Wahrscheinlichkeit, Statistik und Wahrheit [Probability, Statistics and Truth] (1936), Bulletin of the American Mathematical Society.
[9] T. W. Anderson, D. A. Darling, Asymptotic Theory of Certain "Goodness of Fit" Criteria Based on Stochastic Processes (1953), The Annals of Mathematical Statistics.
Code

You can find the original Jupyter Notebook here: Blog-Posts/distr.ipynb at main · matteocourthoud/Blog-Posts (github.com).

Thank you for reading! I really appreciate it! 🤗 If you liked the post and would like to see more, consider following me. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations. Also, a small disclaimer: I write to learn, so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!