A note on Simpson's Paradox

Recently on multiple occasions, I came across this thing called Simpson's Paradox.

If some relationship exists between two variables in the data, the relationship will not hold as we partition the data into groups.

For example, during the early days of COVID-19, China reported lower mortality rates due to COVID as compared to Italy. But as we partition the data according to the age group, all the age groups have a lower mortality rate for Italy.

So why the opposite conclusion?

Due to more people belonging to the older age group in Italy, Italy has an overall higher mortality rate.

Let's look at a couple of more examples to make things concrete.

Source: Simpson’s Paradox: a cautionary tale in advanced analytics(significancemagazine.com)

In the above figure, we have price versus demand for a product. If we consider the overall period (all weeks), we do not see any trend, but the moment we group the data based on different time frames, we see a downward trend in demand with an increase in prices.

Source: Sprenger, Jan and Naftali Weinberger, "Simpson’s Paradox", The Stanford Encyclopedia of Philosophy (Summer 2021 Edition), Edward N. Zalta (ed.)

In the above table, we have two groups, treatment ( the group to which we provide the treatment) and control (we do not give the treatment). If you observe the success rate for the treatment group for the overall population, it is similar to that of the control group (50%). Indicating treatment looks ineffective. But, the moment we bring the gender in, the treatment appears to be effective for males (61% versus 57%) and females (44% versus 40%).

An application to machine learning:

Suppose you are doing an image classification task. You observe that model A has better accuracy than model B. But, someone pointed out that even though model A has better accuracy overall, model B performed well on low-resolution images. The reason it has better accuracy overall is that the data is imbalanced. That is, we have more number of higher-resolution images. But, the pictures you get from the real world are more likely to be in lower resolution. Hence, it is wiser to deploy model B for this specific task.

A small technical detail:

The variable used to group the data is known as the confounding variable. Confounding variables can affect the relationship between the independent variable (the variable being manipulated or studied) and the dependent variable (the variable being measured or observed). In the above example, image resolution is our confounding variable.

A note on Simpson's Paradox¶