A note on correlation

Correlation is a metric which measures the strength of a linear relationship between variables.
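To make the word "linear" concrete, here is a minimal sketch in R. The data are hypothetical and chosen only for illustration: one variable depends on `x` through a straight line, the other through a parabola. Both relationships are exact, yet only the linear one produces a Pearson correlation close to 1; the symmetric quadratic one gives a value close to 0.

```r
# Hypothetical illustration data, not taken from the note
x <- seq(-1, 1, length.out = 201)  # grid symmetric around 0
y_linear <- 2 * x + 1              # exact linear relationship
y_quadratic <- x^2                 # exact but non-linear relationship

# cor() computes Pearson's coefficient by default
cor_linear <- cor(x, y_linear)
cor_quadratic <- cor(x, y_quadratic)

print(cor_linear)
print(cor_quadratic)
```

A correlation near 0 therefore does not rule out a relationship; it only indicates that there is no linear one.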

When examining the correlation between two variables, it also helps to visualize the data, because a plot shows what a given correlation value actually looks like.

The scatter plots generated by the R code at the end of this note show samples with Pearson correlation coefficients of 0, 0.5, 0.7, and 0.9.

The difference between correlations of 0.5 and 0.7 is not the same as the difference between 0.7 and 0.9.

In terms of information content, a correlation of 0.7 is closer to 0.5 than to 0.9.

Similarly, a correlation of 0.5 is closer to 0 than to 1.

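What does that mean?

A correlation of 0.5 says less about the relationship than the number suggests: measured by shared information, the two variables are closer to being unrelated than to being perfectly related.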

Technical Note:

The variables X and Y are assumed to be jointly Gaussian, each with a standard normal marginal distribution.

We use mutual information to measure the information content.

The mutual information between two jointly Gaussian random variables X and Y is given by:

$$ I(X;Y) = -\frac{1}{2}\ln(1-\rho^2) $$

$$ \text{where }\rho\text{ is the correlation coefficient between X and Y}$$

For a correlation of 0.5, I(X;Y) is 0.14384 nats.

Similarly, for a correlation of 0.7, I(X;Y) is 0.33667 nats.

Also, for a correlation of 0.9, I(X;Y) is 0.83037 nats.

Going from 0.5 to 0.7 adds about 0.19 nats, while going from 0.7 to 0.9 adds about 0.49 nats. Hence, a correlation of 0.7 is closer to 0.5 than to 0.9, in terms of information content.
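The mutual information values can be recomputed directly from the formula. The sketch below assumes the bivariate Gaussian setting of this note; the helper name `gaussian_mi` is made up for illustration. It also compares the gap between 0.5 and 0.7 with the gap between 0.7 and 0.9.

```r
# Mutual information (in nats) of a bivariate Gaussian with correlation rho.
# gaussian_mi is a hypothetical helper name, not part of any package.
gaussian_mi <- function(rho) {
  -0.5 * log(1 - rho^2)  # log() is the natural logarithm in R
}

rho <- c(0, 0.5, 0.7, 0.9)
mi <- gaussian_mi(rho)
print(round(mi, 5))

# Gaps between neighbouring correlation values
gap_05_07 <- mi[3] - mi[2]
gap_07_09 <- mi[4] - mi[3]
print(round(c(gap_05_07, gap_07_09), 5))
```

Because the formula diverges as the correlation approaches 1, equal steps in correlation carry more and more information the closer they are to 1.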

R code

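library(MASS)  # provides mvrnorm()

samples <- 500
r <- c(0, 0.5, 0.7, 0.9)  # target correlation coefficients

par(mfrow = c(2, 2))  # 2 x 2 grid of scatter plots
for (i in seq_along(r)) {
  # Covariance matrix with unit variances and correlation r[i]
  sigma <- matrix(c(1, r[i], r[i], 1), nrow = 2)
  # empirical = TRUE makes the sample mean and covariance match mu and Sigma exactly
  xy <- mvrnorm(n = samples, mu = c(0, 0), Sigma = sigma, empirical = TRUE)
  X <- xy[, 1]  # mean 0, sd 1
  Y <- xy[, 2]  # mean 0, sd 1
  plot(X, Y, main = paste("Corr =", r[i]), col = "red")
}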
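The plotting code passes `empirical=TRUE` to `mvrnorm`, which forces the sample covariance to match `Sigma` exactly instead of only in expectation. As a rough check, the sketch below draws one sample each way for a correlation of 0.7 and compares the sample correlations. The seed and the variable names are arbitrary choices made for this illustration.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # arbitrary seed so the random draw is reproducible
sigma <- matrix(c(1, 0.7, 0.7, 1), nrow = 2)

# Same target correlation, with and without the empirical constraint
exact <- mvrnorm(n = 500, mu = c(0, 0), Sigma = sigma, empirical = TRUE)
random <- mvrnorm(n = 500, mu = c(0, 0), Sigma = sigma, empirical = FALSE)

cor_exact <- cor(exact[, 1], exact[, 2])     # matches 0.7 up to rounding error
cor_random <- cor(random[, 1], random[, 2])  # near 0.7, varies with the seed
print(c(cor_exact, cor_random))
```

With `empirical = TRUE`, each panel title reports the correlation actually present in the plotted sample, not just the population value it was drawn from.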