Please explain correlation coefficient formula


By Brad Rodgers (P1930) on Wednesday, September 20, 2000 - 07:36 pm:

Can someone help explain what the formula


\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \times \sum (y_i - \bar{y})^2}}
\]

means?


Where x bar is the average x, and y bar is the average y. All I know about this is that it describes the correlation coefficient for a line through a graph of scattered points.

If you can explain what it means, can you explain how it is formulated?

Thanks,

Brad
By Dave Sheridan (Dms22) on Thursday, September 21, 2000 - 02:16 pm:

This does look a little complicated if you don't know where the terms come from. I'll use Ex or E(x) to denote the average of x (read: the expectation of x). The first interesting thing to think about if you have a set of data is what Ex is. This allows us to compare data sets to see whether one is, on average, bigger than the other. But this isn't particularly interesting on its own, because I could just add 100 to each point and obtain a data set with a larger mean but otherwise the same properties as the original one.

Generally, we want to know how far away the points are from the mean, at least on average. Quite how to interpret this statement is difficult, but there are various reasons why the mean squared distance is the most important measure. By the definition of the mean, the sum of the "errors" obtained by saying "every point is actually the mean" is zero, because the positive and negative errors cancel. However, if you square each of these errors before summation you'll get something non-negative, and in all interesting situations (at least one point is not the mean) it'll be strictly positive. The average of these squared errors is called the variance,
Var(X) = E((X-EX)^2) = E(X^2) - (EX)^2,
where the first equality is a definition, variance being the expected value of the square of the difference between a point and its mean, and the second equality follows by expanding the quadratic and cancelling.
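
As a rough illustration of the two equalities above, here is a minimal Python sketch. The data values and the helper names `mean` and `variance` are made up for this example, and E is taken to be the plain average over the data points.

```python
def mean(values):
    """Average of a list of numbers: the E(x) of the explanation above."""
    return sum(values) / len(values)


def variance(values):
    """Mean of the squared differences between each point and the mean."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


# Made-up data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

m = mean(data)
deviations = [v - m for v in data]

# The raw "errors" sum to zero, which is why we square them first.
sum_of_errors = sum(deviations)

# Definition: E((X - EX)^2).
by_definition = variance(data)

# Expanded form: E(X^2) - (EX)^2.
by_expansion = mean([v * v for v in data]) - m ** 2

# Adding 100 to every point changes the mean but not the variance.
shifted = [v + 100 for v in data]

print(sum_of_errors, by_definition, by_expansion, variance(shifted))
```

The two ways of computing the variance agree, and shifting every point by 100 leaves it unchanged, which is the sense in which the mean alone says little about the shape of the data.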

Variance of a set of data points is a useful concept, but sometimes we have two separate sets which clearly should be regarded as distinct: for example, the x values and the y values in a scatter graph like the one you mention (though this explanation works in full generality, even on abstract distributions, if you define Ex properly). We need something called covariance to deal with more than one set of data points.

Suppose I give you data on the number of umbrellas bought each day and the number of sandwiches bought in a particular shop. They're unlikely to be related. However, if I give you two data sets which are identical, they are intimately related to each other. This leads to the concept of correlation, but we need an essential step in between.

It should be reasonably intuitive that the generalisation of E((X-EX)^2) to two data sets should be E((X-EX)(Y-EY)), and again there are mathematical reasons why this is a particularly good way to measure the error in assuming that the x points are equal to their mean, at the same time as the y points being equal to their mean. This concept is called the covariance and a little rearranging gives
Cov(X,Y)=E((X-EX)(Y-EY))=E(XY)-(EX)(EY)
which should remind you of variance above. Indeed, we could say Var(X)=Cov(X,X).
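
A small Python sketch of this, under the assumption that E means the plain average over paired data points. The data and the helper names `mean` and `covariance` are invented for the illustration.

```python
def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """E((X - EX)(Y - EY)) for paired data of equal length."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


# Made-up paired data: the i-th x goes with the i-th y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

# Definition: E((X - EX)(Y - EY)).
by_definition = covariance(xs, ys)

# Rearranged form: E(XY) - (EX)(EY).
by_expansion = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

# Var(X) = Cov(X, X): the covariance of a data set with itself.
var_x = covariance(xs, xs)

print(by_definition, by_expansion, var_x)
```

Both forms give the same covariance, and feeding the same data set in twice recovers its variance.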

There's no generalisation to three quantities. Instead, we specify a covariance matrix, which tells us the covariance of each pair of data sets.
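
To make the covariance matrix concrete, here is a hedged Python sketch for three made-up data sets. The function name `covariance_matrix` and the data are assumptions for illustration only; entry (i, j) holds the covariance of data set i with data set j.

```python
def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """E((X - EX)(Y - EY)) for paired data of equal length."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def covariance_matrix(columns):
    """Matrix whose (i, j) entry is the covariance of columns i and j."""
    return [[covariance(a, b) for b in columns] for a in columns]


# Three made-up data sets measured on the same five occasions.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
zs = [5.0, 4.0, 3.0, 2.0, 1.0]

matrix = covariance_matrix([xs, ys, zs])

for row in matrix:
    print(row)
```

The diagonal entries are the variances of the individual data sets, and the matrix is symmetric because Cov(X,Y) = Cov(Y,X).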

The Cauchy-Schwarz inequality tells us that
|E((X-EX)(Y-EY))| <= sqrt(E((X-EX)^2) E((Y-EY)^2))
ie |Cov(X,Y)| <= sqrt(Var(X)Var(Y))
so if we consider
r = Cov(X,Y)/sqrt(Var(X)Var(Y)),
this is a number between -1 and 1 which tells us how closely (linearly) related the distributions are. With a little more theory, it's simple to prove that independent data sets (or distributions) have zero correlation (although the converse is not in general true) and, more easily, that if X and Y are identical then their correlation is 1.
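
The properties just described can be checked numerically. The following Python sketch uses invented data and a hypothetical helper `correlation`; it assumes neither data set is constant, since the denominator would then be zero.

```python
import math


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """E((X - EX)(Y - EY)) for paired data of equal length."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def correlation(xs, ys):
    """r = Cov(X,Y) / sqrt(Var(X) Var(Y)), using Var(X) = Cov(X,X)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Identical data sets: correlation 1.
r_identical = correlation(xs, xs)

# An exact decreasing straight line: correlation -1.
r_line_down = correlation(xs, [3.0 - 2.0 * x for x in xs])

# Scattered data: somewhere strictly between -1 and 1.
r_scattered = correlation(xs, [2.0, 1.0, 4.0, 3.0, 5.0])

# y = x^2 on symmetric x values: y is completely determined by x,
# yet the correlation is zero, so zero correlation does not imply
# that the data sets are unrelated.
sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_parabola = correlation(sym, [x * x for x in sym])

print(r_identical, r_line_down, r_scattered, r_parabola)
```

The last case is one way to see why the converse fails: r only picks up straight-line relationships.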

If you rewrite the above equation using the definitions I've given earlier, taking E to be the average over your n data points, you'll find it's identical to your expression, but in a much more easily understood form.
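
For completeness, here is one way the rewriting might be set out. It assumes there are n data points, each given equal weight 1/n, so that EX is x bar and EY is y bar; the index i and the symbol n are notation introduced here.

```latex
\begin{align*}
\operatorname{Cov}(X,Y) &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \\
\operatorname{Var}(X) &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad
\operatorname{Var}(Y) = \frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2, \\
r &= \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}
   = \frac{\frac{1}{n}\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\frac{1}{n^2}\sum_{i}(x_i-\bar{x})^2 \sum_{i}(y_i-\bar{y})^2}}
   = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}
          {\sqrt{\sum_{i}(x_i-\bar{x})^2 \times \sum_{i}(y_i-\bar{y})^2}}.
\end{align*}
```

The factor 1/n in the numerator cancels with the square root of 1/n^2 in the denominator, which is why no n appears in the formula Brad quoted.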

Hope that helps - I'm happy to explain anything in more depth if you like.

-Dave