Covariance and Correlation

Forget, for the moment, all that you've learned about regression analysis. Instead, think of how we might have begun our study of relationships, if we had chosen the more modest goal of finding a single number which measures the strength of the linear relationship between a pair of variables. Consider three examples, each involving a population of four individuals:

X  Y  XY           X  Y  XY           X  Y  XY
1  1   1           1  2   2           1  4   4
2  2   4           2  4   8           2  3   6
3  3   9           3  1   3           3  2   6
4  4  16           4  3  12           4  1   4
mean XY = 7.5      mean XY = 6.25     mean XY = 5.0

In the first case, there is a strong upward-sloping relationship between X and Y; in the second case, no apparent relationship; in the third case, a strong downward-sloping relationship. Note the pairwise products: When X and Y are big together, a large product results; when at least one of X or Y is small, the product is not so large. When will the mean product be largest? When the big X's are associated with big Y's (and the little X's with little Y's), i.e., when the relationship is upward-sloping. When will the mean product be smallest? When the big X's are associated with little Y's and the little X's with big Y's, i.e., when the relationship is downward-sloping and there are no large pairwise products. Of course, if X and Y are independent, the mean product is just the product of the means, i.e., E[XY] = E[X]·E[Y].

The covariance of X and Y is the difference between the mean product and the product of the means: Cov(X,Y) = E[XY] - E[X]·E[Y]. In the examples above, the respective covariances are 1.25, 0, and -1.25. Quite generally, positive covariances indicate upward-sloping relationships, and negative covariances indicate downward-sloping relationships.
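
For concreteness, here is a minimal Python sketch (using NumPy; the labels and variable names are my own) that reproduces the mean products and covariances of the three four-person populations above:

    import numpy as np

    # The three four-person populations (X, Y) from the tables above.
    examples = {
        "upward":   (np.array([1, 2, 3, 4]), np.array([1, 2, 3, 4])),
        "none":     (np.array([1, 2, 3, 4]), np.array([2, 4, 1, 3])),
        "downward": (np.array([1, 2, 3, 4]), np.array([4, 3, 2, 1])),
    }

    for name, (x, y) in examples.items():
        mean_xy = np.mean(x * y)                    # E[XY]
        cov = mean_xy - np.mean(x) * np.mean(y)     # Cov(X,Y) = E[XY] - E[X]·E[Y]
        print(f"{name:>8}: mean XY = {mean_xy:.2f}, Cov(X,Y) = {cov:+.2f}")
    # Prints mean products 7.50, 6.25, 5.00 and covariances +1.25, 0.00, -1.25.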

Covariance is an interesting concept in its own right. A fundamental relationship (frequently encountered in financial analysis) is: Var(X+Y) = Var(X) + Var(Y) + 2·Cov(X,Y). But the units of measurement of covariance are not very natural. For example, the covariance of net income and net leisure expenditures is measured in square dollars.
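
A quick numerical check of this identity, again in NumPy (the paired values below are arbitrary; np.var computes the population variance, matching the treatment of these examples as populations):

    import numpy as np

    x = np.array([1, 2, 3, 4])
    y = np.array([1, 4, 2, 3])                       # any paired values will do

    cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
    lhs = np.var(x + y)                              # Var(X+Y), population form
    rhs = np.var(x) + np.var(y) + 2 * cov_xy         # Var(X) + Var(Y) + 2·Cov(X,Y)
    assert np.isclose(lhs, rhs)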

The correlation of X and Y is the normalized covariance: Corr(X,Y) = Cov(X,Y) / (σX·σY) .

The correlation of a pair of random variables is a dimensionless number, ranging between +1 and -1. It is +1 only for a perfect upward-sloping relationship (where by “perfect” we mean that the observations all lie on a single line), and is -1 for a perfect downward-sloping relationship. In the examples above, the correlations are +1, 0, and -1. The more widely-scattered the (X,Y) pairs are about a line, the closer the correlation is to 0. (Notice that the covariance of X with itself is Var(X), and therefore the correlation of X with itself is 1.)
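
The same three populations give the advertised correlations; a small sketch, with np.std supplying the population standard deviations (printed values may differ in the last few digits due to floating-point rounding):

    import numpy as np

    def corr(x, y):
        # Population correlation: Cov(X,Y) / (sigma_X * sigma_Y)
        cov = np.mean(x * y) - np.mean(x) * np.mean(y)
        return cov / (np.std(x) * np.std(y))         # np.std defaults to the population form

    x = np.array([1, 2, 3, 4])
    print(corr(x, np.array([1, 2, 3, 4])))           # ≈  1.0 (perfect upward-sloping)
    print(corr(x, np.array([2, 4, 1, 3])))           # ≈  0.0 (no linear relationship)
    print(corr(x, np.array([4, 3, 2, 1])))           # ≈ -1.0 (perfect downward-sloping)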

Correlation is a measure of the strength of the linear relationship between two variables. Strength refers to how nearly linear the relationship is, not to the slope of the relationship. Linear means that correlation says nothing about possible nonlinear relationships; in particular, independent random variables are uncorrelated (i.e., have correlation 0), but uncorrelated random variables are not necessarily independent, and may be strongly nonlinearly related. Two means that the correlation shows only the shadows of a multivariate linear relationship among three or more variables (and it is common knowledge that shadows may be severe distortions of reality).

The Coefficient of Determination

Consider a “simple” regression model, i.e., one involving only a single independent variable:

Y = α + βX + ε .

In this case, the estimates of the coefficients can be written quite simply:

b = Cov(X,Y) / Var(X) , and  a = Ȳ - b·X̄ ,

where X̄ and Ȳ are the sample means of the two variables. (Note that the formula for b is appropriately dimensioned in units of Y per unit of X, and that the formula for a guarantees that the line corresponding to the prediction equation passes through the “group mean” point (X̄, Ȳ).)
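
A sketch of these two formulas, checked against NumPy's built-in least-squares fit (the data values are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # made-up sample
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    b = (np.mean(x * y) - x.mean() * y.mean()) / np.var(x)   # b = Cov(X,Y) / Var(X)
    a = y.mean() - b * x.mean()                               # a = Ybar - b·Xbar

    slope, intercept = np.polyfit(x, y, deg=1)                # least-squares line, for comparison
    assert np.isclose(b, slope) and np.isclose(a, intercept)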

The (unadjusted) coefficient of determination for the regression is the fraction of the variance in Y which is “explained” by the regression:

Var(a+bX) / Var(Y) = b²·σX² / σY²

= [Cov(X,Y) / σX²]²·σX² / σY² = [Cov(X,Y) / (σX·σY)]² = [Corr(X,Y)]² .

In words: In a simple linear regression, the (unadjusted) coefficient of determination is the square of the correlation between the dependent and independent variables. (Since the symbol “R” is sometimes used to represent the correlation between two variables, the coefficient of determination is sometimes called the “R-square” of a regression.) This provides a natural way to interpret a correlation: Square it, and interpret it as the coefficient of determination of the regression linking the two variables.
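
A numerical illustration of the identity, continuing with the same made-up data as above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    b = (np.mean(x * y) - x.mean() * y.mean()) / np.var(x)
    a = y.mean() - b * x.mean()

    r_squared = np.var(a + b * x) / np.var(y)        # fraction of Var(Y) "explained"
    r = np.corrcoef(x, y)[0, 1]                      # Corr(X,Y)
    assert np.isclose(r_squared, r ** 2)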

Association vs. Causality

Regression analysis can demonstrate that variations in the independent variables are associated with variations in the dependent variable. But regression analysis alone (i.e., in the absence of controlled experiments) cannot show that changes in the independent variables will cause changes in the dependent variable.

Example: In the late 1940s, a nationwide study conducted over several years found a high correlation between the incidence rate of new cases of polio among children in a community, and per capita ice cream consumption in the community. (Equivalently, a simple regression model, using ice cream consumption to predict the rate of occurrence of new polio cases, had a high coefficient of determination.) Fortunately for those of us who like ice cream, a re-examination of the data showed that the high values of both variables occurred in communities where the study collected data in the summertime, and the low values of both occurred in communities where the data was collected during the winter. Polio – which we now know to be a communicable viral infection – spreads more easily when children gather in heterogeneous groups in relatively unsanitary conditions, i.e., it spreads more easily during summer vacation than when the children are in school. The high correlation in no way provided evidence that ice cream consumption causes or promotes polio epidemics.

[Evidence of causality is built upon controlled experimentation. We take as a null hypothesis that some potentially-causal factor (e.g., tobacco consumption) does not have a causal effect on some target factor (e.g., the incidence rate of heart disease, or lung cancer). We then monitor two separate groups of individuals, identical in all other ways, and expose one group to the potentially-causal factor. If we obtain statistically-significant evidence that the target factor differs between the two groups, we infer that the cause of the difference is the factor under investigation.]

Many regression studies are conducted specifically to estimate the effect of some causal factor on some other variable of interest (e.g., the effect of television advertising on sales). This is perfectly legitimate, as long as we remember that the assertion of causality comes from us, outside the regression analysis.

The Normalized Prediction Equation

Recall the prediction equation,

Ypred = a + b1X1 +...+ bkXk .

As in the case of a simple regression, the group mean must satisfy the prediction equation, i.e., a = Ȳ - b1·X̄1 - ... - bk·X̄k . Substituting this into the prediction equation to eliminate a, and then rearranging terms and dividing both sides of the equation by σY, yields

(Ypred - Ȳ) / σY = [b1·σX1 / σY]·[(X1 - X̄1) / σX1]

      + ... + [bk·σXk / σY]·[(Xk - X̄k) / σXk] .

We have done nothing fancy here: We've merely rewritten the regression equation in a slightly different form. However, this form gives us both an interpretation of the correlation between two random variables (when k = 1) and a definition and interpretation of the beta-weights of the independent variables in a multiple regression (when k > 1).
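
As a check that the rearrangement really is nothing more than algebra, the sketch below fits a two-variable regression by ordinary least squares (np.linalg.lstsq; all data are simulated) and confirms that the normalized form reproduces the same standardized predictions:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))                                  # two simulated independent variables
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

    # Ordinary least squares for y = a + b1*X1 + b2*X2.
    A = np.column_stack([np.ones(len(y)), X])
    a, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
    y_pred = a + b1 * X[:, 0] + b2 * X[:, 1]

    # Left and right sides of the normalized prediction equation.
    lhs = (y_pred - y.mean()) / y.std()
    rhs = (b1 * X[:, 0].std() / y.std()) * (X[:, 0] - X[:, 0].mean()) / X[:, 0].std() \
        + (b2 * X[:, 1].std() / y.std()) * (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()
    assert np.allclose(lhs, rhs)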

Regression to the Mean

When k=1, the normalized equation specializes to

(Ypred - Ȳ) / σY = [b·σX / σY]·[(X - X̄) / σX] .

Recall that, in a simple linear regression, b = Cov(X,Y)/Var(X). Therefore, the first bracketed expression is simply the correlation of X and Y, i.e.,

(Ypred - Ȳ) / σY = [Corr(X,Y)]·[(X - X̄) / σX] .

In words: If we compute the number of standard deviations an individual's X is different from the mean value of X, and multiply by the correlation between X and Y, we obtain the number of standard deviations different from the mean value of Y our prediction will be for that individual.

While it might not be immediately apparent, this actually offers us an important insight. Assume that the same characteristic of an individual (for example, that individual's level of understanding of statistics) is measured twice, using imperfect measuring devices (for example, two examinations). Let X be the first measurement (exam score), and Y the second. Given that the individual's first measurement is one standard deviation above average, we would predict that the second is only Corr(X,Y) standard deviations above average. As long as the devices do indeed provide some measure of the characteristic being studied, but are less than perfect, this correlation will be positive, but less than one. Consequently, we predict that the second measurement will be less extreme than the first. (Similarly, after observing a lower-than-average measurement the first time, we'll predict a still-lower-than-average, but somewhat higher, measurement the second time.) This phenomenon is known as regression to the mean.
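
A small numerical sketch of this prediction; the exam means, standard deviations, and the correlation of 0.6 are invented purely for illustration:

    # Hypothetical exam-score parameters (invented for illustration).
    mean_x, sd_x = 70.0, 10.0        # first exam
    mean_y, sd_y = 70.0, 10.0        # second exam
    corr_xy = 0.6                    # assumed correlation between the two exam scores

    x = 85.0                                      # observed first score: 1.5 SDs above average
    z_x = (x - mean_x) / sd_x
    y_pred = mean_y + corr_xy * z_x * sd_y        # predicted second score
    print(y_pred)                                 # 79.0 -- only 0.9 SDs above average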

Regression to the mean is often misinterpreted as saying that, in the long run, everyone is mediocre. This is not correct. (Indeed, if we let X be the second measurement, and Y the first, we will predict from knowledge only of the second measurement that the first was somewhat less extreme.) Individuals measure above average for either of two reasons: They are truly above average, or they were lucky. We can't be sure which reason applies. If they are truly above average, we expect another measurement to also be high; if they were merely lucky, we expect the second measurement to be not so high. Taking both possibilities into account, we must moderate our prediction. Others, not so high in their first measurements (either because they truly measure high but were unlucky, or because their true measurement is in fact not so high), may well have higher second measurements (due to better luck the second time), and so the overall distribution of second scores can be just as diverse as the distribution of first scores.

Beta-Weights

In the multiple regression equation

Ypred = a + b1X1 +...+ bkXk ,

the coefficient bi is scaled in units of Y per unit of Xi . Consequently, the coefficients of the independent variables are not directly comparable. (If, for example, Xi were measured in feet and Y in dollars, the coefficient bi would be expressed in dollars/foot. Changing the scale of Xi to miles would increase the coefficient bi by a factor of 5,280.)

In contrast, consider the normalized regression equation:

(Ypred - Ȳ) / σY = [b1·σX1 / σY]·[(X1 - X̄1) / σX1]

      + ... + [bk·σXk / σY]·[(Xk - X̄k) / σXk] .

The quantity bi σXi / σY is called the beta-weight (or the “normalized regression coefficient”) of Xi . It is a pure (dimensionless) number, and hence the beta-weights of the various independent variables can be compared with one another.

The beta-weight of Xi is the number of standard deviations of change in the predicted value of Y associated with one standard deviation of change in Xi . If we think of the standard deviation of Xi as one unit of “typically-observed” variation in that variable, then the beta-weight of Xi is the amount of variation in Y associated with a typically-observed amount of variation in Xi . In other words, the beta-weights of the independent variables indicate the relative importance of variation in each of the independent variables, in explaining why variation in the dependent variable is observed within the population.
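
A sketch of the computation (simulated data; the feet-to-miles rescaling echoes the earlier example and shows that beta-weights, unlike the raw coefficients, are unaffected by a change of units):

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.normal(50.0, 5.0, size=200)             # e.g., a variable measured in feet
    x2 = rng.normal(10.0, 2.0, size=200)
    y = 4.0 + 0.3 * x1 + 2.0 * x2 + rng.normal(scale=1.0, size=200)

    def beta_weights(y, *xs):
        # Fit y = a + b1*x1 + ... + bk*xk by least squares; return each bi*sd(xi)/sd(y).
        A = np.column_stack([np.ones(len(y))] + list(xs))
        coefs = np.linalg.lstsq(A, y, rcond=None)[0][1:]          # drop the intercept
        return np.array([b * x.std() / y.std() for b, x in zip(coefs, xs)])

    print(beta_weights(y, x1, x2))                   # dimensionless, directly comparable
    print(beta_weights(y, x1 / 5280.0, x2))          # rescaling x1 (feet to miles) changes nothing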

Note that the beta-weights measure the relative importance of the independent variables in explaining something about the population. A variable which varies very little within the population may have a relatively small beta-weight, yet be quite important to include when making predictions for individuals.

Finally, note – just as a curiosity – that in the case of a simple regression, where the beta-weight of the sole independent variable is not of direct interest (since there are no other beta-weights to which it can be compared), it happens to coincide with the correlation between the dependent and independent variables.
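
A one-line check of this coincidence, using the same made-up data as in the earlier simple-regression sketches:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # made-up data again
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    b = (np.mean(x * y) - x.mean() * y.mean()) / np.var(x)   # simple-regression slope
    beta = b * x.std() / y.std()                             # beta-weight of the sole X
    assert np.isclose(beta, np.corrcoef(x, y)[0, 1])         # equals Corr(X,Y)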