*Regression analysis* is a statistical technique for studying linear relationships. ^{[1]} It begins by supposing a general form for
the relationship, known as the *regression model*:

Y = α + β_{1}X_{1} + ... + β_{k}X_{k} + ε .

*Example:* In the motorpool case, the manager of the
motorpool considers the model

Cost = α + β_{1}Mileage + β_{2}Age + β_{3}Make + ε .

Y is the *dependent variable*, representing a quantity that varies from individual to
individual throughout the population, and is the primary focus of interest. X_{1},...,
X_{k} are the *explanatory variables* (the so-called “independent
variables”), which also vary from one individual to the next, and are thought to be related
to Y. Finally, ε is the *residual term*, which represents the composite effect of all
other types of individual differences not explicitly identified in the model. ^{[2]}

Besides the model, the other input into a regression analysis is some relevant sample data, consisting of the observed values of the dependent and explanatory variables for a sample of members of the population.

The primary result of a regression analysis is a set of estimates of the *regression
coefficients* α, β_{1},..., β_{k}. These estimates are made by
finding values for the coefficients that make the average residual 0, and the standard deviation of
the residual term as small as possible. The result is summarized in the *prediction
equation*:

Y_{pred} = a + b_{1}X_{1} + ... + b_{k}X_{k} .

*Example:* Fitting the model above to the motorpool data, we obtain:

Cost_{pred} = 107.34 + 29.65 Mileage + 73.96 Age + 47.43 Make .

(Dive down for further discussion of the assumptions underlying regression analysis, or examine a workbook which illustrates some of the underlying computations.)
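To make the fitting step concrete, here is a minimal sketch of a least-squares fit using only the Python standard library. The data values and the helper names (`solve`, `fit_ols`, `predict`) are hypothetical illustrations, not the motorpool sample or any package's API. The sketch solves the normal equations, which is one common way to find the coefficients that minimize the sum of squared residuals; with an intercept in the model, the fitted residuals then average to 0.

```python
# Minimal least-squares sketch using only the standard library.
# The data are hypothetical illustration values, not the motorpool sample.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Return [a, b_1, ..., b_k] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def predict(coefs, x):
    """Evaluate the prediction equation a + b_1 x_1 + ... + b_k x_k."""
    return coefs[0] + sum(b * xi for b, xi in zip(coefs[1:], x))

# Hypothetical sample: (mileage in thousands, age in years) -> annual cost
X = [(10, 1), (15, 2), (8, 3), (20, 1), (12, 4), (18, 2)]
y = [420, 610, 480, 730, 640, 700]

coefs = fit_ols(X, y)
residuals = [yi - predict(coefs, xi) for xi, yi in zip(X, y)]
mean_residual = sum(residuals) / len(residuals)
print("coefficients:", [round(c, 3) for c in coefs])
print("mean residual:", round(mean_residual, 9))
```

In practice a statistics package would do this fit and report the standard errors as well; the hand-rolled solver is here only to show what is being computed.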

Typically, a regression analysis is done for one of two purposes: In order to predict the value of the dependent variable for individuals for whom some information concerning the explanatory variables is available, or in order to estimate the effect of some explanatory variable on the dependent variable.

If we know the value of several explanatory variables for an individual, but do not know the value of that individual’s dependent variable, we can use the prediction equation (based on a model using the known variables as its explanatory variables) to estimate the value of the dependent variable for that individual.

In order to see how much our prediction can be trusted, we use the *standard error of the
prediction* ^{[3]} to construct confidence
intervals for the prediction. (Examine a workbook that provides
a detailed discussion of the standard error of the prediction.)

*Example:* In order to predict the next twelve months’ maintenance and repair
expenses for a specific one-year-old Ford currently in the motorpool, we’d first perform a
regression analysis using age and make as the explanatory variables:

Cost_{pred} = 705.66 + 8.53 Age - 54.27 Make .

Our prediction will then be $714.19, and the margin of error (at the 95%-confidence level) for the prediction is 2.1788 × 124.0141 = $270.20 .
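The arithmetic behind this prediction can be sketched as follows. The coefficients, the multiplier 2.1788, and the standard error are the figures quoted in the example; the 0/1 coding of Make with Ford = 0 is an assumption, chosen because it reproduces the quoted prediction of $714.19.

```python
# Coefficients quoted in the text for the age-and-make model.
a, b_age, b_make = 705.66, 8.53, -54.27

# Assumption: Make is a 0/1 indicator with Ford = 0 (this coding reproduces
# the quoted prediction for a one-year-old Ford).
age, make = 1, 0

prediction = a + b_age * age + b_make * make

t_multiplier = 2.1788      # 95% multiplier quoted in the text
se_prediction = 124.0141   # standard error of the prediction, from the text
margin = t_multiplier * se_prediction

low, high = prediction - margin, prediction + margin
print(f"prediction: ${prediction:.2f} +/- ${margin:.2f}")
print(f"95% interval: ${low:.2f} to ${high:.2f}")
```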

If our goal is not to make a prediction for an individual, but rather to estimate the mean value
of the dependent variable across a large pool of similar individuals, we use the *standard error
of the estimated mean* instead when computing confidence intervals.

*Example:* Our estimate of the average cost of keeping one-year-old Fords working is
$714.19, with a margin of error of 2.1788 × 41.573 = $90.58 .
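The two standard errors in these examples are related: the squared standard error of the prediction is the sum of the squared standard error of the regression and the squared standard error of the estimated mean. The sketch below uses the figures quoted above to reproduce the margin for the mean and to back out the standard error of the regression. That last value is implied by the quoted figures under this standard decomposition; it is not reported in the text.

```python
import math

t_multiplier = 2.1788      # 95% multiplier quoted in the text
se_prediction = 124.0141   # for an individual prediction, from the text
se_mean = 41.573           # for the estimated mean, from the text

margin_mean = t_multiplier * se_mean

# Decomposition: se_prediction**2 = se_regression**2 + se_mean**2.
# The value below is implied by the quoted figures, not quoted itself.
se_regression = math.sqrt(se_prediction**2 - se_mean**2)

print(f"margin for the mean: ${margin_mean:.2f}")
print(f"implied standard error of the regression: {se_regression:.2f}")
```

The margin for an individual is much wider than the margin for the mean because the individual's own residual contributes most of the uncertainty.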

In order to estimate the “pure” effect of some explanatory variable on the dependent variable, we want to control for as many other effects as possible. That is, we’d like to see how our prediction would change for an individual if this explanatory variable were different, while all other aspects of the individual were kept the same. In order to do this, we should always use the most complete model available, i.e., we should include all other relevant factors as additional explanatory variables. (Dive down for further discussion of when to use the “most complete” model, and when to use a smaller model.)

Our estimate of the impact of a unit difference in the targeted explanatory variable is its
coefficient in the prediction equation. The extent to which our estimate can be trusted is measured
by the *standard error of the coefficient*.

*Example:* Using the full regression model, we estimate that the mean marginal
maintenance and repair cost associated with driving one of the cars in the motorpool an additional
1000 miles is $29.65, with a margin of error in the estimate of 2.2010 × 3.915 = $8.62 . To
better understand why we use the most complete model available, note that any “one of the
cars” has a particular age and make, and we want to hold those constant while considering the
incremental effect of another 1000 miles of driving.
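A short sketch of what “holding age and make constant” means in the full prediction equation: two predictions that differ only by 1000 miles differ by exactly the mileage coefficient. The coefficients and standard error are those quoted in the text; the particular car (15 thousand miles, two years old, Make = 0) is a hypothetical illustration, and Mileage is taken to be in thousands of miles, as the per-1000-mile interpretation implies.

```python
def cost_pred(mileage, age, make):
    """Full prediction equation quoted in the text (Mileage in thousands)."""
    return 107.34 + 29.65 * mileage + 73.96 * age + 47.43 * make

# Hypothetical car; only mileage changes between the two predictions.
base = cost_pred(mileage=15, age=2, make=0)
plus_1000_miles = cost_pred(mileage=16, age=2, make=0)
effect = plus_1000_miles - base   # equals the mileage coefficient

t_multiplier = 2.2010   # 95% multiplier quoted in the text
se_coef = 3.915         # standard error of the mileage coefficient
margin = t_multiplier * se_coef

print(f"effect of 1000 more miles: ${effect:.2f} +/- ${margin:.2f}")
```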

Given a specific model, one might wonder whether a particular one of the explanatory variables really “belongs” in the model; equivalently, one might ask if this variable has a true regression coefficient different from 0 (and therefore would affect predictions).

We take the standard approach of classical hypothesis testing: In order to see if there is
evidence supporting the inclusion of the variable in the model, we start by hypothesizing that it
does *not* belong, i.e., that its true regression coefficient is 0.

Dividing the estimated coefficient by the standard error of the coefficient yields the
*t-ratio* of the variable, which simply shows how many standard deviations’ worth of sampling
error would have to have occurred in order to yield an estimated coefficient so different from the
hypothesized true value of 0. We then ask how likely it is to have experienced so much sampling
error: This yields the significance level of the sample data with respect to the null hypothesis
that 0 is the true value of the coefficient. The closer this significance level is to 0%, the
stronger is the evidence *against* the null hypothesis, and therefore the stronger the
evidence is that the true coefficient is indeed different from 0, i.e., that the variable does
belong in the model.

*Example:* In the full model, the significance level of the t-ratio of mileage is
0.0011%. We have overwhelmingly strong evidence that mileage has a true non-zero effect in the
model. On the other hand, the significance level of the t-ratio of make is only 12.998%. We have
here only a little bit of evidence that the true difference between Fords and Hondas is nonzero.
(If we really wish to make a case against Hondas, we’ll require that the estimated difference
persist as the sample size is increased, i.e., as more evidence is collected.)
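As a rough check on the mileage figures, the t-ratio and its significance level can be sketched with the standard library alone. The coefficient and standard error are those quoted earlier for mileage. The 11 degrees of freedom are an assumption, inferred from the 2.2010 multiplier used for the full model. The function name `two_sided_significance` and the Simpson's-rule integration are illustrative choices; a statistics package would normally supply this value directly.

```python
import math

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_significance(t_ratio, df, steps=2000):
    """P(|T| >= |t_ratio|), via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t_ratio)
    h = upper / steps                      # steps must be even
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3           # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_half)

t_ratio = 29.65 / 3.915                    # coefficient / standard error
p = two_sided_significance(t_ratio, df=11) # df=11 is an inferred assumption

print(f"t-ratio: {t_ratio:.2f}")
print(f"significance level: {100 * p:.4f}%")
```

Because the inputs are rounded, the result should only be expected to agree with the quoted significance level approximately.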

Why does the dependent variable take different values for different members of the population?
There are two possible answers: “Because the explanatory variables vary,” and
“Because things still sitting in the residual term vary.” The total variation seen in
the dependent variable can be broken down into these two components, and the *coefficient of
determination* ^{[4]} is the fraction of the total
variation that is explained by the model, i.e., the fraction explained by variation in the
explanatory variables. Subtracting the coefficient of determination from 100% indicates the
fraction of variation in the dependent variable that the model fails to explain.

*Example:* Mileage alone can explain 56% of the observed car-to-car
variation in annual maintenance costs. Age alone can’t explain much of
anything. But variations in mileage and age together can explain over 78% of the variation in
costs. The reason they can explain more together than the sum of what they can explain separately
is that mileage masks the effect of age in our data. When both are included in the regression
model, the effect of mileage is separated from the effect of age, and the latter effect then can be
seen.
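The breakdown of variation can be sketched as below. The functions compute the coefficient of determination as one minus the unexplained share of the total variation, and the adjusted version that footnote [4] recommends. The function names and all input values are hypothetical illustrations; the motorpool data are not reproduced here.

```python
def r_squared(y, y_pred):
    """Fraction of the total variation in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)             # total variation
    ss_resid = sum((v - p) ** 2 for v, p in zip(y, y_pred))  # unexplained part
    return 1 - ss_resid / ss_total

def adjusted_r_squared(r2, n, k):
    """Adjust for sample size n and number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def as_reported(adjusted):
    """Read a slightly negative adjusted value as 0%, per footnote [4]."""
    return max(0.0, adjusted)

# Hypothetical observed values and predictions, for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y, y_pred)
print(f"R-square: {100 * r2:.1f}%  unexplained: {100 * (1 - r2):.1f}%")
print(f"adjusted (n=20, k=3, R-square=0.80): {adjusted_r_squared(0.80, 20, 3):.4f}")
```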

A natural follow-up is to ask what the *relative* importance of variation in the
explanatory variables is in explaining observed variation in the dependent variable. The
*beta-weights* ^{[5]} of the explanatory variables
can be compared to answer this question. (Dive down for a discussion of
the distinction between t-ratios and beta-weights.)

*Example:* In the full model, the beta-weight of mileage is roughly twice that of
age, which in turn is more than twice that of make. If asked, “Why does the annual
maintenance cost vary from car to car?” one would answer, “Primarily because the cars
vary in how far they’re driven. Of secondary explanatory importance is that they vary in age.
Trailing both is the fact that some are Fords and others Hondas, i.e., that make varies across the
fleet.”
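The text does not spell out how a beta-weight is computed. The sketch below assumes the usual definition of a standardized coefficient: the estimated coefficient multiplied by the sample standard deviation of its explanatory variable and divided by the sample standard deviation of the dependent variable. The data, the coefficients, and the function name `beta_weight` are all hypothetical, chosen only to show the ranking step.

```python
import statistics

def beta_weight(coef, x_values, y_values):
    """Standardized coefficient: coef * sd(X) / sd(Y) (assumed definition)."""
    return coef * statistics.stdev(x_values) / statistics.stdev(y_values)

# Hypothetical sample and hypothetical fitted coefficients.
columns = {
    "mileage": [10, 15, 8, 20, 12, 18],   # thousands of miles
    "age": [1, 2, 3, 1, 4, 2],            # years
}
cost = [420, 610, 480, 730, 640, 700]
coefs = {"mileage": 25.0, "age": 60.0}

weights = {name: beta_weight(coefs[name], xs, cost) for name, xs in columns.items()}

# Rank by magnitude, since the sign only mirrors the coefficient's sign.
ranking = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
for name in ranking:
    print(f"{name}: beta-weight {weights[name]:.3f}")
```

Scaling by the standard deviations is what makes the weights comparable across variables measured in different units, such as thousands of miles and years.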

The six “steps” to interpreting the result of a regression analysis are:

- Look at the prediction equation to see an estimate of the relationship.
- Refer to the standard error of the prediction (in the appropriate model) when making predictions for individuals, and the standard error of the estimated mean when estimating the average value of the dependent variable across a large pool of similar individuals.
- Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.
- Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each of the explanatory variables in the model.
- Use the “adjectived” coefficient of determination to measure the potential explanatory power of the model.
- Compare the beta-weights of the explanatory variables in order to rank them in order of explanatory importance.

[1] Why is it valuable to be able to unravel linear
relationships? Some interesting relationships *are* linear, essentially all managerial
relationships are at least locally linear, and several modeling tricks help to transform the most
commonly encountered nonlinear relationships into linear relationships.

[2] The dependent and explanatory variables, as well as the residual term, can be thought of as random variables resulting from the random selection of a single member of the population, i.e., as quantities that vary from one individual to the next.

[3] The standard error of the prediction takes into account
both our exposure to error in using a value of 0 for the individual’s residual when making
the prediction (measured by the *standard error of the regression*), and our exposure to
sampling error in estimating the regression coefficients (measured by the *standard error of the
estimated mean*).

[4] The coefficient of determination is sometimes called the “R-square” of the model. Some computer packages will offer two coefficients of determination, one with an adjective – “adjusted”, “corrected”, or “unbiased” – in front. Given the choice, use the one with the adjective. If it is somewhat less than zero, read it as 0%.

[5] The beta-weight of an explanatory variable has the same sign as the estimated coefficient of that variable. It is the magnitude, i.e., absolute value, of the beta-weight that is of relevance.