Scanning the Internet for statistical inspiration one day, I found the BOSTON1.XLS dataset, which reports the median value of owner-occupied homes in about 500 U.S. census tracts in the Boston area, together with several variables that might help to explain the variation in median value across tracts. (In each of the "Boston" datasets, cell comments at the top of each column explain the units of measurement of the variables.) I looked at a scatterplot of median value against average number of rooms per home, and the first painfully obvious fact I noted was that the median-value variable had been artificially set to $50,000 for a number of observations. I sorted the dataset on the MV column and removed all observations with MV = 50 (MV is recorded in thousands of dollars), yielding BOSTON2.XLS.
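
For readers working outside the spreadsheet, the same filtering step might look like the following in Python. This is only a minimal sketch: the five rows are made up for illustration and are not taken from BOSTON1.XLS, and the tuple layout is an assumption.

```python
# Hypothetical rows: (tract id, MV in $1000s, RM = average rooms per home).
# The values are invented for illustration; they are not from BOSTON1.XLS.
tracts = [
    ("A", 24.0, 6.6),
    ("B", 50.0, 8.0),
    ("C", 21.6, 6.4),
    ("D", 50.0, 5.9),
    ("E", 34.7, 7.2),
]

CAP = 50.0  # MV = 50 corresponds to the artificial $50,000 value

# Keep only the tracts whose median value is not sitting at the cap.
kept = [row for row in tracts if row[1] != CAP]
dropped = len(tracts) - len(kept)

print(f"kept {len(kept)} tracts, dropped {dropped} with MV = {CAP:g}")
```

Filtering on the capped value has the same effect as sorting on MV and deleting the block of rows at 50.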

I took a quick look at the correlations between the pairs of variables, and noticed without surprise that the single highest correlation was between level of industrialization and air pollution, and the third-highest was between level of industrialization and tax rate.
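
A quick way to rank pairwise correlations outside a spreadsheet is sketched below. The data are synthetic, generated so that an industrialization measure is strongly tied to pollution and somewhat less strongly to tax rate; the labels INDUS, NOX, and TAX are assumed names for illustration, not column names confirmed by the dataset.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 400

# Synthetic stand-ins (assumption): NOX and TAX are both built from INDUS.
indus = rng.normal(size=n)
nox = 0.9 * indus + 0.3 * rng.normal(size=n)
tax = 0.7 * indus + 0.6 * rng.normal(size=n)
rm = rng.normal(size=n)

names = ["INDUS", "NOX", "TAX", "RM"]
data = np.column_stack([indus, nox, tax, rm])
corr = np.corrcoef(data, rowvar=False)  # columns are the variables

# List every pair once, strongest absolute correlation first.
pairs = sorted(
    (
        (abs(corr[i, j]), names[i], names[j])
        for i, j in itertools.combinations(range(len(names)), 2)
    ),
    reverse=True,
)
for r, a, b in pairs:
    print(f"{a:>5} vs {b:<5} |r| = {r:.2f}")
```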

I next performed a regression of median value onto all six potential independent variables. Due, most likely, to collinearity (i.e., the fact that industrialization, air pollution, and taxes are highly correlated and therefore tell pretty much the same story in the sample), level of industrialization and pollution contributed little to the model. I chose to drop them from further consideration. There was strong evidence that, in the model including the four remaining independent variables, tax rate had a non-zero effect. However, it played the relatively-least-important explanatory role, and removing it from the model didn't lessen the overall explanatory power of the model very much. I decided to drop it, as well, in order to simplify the rest of the analysis. (I clicked at the top of each of the three relevant columns on the "Data" page and selected "Delete".)
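
The effect of collinearity can be seen in a small simulation. The sketch below uses synthetic data (not the Boston data) in which a pollution variable closely tracks an industrialization variable, and compares the standard error of the industrialization coefficient with and without its near-duplicate in the model. The names INDUS and NOX, the helper `ols`, and all of the numbers are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 400

# Synthetic data (assumption): NOX is built to track INDUS closely.
indus = rng.normal(size=n)
nox = 0.95 * indus + 0.2 * rng.normal(size=n)
rm = rng.normal(size=n)
mv = 22.0 + 4.0 * rm - 1.0 * indus + 3.0 * rng.normal(size=n)


def ols(y, columns):
    """Least-squares fit with an intercept; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof       # estimated error variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


# INDUS is the coefficient at index 1 in both fits (index 0 is the intercept).
_, se_both = ols(mv, [indus, nox, rm])
_, se_alone = ols(mv, [indus, rm])

print(f"SE of INDUS coefficient with NOX in the model: {se_both[1]:.3f}")
print(f"SE of INDUS coefficient with NOX left out:     {se_alone[1]:.3f}")
```

When two predictors carry nearly the same information, the regression cannot tell their effects apart, so each coefficient is estimated imprecisely even if the pair jointly matters.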

After running a regression with only the three remaining independent variables, I looked at the RM residual plot (i.e., Charts / Residual plots / RM). It showed a definite U-shaped pattern. In addition, one observation was an extreme outlier on the graph (at the lower right). On further investigation, that outlier had by far the largest rooms/home level in the dataset. Perhaps it was a typographical error. I chose to exclude it, subject to further investigation, of course, if I had access to additional data on each tract. (I sorted the data on the RM column and deleted that observation, which sorted into the last row.)
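
A U-shaped residual plot has a simple numerical signature: after a straight-line fit, the residuals average above zero at both ends of the RM range and below zero in the middle. The sketch below checks this on synthetic data generated with a curved relationship; the numbers are invented for illustration and are not from the Boston files.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable
n = 300

# Synthetic data (assumption): MV depends on RM through a curve, plus noise.
rm = rng.uniform(4.0, 8.5, size=n)
mv = 20.0 + 5.0 * (rm - 6.0) + 3.0 * (rm - 6.0) ** 2 + rng.normal(scale=2.0, size=n)

# Fit a straight line in RM only, ignoring the curvature.
X = np.column_stack([np.ones(n), rm])
beta, *_ = np.linalg.lstsq(X, mv, rcond=None)
resid = mv - X @ beta

# Average the residuals within the lowest, middle, and highest thirds of RM.
order = np.argsort(rm)
low, mid, high = (resid[idx].mean() for idx in np.array_split(order, 3))

print(f"mean residual, low RM:    {low:+.2f}")
print(f"mean residual, middle RM: {mid:+.2f}")
print(f"mean residual, high RM:   {high:+.2f}")
```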

Finally, I added RM-SQUARED to the dataset, in order to capture the U-shaped component of the relationship. This yielded BOSTON.XLS.
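
Adding a squared term amounts to creating one new column and including it alongside the original. A minimal sketch on synthetic data follows; the helper `fit_r2` and all of the numbers are assumptions for illustration, not values from BOSTON.XLS.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is repeatable
n = 300

# Synthetic data (assumption): a curved relationship between RM and MV.
rm = rng.uniform(4.0, 8.5, size=n)
mv = 20.0 + 5.0 * (rm - 6.0) + 3.0 * (rm - 6.0) ** 2 + rng.normal(scale=2.0, size=n)

rm_squared = rm ** 2  # the new column


def fit_r2(y, columns):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2


beta_lin, r2_lin = fit_r2(mv, [rm])
beta_quad, r2_quad = fit_r2(mv, [rm, rm_squared])

print(f"R-squared with RM only:           {r2_lin:.3f}")
print(f"R-squared with RM and RM-SQUARED: {r2_quad:.3f}")
print(f"coefficient on RM-SQUARED:        {beta_quad[2]:.2f}")
```

A positive coefficient on the squared term corresponds to a curve that opens upward, which is what a U-shaped residual pattern suggests.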

“I next performed a regression of median value onto all six potential independent variables. Due, most likely, to collinearity (i.e., the fact that industrialization, air pollution, and taxes are highly correlated and therefore tell pretty much the same story in the sample), level of industrialization and pollution contributed little to the model.”

1. Where did I look to determine that they contributed little to the explanatory power of the model?

“I chose to drop them from further consideration. There was strong evidence that, in the model including the four remaining independent variables, tax rate had a non-zero effect.”

2. Where did I look to find that such strong evidence existed?

“However, it played the relatively-least-important explanatory role, and removing it from the model didn't lessen the overall explanatory power of the model very much.”

3. Where did I look to see that it played the least-important explanatory role?

4. Give a rough 95%-confidence interval for the predicted median home value in a tract where the average house size is 7 rooms, the pupil/teacher ratio is 20, and 5% of the families are of lower income status.

5. Give a 95%-confidence interval for the mean decrease in median market value within a census tract associated with each unit of increase in pupil/teacher ratio.

6. Sketch, as best you can, the relationship between average-number-of-rooms-per-residence and median-market-value as seen in the regression.

7. If you suspected that the impact of pupil/teacher ratio on median home value was greater for communities with relatively few low-income families than for communities with relatively many low-income families, what additional variable would you consider adding to your model?

8. (Continuation) What would you expect the sign of the coefficient of that new variable to be?

9. (Continuation) Does the data confirm your expectations?

(Answers to these questions are on the second tab of BOSTON.XLS.)