Mantis Technologies

Regression-Oriented Exam Questions

Please answer the following questions on the basis of what you learned from your analysis of the experimental data collected by Mantis Technologies.

Part I:

We’ll start with a simple subset of the Mantis Technologies data. After bringing the data into KStat, please insert a blank row at Row 32 of the “Data” page. This should leave you working with just the first 30 observations, i.e., office workers who were given classroom training. We’ll also just work with the three basic variables LnxProd, MSProd, and Sex.

(1) Give a 95%-confidence interval (using this sample of 30 workers) for the mean productivity of workers under the current (Microsoft-based) software setup.
(2) Roughly how large a sample would be required to make the preceding estimate with a margin of error (at the 95%-confidence level) of only 2 productivity points?
(3) Are the male office workers, on average, currently more or less productive under Microsoft Windows than the female workers? How strong is the evidence that the Windows-productivity difference between the sexes is a real difference?
(4) What fraction of the variance in Linux productivity after classroom training can be explained by the facts that Windows productivity varies across the workers, and that some of the workers are male and others female?
(5) What Linux productivity score would you predict for a female worker with a current Microsoft Windows productivity score of 50, after classroom training?
(6) Give a 95%-confidence interval for your answer to (5).
(7) What is the average productivity score using Linux (after classroom training) that you expect to get from workers with a current Microsoft Windows productivity score of 60?
(8) Give a 95%-confidence interval for your answer to (7).
(9) Give a 95%-confidence interval for the average increase in productivity score (in moving from Microsoft Windows to Linux) of workers put through the classroom training program.

After regressing LnxProd onto MSProd and Sex, look at the “Model Analysis” page.

Five workers are flagged as having high leverage. Why? (Answer as precisely as you can, in a single sentence and in terms of the real world; i.e., I don’t just want a definition of leverage.)
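
As a reminder of the arithmetic behind the margin-of-error question in Part I, a common large-sample approximation is n ≈ (z·s/E)^2, where s is the sample standard deviation and E is the target margin of error. The Python sketch below is illustrative only: the function name and the value s = 10 are hypothetical (they are not taken from the Mantis data), and it uses the normal critical value rather than the t value.

```python
import math
from statistics import NormalDist


def sample_size_for_margin(sample_sd, margin, confidence=0.95):
    """Approximate sample size needed for a given margin of error.

    Uses the normal approximation n = (z * s / E)^2, rounded up.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sample_sd / margin) ** 2)


# Hypothetical standard deviation of 10 points, target margin of 2 points.
print(sample_size_for_margin(10.0, 2.0))
```

Replace the hypothetical standard deviation with the one you obtain from the 30-worker sample.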

Part II:

Please restore the original dataset, with all 90 observations. You should be able to do this by simply removing the blank row you inserted earlier. For this section, we’ll ignore Sex and Age in our analyses.

(1) After regressing LnxProd onto MSProd, at what graph would you look (other than at the direct scatterplot of the two variables) to see if there is any sign of a nonlinear relationship involving MSProd?
(2) Create a new variable (from MSProd) to help capture the nonlinearity. Regress LnxProd onto both MSProd-based variables. How strong is the evidence that this new variable belongs in your model? (Cite a relevant significance level.)
(3) At what level of current (Microsoft Windows) productivity does the predicted Linux productivity peak?
(4) Using the model in (2), predict the Linux productivity for a worker with a current Windows productivity of 30.
(5) Taking both MSProd-based variables and the type of training program into account, how strong is the evidence that the type of training program makes a difference in predicting productivity using Linux? (Cite a relevant significance level, and say where you found it.)
(6) Using the same regression model as in (5), give a 95%-confidence interval for the average difference in Linux productivity you'd expect to see if workers were trained using videotapes rather than in a classroom.
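
If the new MSProd-based variable is taken to be the square of MSProd (an assumption; the questions above do not prescribe the transformation), the fitted model has the form b0 + b1·MSProd + b2·MSProd^2, and its prediction peaks where the derivative b1 + 2·b2·MSProd equals zero. The Python sketch below shows that calculation with hypothetical coefficients, not values estimated from the Mantis data.

```python
def quadratic_peak(b1, b2):
    """Return the x value at which b0 + b1*x + b2*x**2 reaches its maximum.

    A maximum exists only when the squared-term coefficient is negative.
    """
    if b2 >= 0:
        raise ValueError("no peak: the squared-term coefficient must be negative")
    # Setting the derivative b1 + 2*b2*x to zero gives x = -b1 / (2*b2).
    return -b1 / (2 * b2)


# Hypothetical coefficients for illustration only.
print(quadratic_peak(2.0, -0.02))
```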

Part III:

For the rest of our analysis, we'll throw out the last 30 observations, which correspond to videotape training. (Do this by going to your “Data” page and inserting a blank row at Row 62.) This should leave you with 60 observations. Since we’re just looking at classroom-based and Web-based training programs, the “Video” column of data will be irrelevant in all further analysis: The “Web” column will represent the difference between the two remaining types of program.

Still looking just at the two MSProd-based variables, and at the type of training program (classroom vs. Web-based), how strong is the evidence that there is a real difference between the effectiveness of the two types of training programs? (Cite a relevant significance level.)

You realize that there's still a chance to learn something useful: Perhaps the age and/or the sex of a worker plays a role in determining which of the two remaining training programs will be more effective at generating high Linux productivity for that worker. In order to investigate this issue, you create (by combining the age and sex variables with the training-program dummy variable) two new variables. You are now considering a regression model that regresses LnxProd onto the two MSProd-based variables, Age, Sex, the training-program dummy variable ("Web"), and your two newly created variables.

Looking at the regression results, you decide to remove the pure "Age" variable from your model. After further consideration, you also remove the newly created variable that examines whether sex plays a role in determining which program is more effective. You are left with five independent variables: the two MSProd-based variables, Sex, Web, and the variable used to examine whether the effectiveness of the two training programs varies according to the age of the worker.

Regress LnxProd onto these five variables. At what age should you draw the dividing line between workers sent to the classroom program, and workers trained using the Web-based program? (Give the "critical" age at which you will separate the workers, to two decimal places.)
Assume that you will assign Mantis' office workers to one of the two training programs using the age-based criterion you just specified. How many observations from your sample of 60 can you ultimately use to estimate the mean productivity gain that will be yielded by the transition from Microsoft Windows to Linux?
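
As a reminder of how an age-based dividing line arises, suppose the age-related variable is the product Web × Age (an assumption about how the variables were combined). The predicted advantage of Web-based over classroom training for a worker of a given age is then b_web + b_web_age·Age, which changes sign at Age = -b_web / b_web_age. The Python sketch below uses hypothetical coefficient values and names, not estimates from the Mantis data.

```python
def critical_age(b_web, b_web_age):
    """Age at which the predicted Web-vs-classroom difference is zero.

    Assumes the model contains b_web*Web + b_web_age*(Web*Age), so the
    Web effect at a given age is b_web + b_web_age*age.
    """
    if b_web_age == 0:
        raise ValueError("no dividing line: the Web effect does not vary with age")
    return -b_web / b_web_age


def web_effect(b_web, b_web_age, age):
    """Predicted Linux-productivity difference (Web minus classroom) at a given age."""
    return b_web + b_web_age * age


# Hypothetical coefficients for illustration only.
age_star = critical_age(8.0, -0.2)
print(round(age_star, 2))
print(web_effect(8.0, -0.2, 30) > 0)  # Web predicted better below the dividing age
print(web_effect(8.0, -0.2, 50) > 0)  # classroom predicted better above it
```

Which side of the dividing line goes to which program depends on the sign of the interaction coefficient in your own regression output.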