Statistics, quite simply, is about learning from sample data. You face a group of individuals – perhaps people, but maybe cans of tomatoes, or automobiles, or fish in a lake, or even something as nebulous as calendar weeks. This group is the population of interest to you. There is something you would like to know about this population: How likely are the people to try a new product you are thinking of bringing to the market? Are the cans properly sealed? What determines the cost of keeping the cars in working condition? How many fish are there? What will demand for your product be in the weeks to come? The answer to your question will guide you in making a decision.
If you could simply collect data from all the members of your population, you would know what you need to know. However, there can be many reasons why this might not be possible. It might be too expensive: If the potential purchasers of your product are all the adult consumers in the United States, the sheer size of the population makes contacting every individual prohibitively costly. It may be that collecting data does direct damage: If you open all the cans of tomatoes to test the contents, you have nothing left to sell. More subtly, the population is often somewhat ill-defined. If you manage a fleet of automobiles, you might consider the population of interest to be cars actually in your fleet in recent months, together with cars potentially in your fleet in the near future. In this case, some members of the population are not directly accessible to you.
Indeed, this is the most common context for the use of statistics by managers: The recent past is viewed as a sample from a population that includes the soon-to-come future as well. To the extent that the future is not qualitatively very different from the recent past, we can use the sample data to make inferences about what the future will look like.
For any of these reasons, you might find yourself unable to examine all members of the population directly. So, you content yourself with collecting data from a sample of individuals drawn from the population. Your hope is that the sample is representative of the population as a whole, and therefore anything learned from the sample will give you information concerning the entire population, and will consequently help you make your decisions.
All statistical studies are carried out by following some statistical procedure, and every statistical procedure has three elements: You must specify how the sample data will be collected, how much data will be collected, and what you’ll do with the data once it’s in hand.
As a simple example, consider the following estimation procedure:
If you were facing a decision problem in which the “best” decision depended on the population mean income, you might now use your estimate to guide your decision.
In the example above, the estimate we obtain might – if we are incredibly lucky – be precisely equal to the population mean. However, it will probably be the case that the sample is not perfectly representative of the population, and our sample mean is somewhat different from the true population mean. The possibility that the sample might fail to be perfectly representative is called exposure to sampling error. How far off is our estimate from the truth? Using only the data at hand, we can’t say. (If we could, we’d simply correct the estimate!) Instead, we focus our attention on the procedure used to make the estimate, and we determine how exposed to sampling error we were in using that procedure.
Consider a population of 70 individuals, for which you wish to estimate the mean annual income, μ. Assume that you have decided to use the following estimation procedure to make your estimate:
After you carry out this procedure, the box will contain a number, e.g., $47,530.
However, imagine yourself standing at a particular moment in time, after you have committed to carrying out the procedure but before it is actually implemented. Peer into your future, and look inside the box. What do you see?
You don't see a specific number, since the procedure hasn't yet been carried out. But you see more than an empty box, since you know that there will soon be a specific number there: You see the potential for a number to appear. Indeed, there are 70^5 = 1,680,700,000 different ways the procedure might eventually play out, and each will yield a specific number in the box. Each of these ways has probability 1/1,680,700,000 of being the one that actually occurs, so we can assert that the eventual content of the box has a specific probability of being each of many different values.
Such a potential, having specific probabilities of yielding specific values, is called a random variable.
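To make the idea of a random variable concrete, the sketch below lists every way a sampling procedure can play out for a tiny made-up population and tallies the probability of each possible sample mean. It assumes individuals are drawn one at a time, with replacement, and that every ordered sample is equally likely; this is the reading suggested by the count 70^5 for a sample of five from 70 individuals. The three incomes and the function name are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sample_mean_distribution(population, n):
    """Exact distribution of the sample mean over all ordered samples
    of size n drawn with replacement (assumed equally likely)."""
    outcomes = list(product(population, repeat=n))
    p_each = Fraction(1, len(outcomes))
    dist = Counter()
    for sample in outcomes:
        dist[Fraction(sum(sample), n)] += p_each
    return dict(dist)


# Hypothetical toy population of three incomes; samples of size two.
incomes = [30_000, 50_000, 70_000]
dist = sample_mean_distribution(incomes, 2)

# Each line: a value the box might hold, and the probability of that value.
for value, prob in sorted(dist.items()):
    print(f"{float(value):>10,.0f}   {prob}")

# The same counting rule applied to 70 individuals and a sample of five.
print(f"{70 ** 5:,} equally likely ways the procedure can play out")
```

Several different samples can produce the same sample mean, which is why the possible values of the random variable need not all be equally likely even though the individual samples are.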
The Fundamental Concept (underlying all statistical analysis): Anything of interest that can be said about a statistical procedure is equivalent to some mathematical statement concerning the random variable which represents the end result of the procedure. In particular, our exposure to sampling error when using a procedure can be measured by studying the associated random variable.
At some point in your past, you have likely heard the fields of probability and statistics glued together, as “probabilityandstatistics.” The two fields are actually quite different: The domain of probability is the study of uncertainty, and the domain of statistics is the use of sample data to make inferences about a population. However, sampling involves randomness, i.e., uncertainty, and therefore the tools from probability can be applied to help us understand our exposure to sampling error.
Imagine, for example, that we had planned to make our estimate of the population mean annual income using the following alternative procedure:
Undoubtedly you would be somewhat uncomfortable about using this estimation procedure. But why? It's not that the procedure will always yield too low an estimate: If the sample happens, by chance, to consist of five individuals who all earn more than the actual population mean, this procedure will actually yield too high an estimate (and indeed will yield an estimate closer to the truth than the procedure listed earlier, which uses the sample mean as the estimate). Rather, your discomfort arises because you'd expect the procedure to yield an underestimate. Let Y be the random variable corresponding to the end result of the procedure. Applying the Fundamental Concept, the flaw in this procedure can be stated very precisely: As long as there are at least two individuals in the population with different incomes, E[Y] < μ.
Call the end result of the original procedure X. In developing the language of estimation, we'll see that X has an expected value of precisely μ, no matter what the distribution of incomes across the population might be. The Fundamental Concept thus helps us to see when one statistical procedure has more-desirable properties than some other procedure.
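The comparison between the two procedures can be checked exactly on a small example. The sketch below is illustrative only: it assumes sampling with replacement, and, because the alternative procedure is not spelled out here, it assumes that procedure reports the lowest income in the sample (one procedure consistent with the behavior described). The population values and function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_value(population, n, estimator):
    """Exact expected value of estimator(sample), averaging over all
    equally likely ordered samples of size n drawn with replacement."""
    samples = list(product(population, repeat=n))
    total = sum(Fraction(estimator(s)) for s in samples)
    return total / len(samples)


def sample_mean(sample):
    return Fraction(sum(sample), len(sample))


# Hypothetical population of four incomes; samples of size three.
incomes = [20_000, 35_000, 50_000, 95_000]
mu = Fraction(sum(incomes), len(incomes))

e_x = expected_value(incomes, 3, sample_mean)  # original procedure
e_y = expected_value(incomes, 3, min)          # ASSUMED alternative: lowest income

print(f"population mean  = {float(mu):,.2f}")
print(f"E[X] sample mean = {float(e_x):,.2f}")
print(f"E[Y] lowest      = {float(e_y):,.2f}")
```

Under these assumptions E[X] equals μ exactly, while E[Y] falls below μ, even though individual samples can still produce a value of Y above μ.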
How subject are we to sampling error when we use the original procedure to make an estimate? We “expect” the procedure to give us the correct result. But any one time that the procedure is carried out, the actual result might differ somewhat from μ. The standard deviation of X, called the standard error of the estimate, measures how far from μ the procedure’s result will “typically” be, and hence is a direct measure of our exposure to sampling error.
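A rough simulation can show what the standard error measures. The sketch below assumes a made-up population of 70 incomes and a procedure that draws five individuals with replacement and reports their mean. Under that assumption the standard deviation of X is σ/√n, where σ is the population standard deviation and n is the sample size. All numbers are generated for illustration and do not come from real data.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of 70 annual incomes (made-up values).
incomes = [round(rng.lognormvariate(10.7, 0.5)) for _ in range(70)]
mu = statistics.fmean(incomes)
sigma = statistics.pstdev(incomes)  # population standard deviation

n = 5  # assumed sample size
standard_error = sigma / math.sqrt(n)  # valid for sampling with replacement

# Carry out the procedure many times and record each estimate.
estimates = [
    statistics.fmean(rng.choices(incomes, k=n))  # choices() samples with replacement
    for _ in range(20_000)
]
simulated_sd = statistics.pstdev(estimates)

print(f"population mean          = {mu:,.0f}")
print(f"average of the estimates = {statistics.fmean(estimates):,.0f}")
print(f"sigma / sqrt(n)          = {standard_error:,.0f}")
print(f"SD of the estimates      = {simulated_sd:,.0f}")
```

The average of the estimates should land near μ, and the spread of the estimates should land near σ/√n, which is the sense in which the standard error describes how far a single estimate will "typically" be from the truth.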