Statistics, quite simply, is about learning from sample data. You face a group of individuals – perhaps people, but maybe
cans of tomatoes, or automobiles, or fish in a lake, or even something as nebulous as calendar
weeks. This group is the **population** of interest to you. There is something you would like to
know about this population: How likely are the people to try a new product you are thinking of
bringing to the market? Are the cans properly sealed? What determines the cost of keeping the cars
in working condition? How many fish are there? What will demand for your product be in the weeks to
come? The answer to your question will guide you in making a decision.

If you could simply collect data from all the members of your population, you would know what you need to know. However, there can be many reasons why this might not be possible. It might be too expensive: If the potential purchasers of your product are all the adult consumers in the United States, the sheer size of the population makes contacting every individual prohibitively costly. It may be that collecting data does direct damage: If you open all the cans of tomatoes to test the contents, you have nothing left to sell. More subtly, the population is often somewhat ill-defined. If you manage a fleet of automobiles, you might consider the population of interest to be cars actually in your fleet in recent months, together with cars potentially in your fleet in the near future. In this case, some members of the population are not directly accessible to you.

**Indeed, this is the most common context for the use of statistics by managers: The recent past is viewed as a sample from a population that includes the soon-to-come future as well. To the extent that the future is not qualitatively very different from the recent past, we can use the sample data to make inferences about what the future will look like.**

For any of these reasons, you might find yourself unable to examine all members of the
population directly. So, you content yourself with collecting data from a **sample** of
individuals drawn from the population. Your hope is that the sample is representative of the
population as a whole, and therefore anything learned from the sample will give you information
concerning the entire population, and will consequently help you make your decisions.

All statistical studies are carried out by following some **statistical procedure**, and
every statistical procedure has three elements: You must specify *how* the sample data will be
collected and *how much* data will be collected, and *what* you’ll do with the data
once it’s in hand.

As a simple example, consider the following *estimation procedure*:

- An individual will be selected at random from the population, with every member of the population having an equal chance to be chosen. Relevant information will be obtained from the selected individual, who then will be returned to the population. (This method of selecting individuals is known as *simple random sampling with replacement*.)
- The process described above will be repeated 5 times.
- Assume, for purposes of this example, that the individuals are people, and that we obtain from each sampled individual his or her gross income over the previous twelve months. We will then average the five observations, and use this average (the *sample mean*) as an estimate of the average across all members of the population (the *population mean*).

If you were facing a decision problem in which the “best” decision depended on the population mean income, you might now use your estimate to guide your decision.
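As a quick illustration of this procedure, here is a minimal Python sketch. The population of incomes is hypothetical, invented purely for the example:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of annual incomes (invented for illustration)
population = [38000, 41000, 52000, 47000, 61000, 35000, 44000, 58000]

# Simple random sampling WITH replacement: each draw is independent,
# and every member has the same chance of being chosen on every draw.
sample = [random.choice(population) for _ in range(5)]

# The sample mean is our estimate of the population mean.
estimate = sum(sample) / len(sample)
print(sample, estimate)
```

Running this once carries out the procedure once; each run may, of course, produce a different estimate.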

In the example above, the estimate we obtain might – if we are incredibly lucky – be
precisely equal to the population mean. However, it will probably be the case that the sample is
not perfectly representative of the population, and our sample mean is somewhat different from the
true population mean. The possibility that the sample might fail to be perfectly representative is
called **exposure to sampling error**. How far off is our estimate from the truth? Using only
the data at hand, we can’t say. (If we could, we’d simply correct the estimate!)
Instead, we focus our attention on the procedure used to make the estimate, and we determine how
exposed to sampling error we were in using that procedure.

Consider a population of 70 individuals, for which you wish to estimate the mean annual income, μ. Assume that you have decided to use the following estimation procedure to make your estimate:

- simple random sampling with replacement
- a sample size of 5
- compute the sample mean, write it in the box below, and use it as the estimate:

After you carry out this procedure, the box will contain a number, e.g., $47,530.

However, imagine yourself standing at **a particular moment in time**: the procedure has been chosen, but not yet carried out.

You don't see a specific number, since the procedure hasn't yet been carried out. But you see
more than an empty box, since you know that there will soon be a specific number there: You see the
*potential* for a number to appear. Indeed, there are 70⁵ = 1,680,700,000 different
ways the procedure might eventually play out, each equally likely, and each yielding a specific
number in the box. Therefore every possible value of the eventual content of the box has a
well-defined probability: some multiple of 1/1,680,700,000.
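To make this concrete on a scale where every outcome can be listed, consider a hypothetical toy version: a population of 3 incomes and a sample size of 2, giving 3² = 9 equally likely draws. The numbers are invented for illustration:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical toy population: 3 incomes, sample size 2 (with replacement),
# so there are 3**2 = 9 equally likely ways the procedure can play out.
incomes = [30, 40, 80]
outcomes = list(product(incomes, repeat=2))   # all 9 ordered draws

# Distribution of the sample mean: the "eventual content of the box",
# mapping each possible value to its probability.
counts = Counter(sum(o) / 2 for o in outcomes)
dist = {m: Fraction(c, len(outcomes)) for m, c in sorted(counts.items())}
print(dist)
```

Each probability is a multiple of 1/9, and the probabilities sum to 1, exactly as the full 70⁵-outcome version would behave at a much larger scale.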

Such a potential, having specific probabilities of yielding specific values, is called a **random variable**.

**The Fundamental Concept (underlying all statistical analysis):** Anything of
interest that can be said about a statistical procedure is equivalent to some mathematical
statement concerning the random variable which represents the end result of the procedure. In
particular, our exposure to sampling error when using a procedure can be measured by studying the
associated random variable.

At some point in your past, you have likely heard the fields of probability and statistics glued together, as “probabilityandstatistics.” The two fields are actually quite different: The domain of probability is the study of uncertainty, and the domain of statistics is the use of sample data to make inferences about a population. However, sampling involves randomness, i.e., uncertainty, and therefore the tools from probability can be applied to help us understand our exposure to sampling error.

Imagine, for example, that we had planned to make our estimate of the population mean annual income using the following alternative procedure:

- simple random sampling with replacement
- a sample size of 5
- take the smallest of the five observations as the estimate, and write it in the box below:

Undoubtedly you would be somewhat uncomfortable about using this estimation procedure. But why?
It's not that the procedure *will* yield too low an estimate: If the sample happens, by
chance, to consist of five individuals who all earn more than the actual population mean, this
procedure will actually yield too high an estimate (and indeed an estimate closer to the truth than the one given by the earlier procedure, which uses the sample mean). Rather, your
discomfort is because you'd *expect* the procedure to yield an underestimate. Let Y be the
random variable corresponding to the end result of the procedure. Applying the Fundamental Concept,
the flaw in this procedure can be stated very precisely: As long as there are at least two individuals in the population with different incomes, E[Y] < μ.

Call the end result of the original procedure X. In developing the language of estimation, we'll see that X has an expected value of precisely μ, no matter what the distribution of incomes across the population might be. The Fundamental Concept thus helps us to see when one statistical procedure has more-desirable properties than some other procedure.
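This comparison can be checked by exact enumeration over an invented toy population (the incomes and sample size below are assumptions, not from the text): the sample-mean procedure has expected value exactly μ, while the smallest-observation procedure has expected value below μ.

```python
from itertools import product

# Hypothetical toy population; sample size 2, sampling with replacement.
incomes = [30, 40, 80]
mu = sum(incomes) / len(incomes)                      # population mean

draws = list(product(incomes, repeat=2))              # all equally likely outcomes
e_mean = sum(sum(d) / 2 for d in draws) / len(draws)  # E[X]: expected sample mean
e_min = sum(min(d) for d in draws) / len(draws)       # E[Y]: expected smallest observation

print(mu, e_mean, e_min)
```

Because the population contains individuals with different incomes, the enumeration confirms E[X] = μ and E[Y] < μ.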

How subject are we to sampling error when we use the original procedure to make an estimate? We “expect”
the procedure to give us the correct result. But any one time that the procedure is carried out, the actual
result might differ somewhat from μ. The standard deviation of X,
called *the standard error of the estimate*, measures how far from μ the procedure’s result
will “typically” be, and hence is a direct measure of our exposure to sampling error.
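For simple random sampling with replacement, a standard result is that the standard error of the sample mean equals σ/√n, where σ is the population standard deviation and n is the sample size. The sketch below checks this by exact enumeration on a hypothetical toy population (the numbers are invented for illustration):

```python
import math
from itertools import product

# Hypothetical toy population; exact enumeration checks SE = sigma / sqrt(n).
incomes = [30, 40, 80]
n = 2
mu = sum(incomes) / len(incomes)
sigma2 = sum((x - mu) ** 2 for x in incomes) / len(incomes)  # population variance

draws = list(product(incomes, repeat=n))
means = [sum(d) / n for d in draws]
# Standard deviation of the random variable X (the sample mean), computed
# directly from its full distribution of equally likely outcomes.
se_exact = math.sqrt(sum((m - mu) ** 2 for m in means) / len(draws))

print(se_exact, math.sqrt(sigma2 / n))
```

The two printed values agree, and a larger sample size n would shrink both: collecting more data reduces exposure to sampling error.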
