Hypothesis Testing

At times we wish to examine statistical evidence, and determine whether it supports or contradicts a claim that has been made (or that we might wish to make) concerning the entire population. This is done in a somewhat asymmetric fashion, analogous to the approach taken in the British system of criminal justice (adopted throughout most of the modern world): We take a statement, presume it to be “innocent,” i.e., true, and ask how strongly the evidence contradicts our initial presumption.

The evidence is viewed as being the result of some statistical procedure. We calculate the probability that the same procedure – if carried out in a world where the statement really is true – would, purely due to sampling error, provide evidence at least as contradictory to the statement on trial as is the evidence we have in fact seen.

This probability, called the significance level of the sample data with respect to the statement, is then interpreted. If it is large, we conclude that the evidence against the statement is weak, since we must acknowledge that, in a presumed world in which the statement is true, our studies would frequently provide such evidence purely due to our exposure to sampling error. However, if this probability is small, we conclude that the evidence at hand is quite different from that which we would expect to see if the statement were true, i.e., we conclude that the evidence strongly argues against the statement’s truth, and we lean towards finding the statement “guilty.”

Just as in a criminal trial, we never conclude that the statement is “innocent” – at most, we find it “not guilty.” In other words, our analysis leaves us in one of two camps: We have strong evidence that the original statement is false, or we do not have such evidence. Therefore, if we wish to make an affirmative case for a claim, we are forced to take the opposite of that claim as the statement we put on trial. Only in this way might we conclude, at the end, that the data – if strong evidence against the claim on trial – serves to support the original claim.

The Goal of Hypothesis Testing

A statement has been made: This is a good investment. The forfeiture rate on loans is more than 10%. The firm does engage in salary discrimination. The food additive does not cause cancer.

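You wonder whether to believe the statement. (Presumably, whether you believe it or not will affect your choice of action.) Your “belief” decision will depend on three factors: your prior beliefs about the statement, the costs of reaching a wrong conclusion, and the strength of the evidence provided by the data.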

Example: You’re the commercial loan officer at a bank, in the process of reviewing a loan application recently filed by a local firm. Examining the firm’s list of assets, you notice that the largest single item is $3 million in accounts receivable. You’ve heard enough scare stories, about loan applicants “manufacturing” receivables out of thin air, that it seems appropriate to check whether these receivables actually exist. You send a junior associate, Mary, out to meet with the firm’s credit manager.

Whatever report Mary brings back, your final decision of whether to believe that the claimed receivables exist will be influenced by the firm’s general reputation, and by any personal knowledge you may have concerning the credit manager. As well, the consequences of being wrong – possibly approving an overly-risky loan if you decide to believe the receivables are there and they’re not, or alienating a commercial client by disbelieving the claim and requiring an outside credit audit of the firm before continuing to process the application, when indeed the receivables are as claimed – will play a role in your eventual decision.

The (modest) goal of hypothesis testing is to reduce the directly-relevant data to a “level of suspicion” based purely on the data. That level of suspicion can then be combined (outside of hypothesis testing) with your assessments of costs and prior beliefs, to help you reach a good “belief” decision. The end result is to fill in the blank below:

“The data, all by itself, makes me __________ suspicious.”
{not at all, slightly, moderately, quite, very, overwhelmingly}

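To fill in the blank, we compute a number: the probability that, in a world where the statement is true, sampling error alone would produce evidence at least as contradictory to the statement as the evidence we have actually seen. This number is called the significance level of the data, with respect to the statement under investigation (i.e., with respect to the null hypothesis).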

Then, we interpret the number: A small value forces us to say, “Either the statement is true, and we’ve been very unlucky (i.e., we’ve drawn a very misrepresentative sample), or the statement is false. We don’t typically expect to be very unlucky, so the data, all by itself, makes us quite suspicious.”

Example: Later in the day, Mary returns from her meeting. She reports that the credit manager told her that there were 10,000 customers holding credit accounts with the firm. He justified the claimed value of receivables by telling her that the average amount due to be paid per account was at least $300.

With his permission, she confirmed (by physical measurement) the existence of about 10,000 customer folders. (You decide to accept this part of the credit manager’s claim.) She then selected a random sample of 64 accounts, and contacted the account-holders. They all acknowledged soon-to-be-paid debts to the firm. The sample mean amount due was $280, with a sample standard deviation of $120.

Taking the statement to be evaluated as “the true mean amount due is at least $300,” you determine the statistical significance level of the sample data with respect to this statement to be 9.36%.
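One way to arrive at a figure like this is a one-tailed t-test on the sample mean. The sketch below assumes that method (the text above gives only the result): it computes t = (sample mean - claimed mean) / (s / sqrt(n)), then the lower-tail probability of a Student t distribution with n - 1 degrees of freedom. The density is integrated numerically so that only the Python standard library is needed. All function names are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on the density between 0 and |x|."""
    if x == 0:
        return 0.5
    h = abs(x) / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(abs(x), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area

def significance_lower_tail(sample_mean, claimed_mean, sample_sd, n):
    """Significance of the data against the null 'true mean >= claimed_mean'."""
    t_stat = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))
    return t_stat, t_cdf(t_stat, n - 1)

# Mary's sample: n = 64, mean $280, standard deviation $120, claim of $300.
t_stat, p = significance_lower_tail(280, 300, 120, 64)
print(f"t = {t_stat:.3f}, significance level = {p:.2%}")
```

Under these assumptions the result should land close to the 9.36% quoted above. Using the normal distribution in place of the t distribution would give a slightly smaller figure, since it ignores the extra uncertainty from estimating the standard deviation from 64 observations.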

The “Null Hypothesis”

Hypothesis testing begins with a statement, known as the null hypothesis. The statement is typically one which says that there is “nothing” special going on, and that “no” dramatic action is called for; hence the name “null.” In the earlier examples, the null hypotheses would probably be that the investment is not good, that the forfeiture rate is no more than 10% (assuming that the more dramatic action would be taken if the rate is truly high), that the firm does not discriminate, or that the additive is not a carcinogen. (This is analogous to the legal assumption that the accused is innocent, until guilt is proven. Only strong evidence against the null hypothesis will lead us to go out on a limb and take some dramatic action.)

If a study is being conducted for the purpose of making an affirmative claim, then the null hypothesis must be taken as the opposite (i.e., the negation) of that claim.

Example: We’ve taken the position that we will only force the firm applying for the loan to submit its credit operation to an outside audit if we can claim affirmative evidence that their stated receivables don’t exist. If, instead, we chose to require that they provide affirmative justification of their claim, they would have to take as their null hypothesis that the average amount due per account holder was less than or equal to $300, and then show that their data contradicted that null hypothesis.

Significance Level

The significance level of the data is the probability that, if the null hypothesis were true, sampling error alone would produce evidence at least as contradictory to it as the evidence actually observed. The closer this probability is to 0, the more our observations differ from our expectations (under the most favorable assumption concerning the truth of the statement being tested). A significance level very close to zero is interpreted as: “If the statement were true, I would be very unlikely to see the kind of data that I am, in fact, seeing.” So, a close-to-zero significance level means that either (a) the null hypothesis is true, and we’ve been very unlucky in our data collection (i.e., we’ve just drawn a very misrepresentative sample), or (b) the null hypothesis is false. Since we don’t expect to be very unlucky (at least, on a regular basis!), the data, all by itself, makes us suspect that the statement is false.

It is important to note that the (numeric) significance level of the data does not tell us precisely how likely it is that the null hypothesis is true. That probability depends on our prior beliefs, as well.

To interpret the significance level of the data into words, use the following (unofficial) table:

numeric significance level of the data	interpretation: the data, all by itself, makes us
above 20%	not at all suspicious
between 10% and 20%	a little bit suspicious
between 5% and 10%	moderately suspicious
between 2% and 5%	very suspicious
between 1% and 2%	extremely suspicious
below 1%	overwhelmingly suspicious
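The table can be written as a small lookup. This is only a sketch of the unofficial table above. The function name is illustrative, and the handling of values that fall exactly on a boundary (assigned here to the more suspicious band) is an assumption, since the table does not say.

```python
def interpret_significance(p):
    """Map a significance level (a fraction, e.g. 0.0936) to the table's wording."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("significance level must be between 0 and 1")
    # (lower bound of band, wording); checked from least to most suspicious.
    bands = [
        (0.20, "not at all suspicious"),
        (0.10, "a little bit suspicious"),
        (0.05, "moderately suspicious"),
        (0.02, "very suspicious"),
        (0.01, "extremely suspicious"),
    ]
    for lower_bound, wording in bands:
        if p > lower_bound:
            return wording
    return "overwhelmingly suspicious"

print(interpret_significance(0.0936))  # moderately suspicious
print(interpret_significance(0.0049))  # overwhelmingly suspicious
```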

Example: Linguistically interpolating between “more than a little bit suspicious,” and “at the low end of moderately suspicious,” let’s say that a significance level of 9.36% makes us “a bit suspicious” of the initial claim. What should we now do?

It depends. Remember that our final “belief” decision depends on more than just the data. If, for example, we’ve known the credit manager all our lives – we grew up together, and meet socially on a regular basis – and he has a long-standing reputation as a manager who has never, ever made a statement about business matters under his authority which has not proven to be accurate, we’d probably reason: “The data, all by itself, makes us a bit suspicious, but given everything else we know about the credit manager, we’ll attribute the difference between his claim and what we’ve seen to sampling error (and we’ll believe the claimed receivables exist).” However, if we had outside reason to distrust the credit manager in the first place, we might well reason: “We were disinclined to believe anything he’d say in the first place. On top of that, the data all by itself makes us a bit suspicious.” And we’d probably ask for an outside audit. Note that in this case, we are allowing other knowledge to override what the data suggests.

On the other hand, imagine that Mary had come back reporting a sample mean of $260 instead. Now, the significance level of the data with respect to the original claim is 0.49%. The data, all by itself, makes us overwhelmingly suspicious, i.e., stands in overwhelmingly-strong contradiction to the original claim. In this case, obviously we wouldn’t believe the credit manager if he already had a bad reputation. But even if he were a trusted life-long friend, we might at least plan to meet him for lunch, to gently ask if he had any recent personal problems pulling his attention away from his work, and might have accidentally overstated the true mean credit balance due on customer accounts when asked to make the estimate used on the firm’s loan application. Under normal circumstances, we would have been inclined to believe his claim, but the data, all by itself, is enough to override our original inclinations.

Language

If the significance level of the data is (very) close to zero, then the data is said to be (highly) “statistically significant”. Yes, this is bizarre – but it is common terminology.

An alternative to reporting the precise significance level of the data is to bracket it: “The data is significant at the x% level, but not at the y% level,” means that the actual significance level of the data is less than x%, and greater than y%.
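As a small illustration of bracketing, the sketch below reports the tightest level a significance value clears and the next level it fails to clear. The set of levels (10%, 5%, 1%) is an assumption chosen because these are commonly quoted, and the function name is illustrative.

```python
def bracket_significance(p, levels=(0.10, 0.05, 0.01)):
    """Describe p by the tightest level it clears and the next one it misses."""
    cleared = [lv for lv in levels if p < lv]
    missed = [lv for lv in levels if p >= lv]
    if not cleared:
        return f"not significant at the {max(levels):.0%} level"
    if not missed:
        return f"significant at the {min(levels):.0%} level"
    return (f"significant at the {min(cleared):.0%} level, "
            f"but not at the {max(missed):.0%} level")

print(bracket_significance(0.0936))
# significant at the 10% level, but not at the 5% level
print(bracket_significance(0.0049))
# significant at the 1% level
```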

Statisticians will speak of a “critical significance level,” which summarizes all of your knowledge of relevant costs, as well as your prior beliefs. They will then tell you to “reject” the null hypothesis (i.e., decide to disbelieve the statement) if the significance level of the data is below this critical level, and to “accept” it otherwise. But they have great difficulty in telling you how the appropriate critical significance level can be determined. Real-world managerial usage therefore conforms closely to the presentation I’ve given above (i.e., you determine how suspicious the data, all by itself, makes you, and use this to help you reach a decision), rather than to the statisticians’ idealized view (in which you take yourself out of the picture before even looking at the data).

Practical Significance vs. Statistical Significance

Consider a “knife-edge” null hypothesis, such as “these two credit agencies experience the same mean default rate.” Even before any data is examined, we are certain that the null hypothesis is false, i.e., that the mean default rates are not exactly equal. But they might still be close enough for us to consider them equal for all practical purposes. When our study (testing the difference between the means) is complete, we will find ourselves in one of four situations:

Under the listed circumstances, you should (subject, of course, to what your prior beliefs tell you) be inclined to:

practical significance (of the observed difference between sample means)	high statistical significance (of the observed difference between sample means)	low statistical significance (of the observed difference between sample means)
high	reject the null hypothesis	collect more data (if the difference is real, it matters, but there’s not yet enough evidence to be sure it’s real)
low	accept the null hypothesis (a large sample managed to [statistically] detect a [practically] unimportant difference)	accept the null hypothesis

The two off-diagonal entries reflect the facts that (lower left) with enough data, you can detect differences that are so small as to have no practical impact, and (upper right) if you see a meaningful difference, but can't be sure that it's real, more data can resolve the issue (either the difference will go away, or it will persist and the evidence that it's real will strengthen).
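The lower-left case can be illustrated numerically. The sketch below holds a small observed difference fixed and lets the sample size grow. It assumes two independent samples of equal size, a common standard deviation, and a normal approximation; the numbers (a difference of 0.1 percentage points, a standard deviation of 5 points) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def two_tailed_significance(diff, sd, n):
    """Two-tailed significance of an observed difference between two sample means.

    Assumes two independent samples of size n each, a common standard
    deviation sd, and samples large enough for the normal approximation.
    """
    std_error = sd * math.sqrt(2 / n)
    z = diff / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: default rates differ by 0.1 percentage points, sd of 5 points.
for n in (100, 10_000, 1_000_000):
    p = two_tailed_significance(diff=0.1, sd=5.0, n=n)
    print(f"n = {n:>9,}: significance level = {p:.4f}")
```

The same difference moves from unremarkable to highly statistically significant purely because the sample grows, while its practical importance does not change at all.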

Warning: If you give yourself enough chances to be unlucky, some of them will pay off

Beware a study that collects data on a zillion different variables, and then looks for evidence of relationships via hypothesis testing. (This is sometimes referred to as “shotgun statistics.”) A consulting statistician to a national anti-nuclear-power group provided a very bad example a few years ago. He would collect health data (infant mortality rates, cancer rates, incidence of pneumonia, colds, influenza, percentage of women between the ages of 35 and 39 who slip in the shower, and the like – literally thousands of different health statistics) on communities located near nuclear power plants, and on other communities which were more remote.

He would then do thousands of “tests of differences between means”, each based on the null hypothesis that people living near power plants were at least as healthy (on some given dimension) as were those living further away. He would find that his data was highly significant with respect to a few of the null hypotheses, and headlines would appear: “People who live near nuclear power plants have a significantly higher number of sore throats!”

The dishonesty is that, if you were to test 1000 null hypotheses, all of which were indeed true, you would still expect to obtain data significant at the 1% level (i.e., data strongly contradicting the null hypothesis) about 10 times, just because of sampling error!
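The arithmetic behind this warning can be checked directly. The sketch below assumes the tests are independent and that, when a null hypothesis is exactly true, the significance level of the data is uniformly distributed between 0 and 1 (so each test can be simulated as one uniform random draw). The function name and the seed are illustrative.

```python
import random

def false_alarms(num_tests, level, seed=0):
    """Count how many true null hypotheses look 'significant' by chance alone.

    Each test is simulated as a single uniform draw standing in for the
    significance level of the data under a true null hypothesis.
    """
    rng = random.Random(seed)
    return sum(rng.random() < level for _ in range(num_tests))

num_tests, level = 1000, 0.01
expected = num_tests * level                     # average number of false alarms
p_at_least_one = 1 - (1 - level) ** num_tests    # assumes independent tests

print(f"expected false alarms: {expected:.0f}")
print(f"P(at least one false alarm): {p_at_least_one:.5f}")
print(f"one simulated study: {false_alarms(num_tests, level)} false alarms")
```

With 1000 tests at the 1% level, finding no “significant” results at all would be the surprising outcome.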

More Language: Names of Hypothesis-Testing Procedures

There are statistical procedures useful for testing many types of hypotheses. In most commercial statistics packages, they are listed by the names in the right-hand column below.

if your null hypothesis is that	then you perform a
μ = something	two-tailed t-test
μ ≥ something, or μ ≤ something	one-tailed t-test
μ₁ = μ₂	two-tailed test of differences, or of paired differences
μ₁ ≥ μ₂, or μ₁ ≤ μ₂	one-tailed test of differences, or of paired differences
the qualitative variables X and Y are independent	chi-square test of independence
the quantitative variable Y is not related to the qualitative variables X₁,...,X_k	analysis of variance
X has a specified distribution	chi-square test of goodness of fit