8.6 – Step 2: Establish the Criterion for a Decision
Once we have stated the null hypothesis, which predicts no treatment effect or relationship, we need to determine what research results would make us question whether the null hypothesis is the best explanation for the results of our study. To do this, we think about what kinds of results we would expect if the null hypothesis were true and what kinds of results would be unlikely if the null hypothesis were true. Then, we determine a cut-off point where we are comfortable saying that any research result short of this cut-off point is consistent with the null hypothesis, and any result at or beyond the cut-off point is not consistent with the null hypothesis. Essentially, we will “draw a line in the sand” and then see if our research results cross that line.
In setting this criterion, we will use probability. Specifically, we are going to specify what is called the significance level, or the “alpha level,” depicted by α (the Greek letter, “alpha”), which is the probability of rejecting the null hypothesis, H0, if the null hypothesis is true (H0 = true). In other words, it is the probability of claiming that there is an effect, difference, or correlation when there actually isn’t one.
It is important to remember that this is a conditional probability. It is the probability of one thing given another thing. In this case, it is the probability of rejecting the null hypothesis (claiming an effect, difference, or correlation), given that the null hypothesis is true (there is actually no effect, difference, or correlation). It would be depicted as: “P (reject H0 | H0 = true).”
Conventionally, α is usually set at 0.05 or 0.01, with 0.05 being the most common. This will also, for the most part, be the case for examples in this textbook.
When α = 0.05, we are setting our criterion, or cut-off, at the point at which there is a probability of 0.05, or a 5% chance, of rejecting the null hypothesis when the null hypothesis is true.
Once the alpha level is selected, it can be used to determine the critical value of the test statistic and thus the cutoff point for the critical region (also known as the “region of rejection”). This critical value, or values (depending on whether you are doing a one-tailed test or a two-tailed test), marks the “line in the sand”: the point at which a result would no longer be considered consistent with the null hypothesis and would lead us to think that the null hypothesis is “probably wrong.”
As an example, let’s say that researchers wanted to explore whether sleep deprivation impacts short-term memory. Specifically, they believe that sleep deprivation will impair short-term memory. To test their hypothesis, they designed a research study. They invited 100 people to their lab to spend the night. Participants were woken up every hour throughout the night. Then in the morning, they were asked to complete the Saenz Short-Term Memory Test (SSTMT), which under normal circumstances is normally distributed and has a mean of μ = 50 and a standard deviation of σ = 10.
Because the scientific method requires researchers to test the null hypothesis, we can then explore what it would look like if these 100 people were to take the SSTMT without any impairment. Because the SSTMT is normally distributed and has a mean of μ = 50 and a standard deviation of σ = 10, we can use the Central Limit Theorem to predict all the possible sample means for samples of 100 people.
The Central Limit Theorem predicts the following about these sample means:
- They will form a roughly normal, or bell-shaped, distribution.
- The mean of all these SSTMT sample means will be roughly the same as the population mean of the SSTMT (μ = 50).
- The standard deviation of these sample means, called the standard error, can be calculated as σM = σ/√n = 10/√100 = 1.
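To make these predictions concrete, here is a minimal simulation sketch in Python (assuming NumPy is available; the seed and the number of simulated samples are arbitrary choices for illustration). It computes the standard error for the SSTMT example and checks that simulated sample means behave the way the Central Limit Theorem predicts.

```python
# Minimal sketch: the distribution of sample means for the SSTMT example
import numpy as np

mu, sigma, n = 50, 10, 100            # SSTMT population parameters and sample size
standard_error = sigma / np.sqrt(n)   # sigma_M = sigma / sqrt(n) = 10 / 10 = 1
print(standard_error)                 # 1.0

# Simulate 10,000 samples of n = 100 scores each and compute every sample mean
rng = np.random.default_rng(seed=1)
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(round(sample_means.mean(), 2))  # close to 50 (the population mean)
print(round(sample_means.std(), 2))   # close to 1 (the standard error)
```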
Thus, we can create a graph of the distribution of all of the possible sample means as follows:
Remember that this distribution of sample means depicts all of the possible sample means we could get from samples of 100 people who were given the SSTMT without any treatment effect. We expect most people to score around 50 on the SSTMT because that is the average score. However, we will also have people score above or below the mean because of their relatively poor or good memory skills. Thus, if we think of this graph as a distribution of many sample means, with each sample mean represented by a box, it would look like this:
We can see that there are a lot of sample means near 50, and as you get farther from 50, you see fewer and fewer sample means. Below, for example, you can see one sample mean score of 51.25 highlighted in the distribution:
Now that we can visualize all of the possible sample means that we could get if there is no treatment effect (the null hypothesis is true; in other words, H0: μsleep deprivation = 50), we can use that distribution to “draw a line in the sand” that delineates where we start to get sample means that really make us question whether the null hypothesis is actually true.
Sample means near 50 are consistent with the null hypothesis because that is exactly what we would predict a sample mean would likely be if there were no treatment effect. Why? Think of it this way. We know that people normally score around 50 on the SSTMT (we are told this when we are told that the SSTMT has a population mean of μ = 50). Then, if we took a sample of n = 100 people and gave them the SSTMT without doing anything to them (in other words, no treatment effect), we would expect the mean of those 100 people to be pretty close to 50 on the test.
On the other hand, if we get a sample mean that is far away from 50, that might start to make us wonder if something weird is going on. By now, hopefully, we know that sampling error exists, so we should not expect all our sample means to be exactly 50. Sample means will differ somewhat from 50 in either direction because of sampling error.
Yet, by calculating the standard error (σM), we actually know the average amount of sampling error that we can expect. Based on our calculation of standard error with a sample of 100 individuals, we can expect about 1 point of sampling error. In other words, our sample means should be right around 50, plus or minus about 1 point on average. That is what the graphs above depict.
Then, with that knowledge, we can use probability and our alpha (α) to “draw our line in the sand” and designate our critical region or regions. This would allow us to determine the point at which a sample mean of people who received our treatment would be so unusual that we would decide that it was very unlikely that the treatment didn’t have an effect. Setting the alpha level will determine exactly how “unlikely.”
Let’s say the researchers use an alpha level of α = 0.05. This means that they only want a 5% chance (probability of 0.05) that they would reject the null hypothesis (claim an effect, difference, or correlation) when the null hypothesis is actually true (there is no effect, difference, or correlation).
This 5% (0.05 probability) can then be applied to the distribution of sample means. Specifically, we will determine the 5% of the distribution of sample means that is the most inconsistent with the null hypothesis and thus most consistent with the alternative hypothesis.
There are two ways this can be done:
- Directional Hypothesis, or “One-Tailed” Hypothesis
- Non-Directional Hypothesis, or “Two-Tailed” Hypothesis
We will explore directional hypotheses first because they tend to be the most consistent with common sense. However, non-directional hypotheses are more common because they are considered to be a little more conservative and more appropriate for most situations.
Directional or One-Tailed Hypotheses
Directional hypothesis tests specify the “direction” that the researchers predict the results would go if there is an effect, difference, or correlation. For example, the researchers exploring the impact of sleep deprivation on memory would likely predict that sleep deprivation would lower memory scores. In other words, they would predict that the direction of the effect would be a negative effect.
The hypotheses in this study would be:
- Alternative: Sleep deprivation will reduce memory.
- Null: Sleep deprivation will not reduce memory.
Or, in symbols:
- H1: μsleep deprivation < 50
- H0: μsleep deprivation ≥ 50
Then, if we combine an alpha of α = 0.05 and a directional hypothesis, we can look at the distribution of sample means and identify the 5% of the distribution that is inconsistent with the null hypothesis and thus consistent with the alternative hypothesis.
Using the null hypothesis, we would conclude that sample means near 50 or higher than 50 are consistent. However, sample means less than 50 are not as consistent. But the question is, how much lower than 50? That’s where the alpha comes in. It would be the 5% of sample means that are extremely low. Thus, we get the following:
In this distribution, we have shaded the 5% of sample means that are unusually low. The sample means in this shaded area are inconsistent with the null hypothesis (that sleep deprivation will not reduce memory) and instead are more consistent with the alternative hypothesis (that sleep deprivation will reduce memory).
As you can see, we have shaded only a single tail of the distribution. That is why directional hypotheses are called “one-tailed” hypotheses.
Remember that the scientific method requires us to test the null hypothesis. We actually never test the alternative hypothesis. So, instead of possibly being able to prove that sleep deprivation impairs memory, the best we can do is get results that are inconsistent with the idea that sleep deprivation does not impair memory. In other words, the most scientifically backed treatments or effects have not been proven but instead have consistently failed to be disproven.
Technically, what we have just shaded is called the critical region. It is the region of the distribution where the researcher would start to question the null hypothesis.
We can then use the critical region to “draw a line in the sand,” where that line is the cut-off point at which a sample mean is now in the critical region. To do this, we will simply use the techniques we learned in the previous two chapters to determine a cut-off point for a distribution percentage.
Looking at the distribution of sample means above, we can see that the shaded area is a “tail.” We then need to convert our percentage to a proportion. To convert our percentage of 5% to a proportion, we simply divide it by 100. Thus, we look for the proportion of 0.0500 (make sure to use four decimal places because the proportions in our Unit Normal Table go to four decimal places).
As you can see in the table, the exact proportion of 0.0500 does not exist in the Tail column. The two closest proportions to 0.0500 are 0.0505 and 0.0495. Typically, we would pick the proportion that is closest to ours, but when we find the difference between 0.0500 and each of these table proportions, we see that both 0.0505 and 0.0495 are exactly 0.0005, or 5 ten-thousandths, away:
0.0505 – 0.0500 = 0.0005
0.0500 – 0.0495 = 0.0005
As stated in Chapter 6, when a situation like this happens and the target proportion is equidistant from two proportions in the Unit Normal Table, we will choose the higher of the two z-scores. At the time, we indicated that the reason for this would make more sense in the future and has to do with being careful not to exceed a specified probability when we are using hypothesis testing and inferential statistics. And now is the time to explain that.
When we set an alpha level of α = 0.05, we are declaring that we only want a 0.05 probability, or 5% chance, of making a Type I error (claiming an effect, difference, or correlation when, in reality, there isn’t one). If we follow our rule and choose the higher of the two z-scores, in this case z = 1.65, we would have a tail of 0.0495, or 4.95%, which is a bit smaller than we wanted. If, on the other hand, we chose the lower z-score, z = 1.64, we would have a tail of 0.0505, or 5.05%, which is a bit larger than we wanted, and that is where we run into a problem with inferential statistics. With an alpha level of α = 0.05, we are essentially declaring that we want no more than a 0.05 probability, or 5% chance, of making a Type I error. Using the z-score cutoff of 1.64 would violate that, so we use the higher z-score of z = 1.65.
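As a check on this table lookup, here is a minimal sketch (assuming SciPy is available) that computes the exact one-tailed cutoff for α = 0.05 and compares the tail areas beyond z = 1.64 and z = 1.65.

```python
# Minimal sketch: one-tailed cutoff for alpha = 0.05
from scipy.stats import norm

alpha = 0.05
print(round(norm.ppf(alpha), 4))  # -1.6449, the exact lower-tail cutoff

# Tail area beyond each of the two candidate z-scores from the Unit Normal Table
print(round(norm.sf(1.64), 4))    # 0.0505 -> slightly more than alpha (too large)
print(round(norm.sf(1.65), 4))    # 0.0495 -> stays within alpha
```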
Be careful, though, because there is one last step that can be critically important. We should now return to our distribution of sample means to label the “line in the sand” for our critical region. Technically, this is called a critical value. To do this, however, we need to:
- Add a second axis below the axis for the Sample Means (M)
- Draw a tick mark (or tick marks in the case of two-tailed tests) at the critical region cutoff
- Label the tick mark (or tick marks) with the appropriate z-score(s) that we just found in the Unit Normal Table for the appropriate alpha level.
- Make sure that the sign (“+” or “-”) is correct. Z-scores to the left of the mean are negative; z-scores to the right of the mean are positive.
Because our critical region is to the left of the mean, the z-score cutoff should be -1.65.
We have now drawn a “line in the sand,” indicating the point at which we will officially think that a statistical result is too weird to happen if the null is true. This cutoff point is called the critical value and is often depicted as:
zcritical = -1.65
The word “critical” in the subscript is an indication that this is the critical value of the z-score that determines our critical region. It is abbreviated as “zcrit” in some places.
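If it helps to see where this “line in the sand” falls on the sample-mean axis itself, the critical z-score can be converted back into a sample mean using the standard error. A minimal sketch, using the SSTMT values from this example:

```python
# Minimal sketch: converting the critical z-score back to a sample-mean cutoff
mu = 50                   # SSTMT population mean
sigma_M = 10 / 100**0.5   # standard error = sigma / sqrt(n) = 1
z_critical = -1.65        # one-tailed cutoff for alpha = 0.05

M_critical = mu + z_critical * sigma_M   # M = mu + z * sigma_M
print(M_critical)  # 48.35: sample means at or below this fall in the critical region
```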
Now that we’ve drawn the “line in the sand” and determined exactly how extreme a result we would need before we would question the null hypothesis, we are ready to run our study and see if sleep deprivation leads to a sample mean memory score low enough that we would question the hypothesis that sleep deprivation does not reduce memory scores (the null hypothesis).
Non-Directional or Two-Tailed Hypotheses
Unlike directional hypotheses, non-directional hypotheses do not specify a direction but are instead interested in detecting impacts in either direction. Using the example of the research on sleep deprivation and memory, a non-directional hypothesis test would predict that sleep deprivation impacts memory in either direction. In other words, the researchers would predict that the effect could be either positive or negative.
The hypotheses in this study would be:
- Alternative: Sleep deprivation will affect memory.
- Null: Sleep deprivation will not affect memory.
Or, in symbols:
- H1: μsleep deprivation ≠ 50
- H0: μsleep deprivation = 50
Then, if we combine an alpha of α = 0.05 and non-directional hypotheses, we can look at the distribution of sample means and identify the 5% of the distribution that is inconsistent with the null hypothesis and thus consistent with the alternative hypothesis.
Using the null hypothesis, we would conclude that sample means near 50 are consistent. However, sample means that are either higher or lower than 50 are not as consistent. But the question is, how much higher or lower than 50? That’s where the alpha comes in. It would be the 5% of sample means that are the most extreme (in either direction). Thus, we get the following:
In this distribution, we have shaded the 5% of sample means that are either extremely low or extremely high. The sample means in this shaded area are inconsistent with the null hypothesis (that sleep deprivation will not affect memory) and instead are more consistent with the alternative hypothesis (that sleep deprivation will affect memory).
As you can see, we have shaded two tails of the distribution. That is why non-directional hypotheses are called “two-tailed” hypotheses.
Because there are two tails, it is important to note that when using a two-tailed hypothesis, our alpha level of α = 0.05 needs to be split in half to be distributed equally on both sides. Thus, each of the tails depicts 2.5% (half of 5%) of the entire distribution.
Technically, what we have just shaded is called the critical region. It is the region of the distribution where the researcher would start to question the null hypothesis.
We can then use the critical region to “draw lines in the sand,” where those lines are the cut-off points at which a sample mean would be considered to be in the critical region. To do this, we will simply use the techniques we learned in the previous two chapters to determine the cut-off points for a distribution percentage.
Looking at the distribution of sample means above, we can see that the shaded areas are both “tails.” We then need to convert our percentage to a proportion. At this point, we can focus on the tail to the right. To convert our percentage of 2.5% to a proportion, we simply divide it by 100. Thus, we look for the proportion of 0.0250 (make sure to use four decimal places because the proportions in our Unit Normal Table go to four decimal places).
As you can see in the table, the exact proportion of 0.0250 does exist in the Tail column. Thus, we will use the corresponding z-score of 1.96 as our cutoff for the tail to the right. Because normal distributions are symmetrical, we also now know the z-score cutoff for the tail to the left; it will be -1.96.
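Here, too, we can check the table lookup with a minimal sketch (assuming SciPy is available) that splits α = 0.05 across the two tails and finds both cutoffs.

```python
# Minimal sketch: two-tailed cutoffs for alpha = 0.05 (alpha/2 = 0.025 per tail)
from scipy.stats import norm

alpha = 0.05
upper = norm.ppf(1 - alpha / 2)   # 1.96, right-tail cutoff
lower = norm.ppf(alpha / 2)       # -1.96, left-tail cutoff (by symmetry)
print(round(upper, 2), round(lower, 2))

# Check: the two shaded tails together contain 5% of the distribution
print(round(norm.sf(upper) + norm.cdf(lower), 4))  # 0.05
```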
We can now return to our distribution of sample means to label the “lines in the sand” for our critical regions, our critical values. To do this, however, we need to:
- Add a second axis below the axis for the Sample Means (M)
- Draw a tick mark (or tick marks in the case of two-tailed tests) at the critical region cutoff
- Label the tick mark (or tick marks) with the appropriate z-score(s) that we just found in the Unit Normal Table for the appropriate alpha level.
- Make sure that the sign (“+” or “-”) is correct. Z-scores to the left of the mean are negative; z-scores to the right of the mean are positive.
We have now drawn the “lines in the sand,” indicating the points at which we will officially think that a statistical result is too weird to happen if the null is true. These cutoff points are called the critical values and are depicted as:
zcritical = ±1.96
It can be helpful to remember that non-directional, or two-tailed, hypotheses will always have two critical values and will use the “plus/minus” sign (±).
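And, as with the one-tailed test, the two critical z-scores can be converted back into sample-mean cutoffs so you can see where the two “lines in the sand” fall. A minimal sketch, using the same SSTMT values:

```python
# Minimal sketch: converting z_critical = ±1.96 back to sample-mean cutoffs
mu = 50
sigma_M = 10 / 100**0.5   # standard error = 1
z_critical = 1.96

lower_cutoff = mu - z_critical * sigma_M   # 48.04
upper_cutoff = mu + z_critical * sigma_M   # 51.96
print(lower_cutoff, upper_cutoff)  # sample means beyond either cutoff fall in a critical region
```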
Now that we’ve drawn the “lines in the sand” and determined exactly how extreme of a result we would need before we would question the null hypothesis, we are ready to run our study and see if sleep deprivation leads to a memory score that is extreme enough that we would question the hypothesis that sleep deprivation will not affect memory scores (the null hypothesis).
Why is this step of establishing the criterion for a decision important?
- Setting a criterion before running the research study helps prevent researchers from drawing their preferred conclusions (the alternative hypothesis) by using the “fuzziness” of probability.
- Inferential statistics involve making research decisions based on a limited amount of information. These decisions are essentially “guesses” about the true nature of the world, and the use of probability helps make those decisions “educated guesses.” By following the steps of hypothesis testing and establishing the criterion for a decision before running the study and collecting data, these “educated guesses” are not based on feelings or beliefs, but instead are based on logic, probability, and empirical data.
- Determining a specific criterion and sharing that with the research community (e.g., setting the alpha level and number of tails) establishes a transparent process that can be evaluated by the research community.
- Researchers are expected to set their criteria in a manner that results in a justifiable balance between Type I and Type II errors. While not all members of the research community may agree with the criteria, being transparent with them allows the community to address and discuss the relative merits of the research.
Key terms used in this section:
- Alpha level (also known as the “significance level”): The probability of a research study rejecting the null hypothesis, given that the null hypothesis is true; P(reject H0 | H0 = true).
- Critical region (also known as the “region of rejection”): The region (or regions, in the case of non-directional hypotheses) of a distribution that contains test statistics that are unlikely to occur if the null hypothesis is true, and thus lead the researcher to reject the null hypothesis.
- Critical value: The test statistic score (or scores, in the case of non-directional hypotheses) that determines the cutoff for the critical region of a hypothesis test.
- Sampling error: An explanation for the difference between a sample statistic and the corresponding population parameter.
- Standard error: The average amount of sampling error you can expect between a sample mean with a given sample size, n, and the population mean.