4.5 – Why Are the Variance Formulas Different?
[latex]\text{Population Variance}=\sigma^2=\frac{SS}{N}[/latex]
[latex]\text{Sample Variance}=s^2=\frac{SS}{n-1}[/latex]
As you can see above, the formulas for population and sample variance are slightly different (the denominator changes from “N” to “n – 1”), so you will get a different variance for the same set of scores depending on which formula you use. You may also note that the population variance formula gives a smaller result than the sample variance formula (unless every score is identical, in which case both give 0). There is a very good reason for these differences.
Because a sample doesn’t include everyone from the population, a variance calculated from a sample is technically only an estimate of the variability in the population (remember, the purpose of a sample is to give us an estimate of what is going on in the population). It turns out that samples tend to underestimate the variability in the population, so statisticians altered the formula slightly to account for this tendency.
A simple mathematical strategy to increase the size of the result of a fraction is to make the denominator smaller, which means you will be dividing by a smaller number and thus will get a larger result. The results of the above variance formulas demonstrate this with our previous set of data:
X | [latex]X^2[/latex] |
8 | [latex]8^2 = 64[/latex] |
5 | [latex]5^2 = 25[/latex] |
5 | [latex]5^2 = 25[/latex] |
4 | [latex]4^2 = 16[/latex] |
3 | [latex]3^2 = 9[/latex] |
[latex]\Sigma X = 25[/latex] | [latex]\Sigma X^2 = 139[/latex] |
[latex]\text{Sum of Squares (Computational Formula)}=SS=\Sigma X^2-\frac{(\Sigma X)^2}{N}[/latex]
[latex]=139 - \frac{(25)^2}{5} = 139 - \frac{625}{5} = 139 - 125 = 14[/latex]
[latex]\text{Population Variance}=\sigma^2=\frac{SS}{N}=\frac{14}{5}=2.8[/latex]
[latex]\text{Sample Variance}=s^2=\frac{SS}{n-1}=\frac{14}{5-1}=\frac{14}{4}=3.5[/latex]
In both variance fractions ([latex]\frac{14}{5} \text{ and } \frac{14}{4}[/latex]) the numerator is 14, but in the sample variance version, the denominator is smaller, and thus the variance result is larger (3.5 versus 2.8). Thus, using n – 1 in the denominator of the sample variance formula will make the result larger than it would be if we just used n.
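The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the cross-check uses the standard library's `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n – 1).

```python
import statistics

scores = [8, 5, 5, 4, 3]
n = len(scores)

# Computational formula: SS = sum(X^2) - (sum(X))^2 / N
sum_x = sum(scores)                          # 25
sum_x_squared = sum(x ** 2 for x in scores)  # 139
ss = sum_x_squared - sum_x ** 2 / n          # 14.0

population_variance = ss / n        # divide by N
sample_variance = ss / (n - 1)      # divide by n - 1

print(ss, population_variance, sample_variance)  # 14.0 2.8 3.5

# Cross-check against the standard library
print(statistics.pvariance(scores), statistics.variance(scores))
```

The numerator is the same in both calculations; only the denominator differs, which is why the sample variance comes out larger.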
Degrees of Freedom
It is then reasonable to ask why “n – 1” and not “n – 2” or “n – 1.43”? It turns out that there is a very specific reason, and it has to do with a concept called degrees of freedom (df). When calculating a statistic based on a sample of scores, there are certain constraints on how many of the scores in the sample are independent and thus “free to vary.” Degrees of freedom measures how many of the scores are free to vary.
It is difficult to make sense of the degrees of freedom concept without a tangible example, so let’s imagine that we have a sample of n = 3 scores that have a mean of M = 10. What are the three scores?
At this point, that might feel like a silly question because there are lots of possible sets of 3 scores that have a mean of M = 10. For example, all of the following sets of 3 scores have a mean of M = 10:
- 10, 10, 10
- 9, 10, 11
- 5, 10, 15
- -100, 30, 100
- 6.75, 14.25, 9
- …
What we are seeing from all of these examples is that the scores in our group are “free to vary”: there are lots of different possibilities for each of the scores. For a group of n = 3 scores to have a mean of M = 10, the scores simply need to add up to Σx = 30:
[latex]M=\frac{\Sigma x}{n}[/latex]
[latex]10=\frac{\Sigma x}{3}[/latex]
[latex]10(3)=\frac{\Sigma x(3)}{3}[/latex]
[latex]30=\Sigma x[/latex]
However, what if you were asked to pick just the first two scores? Again, you would be able to come up with lots of possible examples. For example, you might come up with 6 for the first score and 13 for the second score. Or you might come up with 3.4 and -16. The point is that the first two scores are free to vary. Let’s stick with our first choice, 6 and 13. Now, what is the third score? It has to be 11, because the three scores need to add up to 30: 6 + 13 + 11 = 30. In other words, the third score is not “free to vary.” Because we know the first two scores, and we know the mean is M = 10, the third score is set.
The first two scores, as we pointed out, are still free to vary. They don’t have to be 6 and 13. Maybe they are 25 and 2. In this case, the third score now must be 3 because they need to add up to 30: 25 + 2 + 3 = 30.
What we are starting to see is that our group of n = 3 scores with a mean of M = 10 has n – 1 scores that are free to vary. Since our n = 3, we can calculate the degrees of freedom:
[latex]df=n-1=3-1=2[/latex]
In other words, two of the three scores are free to vary.
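The same idea can be sketched in Python. The function name `last_score` is made up for this illustration; it simply works out the one score that is forced once the other n – 1 scores and the mean are known.

```python
def last_score(free_scores, mean):
    """Return the score that is fixed once the other n - 1 scores are chosen."""
    n = len(free_scores) + 1       # the free scores plus the one fixed score
    required_total = mean * n      # the scores must sum to M * n
    return required_total - sum(free_scores)

# The first two scores can be anything; the third is then determined.
print(last_score([6, 13], 10))   # 11
print(last_score([25, 2], 10))   # 3

# Degrees of freedom: how many scores we were free to choose
n = 3
df = n - 1
print(df)  # 2
```

Whatever two values are passed in, there is exactly one value the function can return, which is what it means for the third score to not be free to vary.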
Hopefully, you notice that this matches the denominator of our sample variance formula, where we use “n – 1.” This is not by chance. We are simply using the degrees of freedom.
An Unbiased Sample Variance
As noted above, we use “n – 1” in the denominator of the sample variance calculation because it reflects the degrees of freedom. It is also worth noting that using “n – 1” in the denominator results in an “unbiased” estimate of the population variance.
Remember that when we use a sample statistic to estimate a population parameter, it is just an estimate because samples aren’t always representative of the population. And, as noted above, samples tend to underestimate the population variance. It doesn’t happen with every sample, but it will happen with most samples. This is why we use “n – 1” in the denominator: it increases the size of the sample variance estimate.
What is particularly interesting is that it adjusts the estimate amazingly well, leading to an “unbiased” estimate.
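As a supplementary sketch (not part of the original argument), the size of the adjustment can be stated precisely. Assuming the n scores are drawn independently from a population with variance σ², and writing E[·] for the average over all possible samples (notation not used elsewhere in this section), the variance computed with n in the denominator averages out to

[latex]E\left[\frac{SS}{n}\right]=\frac{n-1}{n}\sigma^2[/latex]

so it falls short of σ² by the factor (n – 1)/n. Dividing by n – 1 instead cancels that factor exactly:

[latex]E\left[\frac{SS}{n-1}\right]=\frac{n}{n-1}\cdot\frac{n-1}{n}\sigma^2=\sigma^2[/latex]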
Let’s look at an example to understand this. Suppose we have the following population of scores:
- Scores: 1, 1, 4, 4, 10, 10
- Population Size: N = 6
- Population Mean: μ = 5
- Population Variance: σ² = 14
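These population values can be verified directly. The sketch below assumes the definitional sum of squares (the sum of squared deviations from the mean) and divides by N, since this is a whole population; the variable names are illustrative.

```python
import statistics

population = [1, 1, 4, 4, 10, 10]
N = len(population)

mu = sum(population) / N                      # population mean
ss = sum((x - mu) ** 2 for x in population)   # sum of squared deviations
sigma_squared = ss / N                        # population variance (divide by N)

print(mu, ss, sigma_squared)  # 5.0 84.0 14.0

# Cross-check with the standard library's population variance
print(statistics.pvariance(population))
```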
Now, imagine randomly sampling two scores (sample size of n = 2) from that population. Below are all the possible combinations of scores from this population that we could get, with their corresponding sample statistics:
Sample | First Score | Second Score | Mean (M) | Sample Variance (using n) | Sample Variance (using n – 1) |
1 | 1 | 1 | 1 | 0 | 0 |
2 | 1 | 4 | 2.5 | 2.25 | 4.5 |
3 | 1 | 10 | 5.5 | 20.25 | 40.5 |
4 | 4 | 1 | 2.5 | 2.25 | 4.5 |
5 | 4 | 4 | 4 | 0 | 0 |
6 | 4 | 10 | 7 | 9 | 18 |
7 | 10 | 1 | 5.5 | 20.25 | 40.5 |
8 | 10 | 4 | 7 | 9 | 18 |
9 | 10 | 10 | 10 | 0 | 0 |
Totals: | | | 45 | 63 | 126 |
There are a couple of important patterns to note about these results:
- None of the sample means provides a perfect estimate of the true population mean (which is μ = 5), which is due to sampling error
- However, the average of all the sample means is exactly the population mean (take the total of the sample means, 45, and divide by the number of sample means, 9, and you get 5)
- Likewise, none of the sample variances provides a perfect estimate of the true population variance (which is σ² = 14), again due to sampling error
- However, only the average of all the sample variances calculated using n – 1 is exactly the population variance (take the total of the sample variances calculated using n – 1, 126, and divide by the number of sample variances, 9, and you get 14)
What this demonstrates is twofold. First, sample statistics aren’t perfect. Remember that they are only able to estimate the population parameters. And their ability to make an accurate estimate is going to be based on how well the sample represents the population. However, sample statistics can do a perfect job of estimating when taken all together. In other words, some sample statistics will underestimate, and some will overestimate, but on average they can give us an accurate estimate.
Second, and this is important to our formula for sample variance, using n – 1 in the denominator of the variance calculation creates an “unbiased” estimate of the population variance. If we used n in the denominator instead, our sample variance would be “biased” and underestimate the population variance. To see this, simply take the total of the sample variances calculated using n, 63, and divide that by the number of those sample variances, 9, and you get 7. In other words, if we use n in the denominator, the average of all the sample variances (7) does not match the population variance (14), and instead underestimates it.
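The whole table can be reproduced with a short Python sketch. It assumes, as the nine rows of the table imply, that each of the three distinct score values (1, 4, 10) is equally likely on each draw and that the same value can be drawn twice; the helper names `variance` and `average` are made up for this illustration.

```python
from itertools import product

values = [1, 4, 10]   # distinct score values in the population
n = 2                 # sample size

def variance(sample, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    m = sum(sample) / len(sample)
    ss = sum((x - m) ** 2 for x in sample)
    return ss / denominator

def average(xs):
    return sum(xs) / len(xs)

# All 9 ordered (first score, second score) samples
samples = list(product(values, repeat=n))

means = [sum(s) / n for s in samples]
var_n = [variance(s, n) for s in samples]              # biased: divide by n
var_n_minus_1 = [variance(s, n - 1) for s in samples]  # unbiased: divide by n - 1

print(sum(means), sum(var_n), sum(var_n_minus_1))  # 45.0 63.0 126.0
print(average(means))          # 5.0  -> equals the population mean
print(average(var_n))          # 7.0  -> underestimates the population variance
print(average(var_n_minus_1))  # 14.0 -> equals the population variance
```

Only the n – 1 version averages out to the population variance of 14; the version that divides by n averages to 7.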
In the end, though, all of this simply demonstrates the reasons why we use n – 1 in the denominator of the sample variance formula. It’s not important to understand the specifics; you just need to use the correct formula.
Degrees of freedom: The number of scores in the calculation of a statistic that are independent and thus free to vary.
Sample statistic: A measurement based upon a sample taken from the population of interest.
Population parameter: A measurement based upon all of the individuals in the population of interest.
Sampling error: An explanation for the difference between a sample statistic and the corresponding population parameter.
A set of calculations or strategies that aid researchers in answering research questions.