Chapter 17: Linear Regression
Alisa Beyer
In chapter 14, we learned about ANOVA, which involves a new way of looking at how our data are structured and the inferences we can draw from that. In chapter 16, we learned about correlations, which analyze two continuous variables at the same time to see if they systematically relate in a linear fashion. In this chapter, we will combine these two techniques in an analysis called simple linear regression, or regression for short. Regression uses the technique of variance partitioning from ANOVA to more formally assess the types of relations looked at in correlations. Regression is the most general and most flexible analysis covered in this book, and we will only scratch the surface.
A major practical application of statistical methods is making predictions. Psychologists often call this kind of prediction regression. Regression literally means going back or returning. We use the term regression here because the predicted score on the criterion variable is closer (in terms of standard deviation units) to the mean of the criterion variable than the value of the predictor variable is to the mean of the predictor variable. So we can think of the predicted value of the criterion variable as regressing, or going back, toward the mean of the criterion variable. Again, the concepts in this chapter are directly related to correlation. This is because if two variables are correlated, it means that we can predict one from the other. So if sleep the night before is correlated with happiness the next day, we should be able, to some extent, to predict how happy a person will be the next day from knowing how much sleep the person got the night before. The concepts in the chapter are also related to ANOVA, as the goal of regression is the same as the goal of ANOVA: to take what we know about one variable (X) and use it to explain our observed differences in another variable (Y); here we are simply using two continuous variables.
Line of Best Fit
In correlations, we referred to a linear trend in the data. That is, we assumed that there was a straight line we could draw through the middle of our scatterplot that would represent the relation between our two variables, X and Y. Regression involves solving for the equation of that line, which is called the Line of Best Fit.
The distances between the line of best fit and each individual data point go by two different names that mean the same thing: errors and residuals. The term “error” in regression is closely aligned with the meaning of error in statistics (think standard error or sampling error); it does not mean that we did anything wrong, it simply means that there was some discrepancy or difference between what our analysis produced and the true value we are trying to get at. The term “residual” is new to our study of statistics, and it takes on a very similar meaning in regression to what it means in everyday parlance: there is something left over. In regression, what is “left over” (that is, what makes up the residual) is an imperfection in our ability to predict values of the Y variable using our line. This definition brings us to one of the primary purposes of regression and the line of best fit: predicting scores.
Prediction
The goal of regression is the same as the goal of ANOVA: to take what we know about one variable (X) and use it to explain our observed differences in another variable (Y). In ANOVA, we talked about – and tested for – group mean differences, but in regression we do not have groups for our explanatory variable; we have a continuous variable, like in correlation. Because of this, our vocabulary will be a little bit different, but the process, logic, and end result are all the same.
Regression equation
Ŷ = a + bX
The terms in the equation are defined as:
Ŷ: the predicted value of Y for an individual person
a: the intercept of the line
b: the slope of the line
X: the observed value of X for an individual person
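Additionally, we have formulas for a and b. The slope is the covariance of X and Y divided by the variance of X, and the intercept is found from the two means and the slope:
b = covXY / sX²
a = Ȳ − bX̄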
What this shows us is that we will use our known value of X for each person to predict the value of Y for that person. The predicted value, Ŷ, is called “y-hat” and is our best guess for what a person’s score on the outcome is. Notice also that the form of the equation is very similar to very simple linear equations that you have likely encountered before and has only two parameter estimates: an intercept (where the line crosses the Y-axis) and a slope (how steep – and the direction, positive or negative – the line is). These are parameter estimates because, like everything else in statistics, we are interested in approximating the true value of the relation in the population but can only ever estimate it using sample data. We will soon see that one of these parameters, the slope, is the focus of our hypothesis tests (the intercept is only there to make the math work out properly and is rarely interpretable).
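As a minimal sketch of how the two parameter estimates might be computed, the Python snippet below assumes the usual least-squares formulas: the slope is the covariance of X and Y divided by the variance of X, and the intercept is the mean of Y minus the slope times the mean of X. The function name `fit_line` and the five pairs of scores are hypothetical and are not data from this chapter.

```python
from statistics import mean

def fit_line(xs, ys):
    """Estimate the intercept a and slope b of the line of best fit."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products and sum of squares for X. The (n - 1) terms cancel
    # in the ratio covariance / variance, so the raw sums are enough.
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b = sp / ss_x            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical scores for five people (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
y_hat = a + b * 6  # predicted Y for a new person whose X is 6
print(round(a, 2), round(b, 2), round(y_hat, 2))
```

Once a and b are in hand, predicting a score is just a matter of plugging a person's X into Ŷ = a + bX, as the last lines show.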
Applied examples for using regression
Example 1: Businesses often have more applicants for a job than they have openings available, so they want to know who among the applicants is most likely to be the best employee. There are many criteria that can be used, but one is a personality test for conscientiousness, with the belief being that more conscientious (more responsible) employees are better than less conscientious employees. A business might give its current employees a personality inventory to assess conscientiousness and compare the results with existing performance data to look for a relation. In this example, we have known values of the predictor (X, conscientiousness) and outcome (Y, job performance), so we can estimate an equation for a line of best fit and see how accurately conscientiousness predicts job performance, then use this equation to predict future job performance of applicants based only on their known values of conscientiousness from personality inventories given during the application process.
Example 2: Assume a researcher is interested in examining whether SAT scores can be an accurate predictor of college GPA. In this case, SAT scores would be the predictor variable or X and college GPA would be the criterion variable or Y.
The key to assessing whether a linear regression works well is the difference between our observed Y values and our predicted Ŷ values. As mentioned in passing above, we use subtraction to find the difference between them (Y – Ŷ) in the same way we use subtraction for deviation scores and sums of squares. The value (Y – Ŷ) is our residual, which, as defined above, tells us how far our line of best fit is from our actual values. We can visualize residuals to get a better sense of what they are by creating a scatterplot and overlaying a line of best fit on it, as shown in Figure 1.
Figure 1. Scatterplot with residuals
The line of best fit is the straight line that keeps these residuals, taken together, as small as possible. We call this property of the line of best fit the Least Squares Error Solution. This term means that the solution, or equation, of the line is the one that provides the smallest possible value of the sum of the squared errors (squared so that they can be summed, just like in standard deviation) relative to any other straight line we could draw through the data.
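To make the least-squares idea concrete, the sketch below compares the sum of squared residuals for a fitted line with the same quantity for a few other straight lines. The scores and the helper name `sse` are hypothetical; the values a = 2.2 and b = 0.6 are the least-squares estimates for these particular made-up scores.

```python
def sse(xs, ys, a, b):
    """Sum of squared residuals (Y - Y-hat)^2 for the line Y-hat = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical scores (illustration only, not data from this chapter).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares line for these scores: a = 2.2, b = 0.6.
best = sse(xs, ys, 2.2, 0.6)

# A few other straight lines drawn through the same scatterplot.
others = [
    sse(xs, ys, 2.0, 0.6),  # same slope, lower intercept
    sse(xs, ys, 2.2, 0.8),  # same intercept, steeper slope
    sse(xs, ys, 4.0, 0.0),  # flat line at the mean of Y
]

print(round(best, 2), [round(v, 2) for v in others])
```

Every alternative line produces a larger sum of squared residuals than the fitted line, which is what "least squares" means. Trying only three alternatives does not prove the property, of course; it only illustrates it.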
Predicting Scores and Explaining Variance
We have now seen that the purpose of regression is twofold: we want to predict scores based on our line and, as stated earlier, explain variance in our observed Y variable just like in ANOVA. These two purposes go hand in hand, and our ability to predict scores is literally our ability to explain variance. That is, if we cannot account for the variance in Y based on X, then we have no reason to use X to predict future values of Y.
We know that the overall variance in Y is a function of each score deviating from the mean of Y (as in our calculation of variance and standard deviation). So, just like the red brackets in figure 1 representing residuals, given as (Y – Ŷ), we can visualize the overall variance as each score’s distance from the overall mean of Y, given as (Y – Ȳ), our normal deviation score. This is shown in figure 2.
Figure 2. Scatterplot with residuals and deviation scores.
We now have three pieces of information: the distance from the observed score to the mean, the distance from the observed score to the prediction line, and the distance from the prediction line to the mean. These are our three pieces of information needed to test our hypotheses about regression and to calculate effect sizes. They are our three Sums of Squares, just like in ANOVA. Our distance from the observed score to the mean is the Sum of Squares Total, which we are trying to explain. Our distance from the observed score to the prediction line is our Sum of Squares Error, or residual, which we are trying to minimize. Our distance from the prediction line to the mean is our Sum of Squares Model, which is our observed effect and our ability to explain variance. Each of these will go into the ANOVA table to calculate our test statistic.
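The three sums of squares described above can be sketched in a few lines of Python. The scores are hypothetical, and the line Ŷ = 2.2 + 0.6X is the least-squares line for these made-up scores. The point of the sketch is that the total splits exactly into a model part and an error part.

```python
from statistics import mean

# Hypothetical scores (illustration only, not data from this chapter).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares line for these scores

y_bar = mean(ys)
y_hats = [a + b * x for x in xs]

# Observed score to the mean: what we are trying to explain.
ss_total = sum((y - y_bar) ** 2 for y in ys)
# Observed score to the prediction line: what is left unexplained.
ss_error = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
# Prediction line to the mean: what the model explains.
ss_model = sum((yh - y_bar) ** 2 for yh in y_hats)

print(round(ss_total, 2), round(ss_model, 2), round(ss_error, 2))
```

The identity SS Total = SS Model + SS Error holds for the least-squares line; it does not generally hold for an arbitrary line drawn through the data.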
ANOVA Table
Our ANOVA table in regression follows the exact same format as it did for ANOVA (hence the name). Our top row is our observed effect, our middle row is our error, and our bottom row is our total. The columns take on the same interpretations as well: from left to right, we have our sums of squares, our degrees of freedom, our mean squares, and our F statistic.
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | ∑(Ŷ − Ȳ)² | 1 | SSM/dfM | MSM/MSE |
| Error | ∑(Y − Ŷ)² | N − 2 | SSE/dfE | |
| Total | ∑(Y − Ȳ)² | N − 1 | | |
As with ANOVA, getting the values for the SS column is a straightforward but somewhat arduous process. First, you take the raw scores of X and Y and calculate the means, variances, and covariance using the sum of products table introduced in our chapter on correlations. Next, you use the variance of X and the covariance of X and Y to calculate the slope of the line, b, the formula for which is given above. After that, you use the means and the slope to find the intercept, a, which is given alongside b. After that, you use the full prediction equation for the line of best fit to get predicted Y scores (Ŷ) for each person. Finally, you use the observed Y scores, predicted Y scores, and mean of Y to find the appropriate deviation scores for each person for each sum of squares source in the table and sum them to get the Sum of Squares Model, Sum of Squares Error, and Sum of Squares Total. As with ANOVA, you won’t be required to compute the SS values by hand, but you will need to know what they represent and how they fit together.
The other columns in the ANOVA table are all familiar. The degrees of freedom column still has N – 1 for our total, but now we have N – 2 for our error degrees of freedom and 1 for our model degrees of freedom; this is because simple linear regression only has one predictor, so our degrees of freedom for the model is always 1 and does not change. The total degrees of freedom must still be the sum of the other two, so our degrees of freedom error will always be N – 2 for simple linear regression. The mean square columns are still the SS column divided by the df column, and the test statistic F is still the ratio of the mean squares. Based on this, it is now explicitly clear that not only do regression and ANOVA have the same goal but they are, in fact, the same analysis entirely. The only difference is the type of data we feed into the predictor side of the equations: continuous for regression and categorical for ANOVA.
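The relations among the columns described above can be collected in one place. The subscripts M, E, and T (model, error, total) are shorthand introduced here for compactness; the relations themselves are those stated in the text for simple linear regression.

```latex
\begin{aligned}
SS_{T} &= SS_{M} + SS_{E}, & df_{T} &= df_{M} + df_{E} = 1 + (N - 2) = N - 1,\\
MS_{M} &= \frac{SS_{M}}{df_{M}}, & MS_{E} &= \frac{SS_{E}}{df_{E}},\\
F &= \frac{MS_{M}}{MS_{E}}.
\end{aligned}
```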
Hypothesis Testing in Regression
Regression, like all other analyses, will test a null hypothesis in our data. In regression, we are interested in predicting Y scores and explaining variance using a line, the slope of which is what allows us to get closer to our observed scores than the mean of Y can. Thus, our hypotheses concern the slope of the line, which is estimated in the prediction equation by b. Specifically, we want to test that the slope is not zero:
H0: There is no explanatory relation between our variables, H0: β = 0
HA: There is an explanatory relation between our variables, HA: β ≠ 0
or, if directional, specify the direction of the relation (positive or negative), HA: β > 0 or HA: β < 0
A non-zero slope indicates that we can explain values in Y based on X and therefore predict future values of Y based on X. Our alternative hypotheses are analogous to those in correlation: positive relations have values above zero, negative relations have values below zero, and two-tailed tests are possible. Just like ANOVA, we will test the significance of this relation using the F statistic calculated in our ANOVA table compared to a critical value from the F distribution table. Let’s take a look at an example of regression in action.
Example: Happiness and Well-Being
Researchers are interested in explaining differences in how happy people are based on how healthy people are. They gather data on each of these variables from 18 people and fit a linear regression model to explain the variance. We will follow the four-step hypothesis testing procedure to see if there is a relation between these variables that is statistically significant.
Step 1: State the Hypotheses
The null hypothesis in regression states that there is no relation between our variables. The alternative states that there is a relation, but because our research description did not explicitly state a direction of the relation, we will use a non-directional hypothesis.
H0: There is no explanatory relation between health and happiness, H0: β = 0
HA: There is an explanatory relation between health and happiness, HA: β ≠ 0
Step 2: Find the Critical Value
Because regression and ANOVA are the same analysis, our critical value for regression will come from the same place: the F distribution table, which uses two types of degrees of freedom. We saw above that the degrees of freedom for our numerator – the Model line – is always 1 in simple linear regression, and that the denominator degrees of freedom – from the Error line – is N – 2. In this instance, we have 18 people so our degrees of freedom for the denominator is 16. Going to our F table, we find that the appropriate critical value for 1 and 16 degrees of freedom is F* = 4.49, shown below in figure 3.
Figure 3. Critical value from F distribution table
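The tabled critical value can also be checked numerically. The sketch below uses only Python's standard library and relies on the fact that, with 1 numerator degree of freedom, an F statistic is the square of a t statistic. The function names and the numerical settings (Simpson's rule with 2,000 steps, 60 rounds of bisection) are choices made for this illustration and are not part of the chapter's procedure.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def f_upper_tail(f_value, df_error, steps=2000):
    """P(F > f_value) for an F distribution with 1 and df_error df.

    Because F with 1 numerator df equals t squared, the upper tail of F is
    the two-tailed area of t beyond sqrt(f_value). Simpson's rule (steps
    must be even) integrates the t density from 0 to sqrt(f_value).
    """
    upper = math.sqrt(f_value)
    h = upper / steps
    total = t_pdf(0, df_error) + t_pdf(upper, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    area_0_to_upper = total * h / 3
    return 1 - 2 * area_0_to_upper

def f_critical(df_error, alpha=0.05):
    """Find the F value whose upper-tail area is alpha, by bisection."""
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if f_upper_tail(mid, df_error) > alpha:
            low = mid    # tail area still too large: move right
        else:
            high = mid   # tail area at or below alpha: move left
    return (low + high) / 2

# 18 people, so the error degrees of freedom are 18 - 2 = 16.
print(round(f_critical(16), 2))
```

The result should agree with the tabled value of 4.49 to two decimal places. In practice, statistical software reports this value (or an exact p value) directly.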
Step 3: Calculate the Test Statistic
The process of calculating the test statistic for regression first involves computing the parameter estimates for the line of best fit. To do this, we first calculate the means, standard deviations, and sum of products for our X and Y variables, as shown below.
| X | (X − X̄) | (X − X̄)² | Y | (Y − Ȳ) | (Y − Ȳ)² | (X − X̄)(Y − Ȳ) |
| --- | --- | --- | --- | --- | --- | --- |
| 17.65 | -2.13 | 4.53 | 10.36 | -7.10 | 50.37 | 15.10 |
| 16.99 | -2.79 | 7.80 | 16.38 | -1.08 | 1.16 | 3.01 |
| 18.30 | -1.48 | 2.18 | 15.23 | -2.23 | 4.97 | 3.29 |
| 18.28 | -1.50 | 2.25 | 14.26 | -3.19 | 10.18 | 4.79 |
| 21.89 | 2.11 | 4.47 | 17.71 | 0.26 | 0.07 | 0.55 |
| 22.61 | 2.83 | 8.01 | 16.47 | -0.98 | 0.97 | -2.79 |
| 17.42 | -2.36 | 5.57 | 16.89 | -0.56 | 0.32 | 1.33 |
| 20.35 | 0.57 | 0.32 | 18.74 | 1.29 | 1.66 | 0.73 |
| 18.89 | -0.89 | 0.79 | 21.96 | 4.50 | 20.26 | -4.00 |
| 18.63 | -1.15 | 1.32 | 17.57 | 0.11 | 0.01 | -0.13 |
| 19.67 | -0.11 | 0.01 | 18.12 | 0.66 | 0.44 | -0.08 |
| 18.39 | -1.39 | 1.94 | 12.08 | -5.37 | 28.87 | 7.48 |
| 22.48 | 2.71 | 7.32 | 17.11 | -0.34 | 0.12 | -0.93 |
| 23.25 | 3.47 | 12.07 | 21.66 | 4.21 | 17.73 | 14.63 |
| 19.91 | 0.13 | 0.02 | 17.86 | 0.40 | 0.16 | 0.05 |
| 18.21 | -1.57 | 2.45 | 18.49 | 1.03 | 1.07 | -1.62 |
| 23.65 | 3.87 | 14.99 | 22.13 | 4.67 | 21.82 | 18.08 |
| 19.45 | -0.33 | 0.11 | 21.17 | 3.72 | 13.82 | -1.22 |
| totals/∑ | | | | | | |
| 356.02 | 0.00 | 76.14 | 314.18 | 0.00 | 173.99 | 58.29 |
From the raw data in our X and Y columns, we find that the means are X̄ = 19.78 and Ȳ = 17.45. The deviation scores for each variable sum to zero, so all is well there. The sums of squares for X and Y ultimately lead us to standard deviations of sX = 2.12 and sY = 3.20. Finally, our sum of products is 58.29, which gives us a covariance of covXY = 3.43, so we know our relation will be positive. This is all the information we need for our equations for the line of best fit.
Ŷ = 2.42 + 0.77X
We can plot this relation in a scatterplot and overlay our line onto it, as shown in figure 4.
Figure 4. Health and happiness data and line.
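The parameter estimates for the health and happiness example can be reproduced from the column totals in the sum of products table. The sketch below assumes the slope is the covariance divided by the variance of X and the intercept is Ȳ − bX̄; the variable names are chosen here for illustration.

```python
import math

n = 18
sum_x, sum_y = 356.02, 314.18   # column totals for X and Y
ss_x, ss_y = 76.14, 173.99      # sums of squared deviations
sp = 58.29                      # sum of products

x_bar, y_bar = sum_x / n, sum_y / n
s_x = math.sqrt(ss_x / (n - 1))  # standard deviation of X
s_y = math.sqrt(ss_y / (n - 1))  # standard deviation of Y
cov_xy = sp / (n - 1)            # covariance of X and Y

b = cov_xy / s_x ** 2            # slope
a = y_bar - b * x_bar            # intercept

print(round(x_bar, 2), round(y_bar, 2), round(s_x, 2), round(s_y, 2))
print(round(cov_xy, 2), round(b, 2), round(a, 2))
```

Carried at full precision, the slope rounds to 0.77 and the intercept comes out near 2.31. The intercept is sensitive to how the slope is rounded before it is used: the printed value of 2.42 is consistent with rounding the slope down to 0.76 first (17.45 − 0.76 × 19.78 ≈ 2.42). Differences of this size reflect rounding rather than a different method.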
We can use the line equation to find predicted values for each observation and use them to calculate our sums of squares model and error, but this is tedious to do by hand, so we will let the computer software do the heavy lifting in that column of our ANOVA table:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 44.62 | | | |
| Error | 129.37 | | | |
| Total | | | | |
Now that we have these, we can fill in the rest of the ANOVA table. We already found our degrees of freedom in Step 2:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 44.62 | 1 | | |
| Error | 129.37 | 16 | | |
| Total | | | | |
Our total line is always the sum of the other two lines, giving us:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 44.62 | 1 | | |
| Error | 129.37 | 16 | | |
| Total | 173.99 | 17 | | |
Our mean squares column is only calculated for the model and error lines and is always our SS divided by our df, which is:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 44.62 | 1 | 44.62 | |
| Error | 129.37 | 16 | 8.09 | |
| Total | 173.99 | 17 | | |
Finally, our F statistic is the ratio of the mean squares:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 44.62 | 1 | 44.62 | 5.52 |
| Error | 129.37 | 16 | 8.09 | |
| Total | 173.99 | 17 | | |
This gives us an obtained F statistic of 5.52, which we will now use to test our hypothesis.
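The steps used to fill in the table can be sketched as a small function. The name `anova_table` and the dictionary layout are hypothetical conveniences; the inputs are the sums of squares and sample size from this example.

```python
def anova_table(ss_model, ss_error, n):
    """Fill in the ANOVA table for a simple linear regression."""
    df_model, df_error = 1, n - 2
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return {
        # Each tuple is ordered (Model, Error, Total).
        "SS": (ss_model, ss_error, ss_model + ss_error),
        "df": (df_model, df_error, df_model + df_error),
        # Mean squares exist only for the Model and Error rows.
        "MS": (ms_model, ms_error),
        "F": ms_model / ms_error,
    }

table = anova_table(ss_model=44.62, ss_error=129.37, n=18)
print(table["df"], round(table["MS"][1], 2), round(table["F"], 2))
```

With these inputs the error mean square rounds to 8.09 and the F statistic to 5.52, matching the completed table.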
Step 4: Make the Decision
We now have everything we need to make our final decision. Our obtained test statistic was F = 5.52 and our critical value was F* = 4.49. Since our obtained test statistic is greater than our critical value, we can reject the null hypothesis.
Effect Size
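Effect size in regression is the proportion of variance in Y that the model explains, R² = SSM/SST. From the example above, we get R² = 44.62/173.99 = .26. We are explaining 26% of the variance in happiness based on health, which is a large effect size (R² uses the same effect size cutoffs as η²).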
Accuracy in Prediction
We found a large, statistically significant relation between our variables, which is what we hoped for. However, if we want to use our estimated line of best fit for future prediction, we will also want to know how precise or accurate our predicted values are. What we want to know is the average distance from our predictions to our actual observed values, or the average size of the residual (Y − Ŷ). The average size of the residual is known by a specific name: the standard error of the estimate s(Y− Ŷ). The formula is almost identical to our standard deviation formula, and it follows the same logic. For our example, s(Y− Ŷ) = 2.84. So on average, our predictions are just under 3 points away from our actual values. There are no specific cutoffs or guidelines for how big our standard error of the estimate can or should be; it is highly dependent on both our sample size and the scale of our original Y variable, so expert judgment should be used. In this case, the estimate is not that far off and can be considered reasonably precise.
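Both summary values for this example follow directly from the sums of squares. The sketch below assumes the standard error of the estimate is the square root of the error sum of squares divided by N − 2, which reproduces the value reported above; the variable names are chosen for illustration.

```python
import math

ss_model, ss_error, n = 44.62, 129.37, 18
ss_total = ss_model + ss_error

# Proportion of variance in Y explained by the model.
r_squared = ss_model / ss_total

# Typical size of a residual, in the units of the Y variable.
se_estimate = math.sqrt(ss_error / (n - 2))

print(round(r_squared, 2), round(se_estimate, 2))
```

Note that the standard error of the estimate is simply the square root of the error mean square from the ANOVA table, which is why it is on the same scale as the original Y scores.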
Quick recap of regression (without the math)
Two variables of regression
1. Predictor (X)
2. Criterion (Y)
With correlation it did not matter which variable was the predictor variable or the criterion variable. But with prediction we have to decide which variable is being predicted from and which variable is being predicted. The variable being predicted from is called the predictor variable. The variable being predicted is called the criterion variable. In equations, the predictor variable is usually labeled X, and the criterion is labeled Y.
The Linear Prediction Rule: Ideally, we want a prediction rule that is simple and that uses every case in the data to arrive at each prediction. A linear prediction rule starts with a baseline number and adds to it the person’s score on the predictor variable multiplied by a fixed number. The formal name for the baseline number is the regression constant, or just the constant. It has the name constant because it is a fixed value that we always add in to the prediction.
The number we multiplied by the person’s score on the predictor variable, b, is called the regression coefficient because a “coefficient” is a number we multiply by something.
Let’s revisit example 2, predicting college GPA from SAT scores. For our SAT and GPA example, the rule might be “to predict a person’s graduating GPA, start with .3 and add the result of multiplying .004 by the person’s SAT score.” So, the baseline number (a) would be .3 and the regression coefficient (b) would be .004. If a person had an SAT score of 600, we would predict that the person would graduate with a GPA of 2.7 (.3 + .004 × 600 = 2.7). This idea is known as the linear prediction rule. With a positive coefficient, lows go with lows and highs with highs; with a negative coefficient, lows go with highs and highs with lows.
Criterion Variable (Ŷ)
The variable we are predicting in a regression equation is called the criterion variable. It is labeled as Ŷ. The mark above Y indicates that this variable is a predicted variable and is dependent on the value of X.
Slope of the Regression Line (b)
The steepness of the angle of the regression line, called its slope, is the amount that the line moves up for every unit it is moved across. In our SAT example the line moves up .004 on the GPA scale for every additional point on the SAT. In fact, the slope of the line is exactly b, the regression coefficient.
Intercept of the Regression Line (a)
The point at which the regression line crosses or intersects the vertical axis is called the intercept.
- The intercept is the predicted score on the criterion variable when the score on the predictor variable is 0. It turns out that the intercept is the same as the regression constant.
- The reason this works is the regression constant is the number we always add in – a kind of baseline number, the number we start with.
- It is reasonable that the best baseline number would be the number we predict from a predictor score of 0.
In the SAT example the line crosses the vertical axis at .3. That is, when a person has an SAT score of zero, they are predicted to have a college GPA of .3.
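The SAT example can be written out as a short Python sketch. The function name `predict_gpa` is hypothetical; the constant (.3) and coefficient (.004) are the illustrative values used in the example above, not estimates from real data.

```python
def predict_gpa(sat, constant=0.3, coefficient=0.004):
    """Linear prediction rule: predicted GPA = constant + coefficient * SAT."""
    return constant + coefficient * sat

print(round(predict_gpa(600), 2))  # the worked case from the text
print(round(predict_gpa(0), 2))    # the intercept: prediction when SAT is 0
print(round(predict_gpa(601) - predict_gpa(600), 3))  # the slope: change per SAT point
```

The three calls mirror the three ideas in this recap: the prediction for a given score, the intercept as the prediction at a predictor score of 0, and the slope as the change in the prediction for each additional point on the predictor.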
Linear regression standardized coefficient (β)
Multiple Regression and Other Extensions
Simple linear regression as presented here is only a stepping stone towards an entire field of research and application. Regression is an incredibly flexible and powerful tool, and the extensions and variations on it are far beyond the scope of this chapter (indeed, even entire books struggle to accommodate all possible applications of the simple principles laid out here). The next step in regression is to study multiple regression, which uses multiple X variables as predictors for a single Y variable at the same time. The math of multiple regression is very complex but the logic is the same: we are trying to use variables that are statistically significantly related to our outcome to explain the variance we observe in that outcome. Other forms of regression include curvilinear models that can explain curves in the data rather than the straight lines used here, as well as moderation models that change the relation between two variables based on levels of a third. The possibilities are truly endless and offer a lifetime of discovery.
Learning Objectives
Having read this chapter, a student should be able to:
- Explain the concept of a linear equation, including slope and intercept
- Explain how regression is related to correlation and ANOVA
- Understand the concept of the least-squares solution
- Understand the concept of multiple regression
Exercises – Ch. 17
1. How are ANOVA and linear regression similar? How are they different?
2. What is a residual?
3. How are correlation and regression similar? How are they different?
4. What are the two parameters of the line of best fit, and what do they represent?
5. What is our criterion for finding the line of best fit?
6. Fill out the rest of the ANOVA tables below for simple linear regressions: a.
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 34.21 | 1 | 34.21 | |
| Error | | | | |
| Total | 66.12 | 54 | | |
7. In chapter 15, we found a statistically significant correlation between overall performance in class and how much time someone studied. Use the summary statistics calculated in that problem (provided here) to compute a line of best fit predicting success from study times: X̄ = 1.61, sX = 1.12, Ȳ = 2.95, sY = 0.99, r = 0.65.
8. Using the line of best fit equation created in problem 7, predict the scores for how successful people will be based on how much they study:
a. X = 1.20
b. X = 3.33
c. X = 0.71
d. X = 4.00
9. You have become suspicious that the draft rankings of your fantasy football league have no predictive value for how teams place at the end of the season. You go back to historical league data and find rankings of teams after the draft and at the end of the season (below) to test for a statistically significant predictive relation. Assume SSM = 2.65 and SSE = 337.35
| Draft Projection | Final Rankings |
| --- | --- |
| 1 | 14 |
| 2 | 6 |
| 3 | 8 |
| 4 | 13 |
| 5 | 2 |
| 6 | 15 |
| 7 | 4 |
| 8 | 10 |
| 9 | 11 |
| 10 | 16 |
| 11 | 9 |
| 12 | 7 |
| 13 | 14 |
| 14 | 12 |
| 15 | 1 |
| 16 | 5 |
10. You have summary data for two variables: how extroverted someone is (X) and how often someone volunteers (Y). Using these values, calculate the line of best fit predicting volunteering from extroversion, then test for a statistically significant relation using the hypothesis testing procedure: X̄ = 12.58, sX = 4.65, Ȳ = 7.44, sY = 2.12, r = 0.34, N = 67, SSM = 19.79, SSE = 215.77.
Answers to Odd-Numbered Exercises – Ch. 17
1. ANOVA and simple linear regression both take the total observed variance and partition it into pieces that we can explain and cannot explain and use the ratio of those pieces to test for significant relations. They are different in that ANOVA uses a categorical variable as a predictor whereas linear regression uses a continuous variable.
9. Our given SS values and our df from step 2 allow us to fill in the ANOVA table:
| Source | SS | df | MS | F |
| --- | --- | --- | --- | --- |
| Model | 2.65 | 1 | 2.65 | 0.11 |
| Error | 337.35 | 14 | 24.10 | |
| Total | 340.00 | 15 | | |
Step 4: Our obtained value was smaller than our critical value, so we fail to reject the null hypothesis. There is no evidence to suggest that draft rankings have any predictive value for final fantasy football rankings, F(1,14) = 0.11, p > .05