# Chapter 12: Repeated Measures t-test

So far, we have dealt with data measured on a single variable at a single point in time, allowing us to gain an understanding of the logic and process behind statistics and hypothesis testing. Now, we will look at a slightly different type of data that has new information we couldn’t get at before: change. Specifically, we will look at how the value of a variable, within people, changes across two timepoints. This is a very powerful thing to do, and, as we will see shortly, it involves only a very slight addition to our existing process and does not change the mechanics of hypothesis testing or formulas at all!

### Change and Differences

Researchers are often interested in change over time. Sometimes we want to see if change occurs naturally, and other times we are hoping for change in response to some manipulation. In each of these cases, we measure a single variable at different times, and what we are looking for is whether or not we get the same score at time 2 as we did at time 1. This is a repeated-measures research design, where a single group of individuals is obtained and each individual is measured in two treatment conditions that are then compared. The data consist of two scores for each individual, which means that all subjects participate in each treatment condition. Think of it like a pretest/posttest.

When we analyze data for a repeated-measures research design, we calculate the difference between the members of each pair of scores and then take the average of those differences. The absolute value of our measurements does not matter – all that matters is the change. If the average difference between scores in our sample is very large compared to the difference we would expect if the scores were drawn from the same population, then we will conclude that the individuals were selected from different populations.

Let’s look at an example:

| Before | After | Improvement |
|--------|-------|-------------|
| 6      | 9     | 3           |
| 7      | 7     | 0           |
| 4      | 10    | 6           |
| 1      | 3     | 2           |
| 8      | 10    | 2           |

Table 1. Raw and difference scores before and after training.

Table 1 shows scores on a quiz that five employees received before they took a training course and after they took the course. The difference between these scores (i.e. the score after minus the score before) represents improvement in the employees’ ability. This third column is what we look at when assessing whether or not our training was effective. We want to see positive scores, which indicate that the employees’ performance went up. What we are not interested in is how good they were before they took the training or after the training. Notice that the lowest scoring employee before the training (with a score of 1) improved just as much as the highest scoring employee before the training (with a score of 8), regardless of how far apart they were to begin with. There’s also one improvement score of 0, meaning that the training did not help this employee. An important factor in this is that the participants received the same assessment at both time points. To calculate improvement or any other difference score, we must measure only a single variable.
When looking at change scores like the ones in Table 1, we calculate our difference scores by taking the time 2 score and subtracting the time 1 score. That is:
The difference score formula: XD = XT2 − XT1. Note: XT2 is the score on the variable at time 2; XT1 is the score on the variable at time 1.
Where XD is the difference score (also noted simply as D), XT1 is the score on the variable at time 1, and XT2 is the score on the variable at time 2. The difference score, XD, will be the data we use to test for improvement or change. The sign of a difference score denotes only the direction of the change; it does not denote big or small, good or bad.
We subtract time 2 minus time 1 for ease of interpretation; if scores get better, then the difference score will be positive. Similarly, if we’re measuring something like reaction time or depression symptoms that we are trying to reduce, then better outcomes (lower scores) will yield negative difference scores.
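The difference-score calculation can be sketched in a few lines of Python, using the before/after quiz scores from Table 1:

```python
# Difference scores: subtract each time-1 score from its paired time-2 score.
# Data are the employee quiz scores from Table 1.
before = [6, 7, 4, 1, 8]
after = [9, 7, 10, 3, 10]

# D = XT2 - XT1 for each paired observation
differences = [t2 - t1 for t1, t2 in zip(before, after)]
print(differences)  # [3, 0, 6, 2, 2]
```

Note that the pairing is what matters here: `zip` keeps each employee's two scores linked, which is exactly what a repeated-measures design requires.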
We can also test to see if people who are matched or paired in some way agree on a specific topic. We call this a matched design. For example, we can see if a parent and a child agree on the quality of home life, or we can see if two romantic partners agree on how serious and committed their relationship is. In these situations, we also subtract one score from the other to get a difference score. This time, however, it doesn’t matter which score we subtract from the other because what we are concerned with is the agreement.

In both of these types of data, what we have are multiple scores on a single variable. That is, a single observation or data point is comprised of two measurements that are put together into one difference score. This is what makes the analysis of change unique – our ability to link these measurements in a meaningful way. This type of analysis would not work if we had two separate samples of people that weren’t related at the individual level, such as samples of people from different states that we gathered independently. Such datasets and analyses are the subject of the following chapter.

#### A rose by any other name…

It is important to point out that this form of t-test has been called many different things by many different people over the years: “matched pairs”, “paired samples”, “repeated measures”, “dependent measures”, “dependent samples”, and many others. What all of these names have in common is that they describe the analysis of two scores that are related in a systematic way within people or within pairs, which is what each of the datasets usable in this analysis have in common. As such, all of these names are equally appropriate, and the choice of which one to use comes down to preference. In this text, we will refer to paired samples, though the appearance of any of the other names throughout this chapter should not be taken to refer to a different analysis: they are all the same thing.

We are still working with t-tests. In chapter 11, we compared a sample to a population mean. For the t-tests in this chapter, we are comparing two groups of scores, yet both are from the same individuals. We call this a dependent t-test or a paired t-test. Think of it like having 2 cups of tea.

2 cups of tea for me: in a repeated-measures design, the same individuals are in both conditions for the t-test.

Now that we have an understanding of what difference scores are and know how to calculate them, we can use them to test hypotheses. As we will see, this works exactly the same way as testing hypotheses about one sample mean with a t-statistic. The only difference is in the format of the null and alternative hypotheses, where we focus on the difference score.

#### Hypotheses of Change and Differences for step 1

When we work with difference scores, our research questions have to do with change. Did scores improve? Did symptoms get better? Did prevalence go up or down? Our hypotheses will reflect this. Remember that the null hypothesis is the idea that there is nothing interesting, notable, or impactful represented in our dataset. In a paired samples t-test, that takes the form of ‘no change’. There is no improvement in scores or decrease in symptoms.

Thus, our null hypothesis is: H0: There is no change or difference H0: μD = 0
Let’s be clear: H0: μD = 0 does not say that everyone in the population will stay the same; it only says that, on average, the entire population will show a mean difference of 0. As with our other null hypotheses, we express the null hypothesis for paired samples t-tests in both words and mathematical notation. The exact wording of the written-out version should be changed to match whatever research question we are addressing (e.g. “There is no change in ability scores after training”). However, the mathematical version of the null hypothesis is always exactly the same: the average change score is equal to zero. Our population parameter for the average is still μ, but it now has a subscript D to denote the fact that it is the average change score and not the average raw observation before or after our manipulation. Obviously individual difference scores can go up or down, but the null hypothesis states that these positive or negative change values are just random chance and that the true average change score across all people is 0.
Our alternative hypotheses will also follow the same format that they did before: they can be directional if we suspect a change or difference in a specific direction, or we can use an inequality sign to test for any change:
HA: There is a change or difference HA: μD ≠ 0
HA: The average score increases HA: μD > 0
HA: The average score decreases HA: μD < 0

Just as before, your choice of which alternative hypothesis to use should be specified before you collect data, based on your research question and any evidence you might have that would indicate a specific directional (or non-directional) change. Additionally, it should be noted that a non-directional research/alternative hypothesis is the more conservative approach even when you have an expected direction for change.

#### Choosing a 1-tailed vs. 2-tailed test

How do you choose whether to use a one-tailed versus a two-tailed test? The two-tailed test is always going to be more conservative, so it’s always a good bet to use that one, unless you had a very strong prior reason for using a one-tailed test. In that case, you should have written down the hypothesis before you ever looked at the data. In Chapter 19, we will discuss the idea of pre-registration of hypotheses, which formalizes the idea of writing down your hypotheses before you ever see the actual data. You should never make a decision about how to perform a hypothesis test once you have looked at the data, as this can introduce serious bias into the results.

We do have to make one main assumption when we use the randomization test, which we refer to as exchangeability. This means that all of the observations are distributed in the same way, such that we can interchange them without changing the overall distribution. The main place where this can break down is when there are related observations in the data; for example, if we had data from individuals in 4 different families, then we couldn’t assume that individuals were exchangeable, because siblings would be closer to each other than they are to individuals from other families. In general, if the data were obtained by random sampling, then the assumption of exchangeability should hold.

#### Critical Values and Decision Criteria for step 2

As with before, once we have our hypotheses laid out, we need to find our critical values that will serve as our decision criteria. This step has not changed at all from the last chapter. Our critical values are based on our level of significance (still usually α = 0.05), the directionality of our test (one-tailed or two-tailed), and the degrees of freedom, which are still calculated as df = n – 1. Because this is a t-test like the last chapter, we will find our critical values on the same t-table using the same process of identifying the correct column based on our significance level and directionality and the correct row based on our degrees of freedom or the next lowest value if our exact degrees of freedom are not presented. After we calculate our test statistic, our decision criteria are the same as well: p < α or tobt > tcrit*.
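If software is handy, the critical value can be looked up directly instead of from a t-table. A minimal sketch, assuming SciPy is installed, for a two-tailed test at α = 0.05 with df = 40 (the row used in the example below):

```python
from scipy.stats import t

# Two-tailed critical value: put alpha/2 in each tail of the t distribution.
alpha = 0.05
df = 40  # degrees of freedom (n - 1), here matching the t-table row used later
t_crit = t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))  # 2.021, the same value the t-table gives
```

For a one-tailed test you would instead use `t.ppf(1 - alpha, df)`, since all of α goes into a single tail.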

#### Test Statistic for step 3

Our test statistic for our change scores follows exactly the same format as it did for our 1-sample t-test. In fact, the only difference is in the data that we use. For our change test, we first calculate a difference score as shown above. Then, we use those scores as the raw data in the same mean calculation, standard error formula, and t-statistic. Let’s look at each of these.

Mean Difference (top of t-formula): D̅ = ΣD / n, which can also be noted as MD = ΣXD / n. The mean difference score is calculated in the same way as any other mean: sum each of the individual difference scores and divide by the sample size.

Here we are using the subscript D to keep track of that fact that these are difference scores instead of raw scores; it has no actual effect on our calculation.

Using this, we calculate the standard deviation of the difference scores the same way as well:

Standard deviation for D (sD) and variance for D (sD2): sD = √(SS/(n − 1)) = √(Σ(D − D̅)2 / (n − 1)), which you may also see noted as √(Σ(XD − MD)2 / (n − 1)), where XD = D and D̅ = MD. Note: sD2 = sD × sD and sD = √sD2.

We will find the numerator, the Sum of Squares, using the same table format that we learned in chapter 3. Once we have our standard deviation, we can find the standard error:

Standard error of the mean differences (bottom of t-formula): sD̅ = sD / √n, which can also be noted as sMD = sD / √n. Note: you can calculate the standard error from the variance (√(sD2/n)) or from the standard deviation (sD/√n).

Finally, our test statistic t has the same structure as well:

t-test for paired samples: t = (D̅ − μD) / sD̅, where μD (the hypothesized mean difference) is expected to be 0 and is dropped from the calculation formula, leaving t = D̅ / sD̅ or t = MD / sMD. Note: both formulas are the same, with the mean noted as D̅ or MD and the estimated standard error noted as sD̅ or sMD.

Effect size: There are several different ways that effect size can be quantified, which depend on the nature of the data. One of the most common measures of effect size is known as Cohen’s d: d = MD / sD. Note: MD (or D̅) is the mean of the difference scores and sD is their standard deviation.

Another way to examine effect size is to report the explained variance for the treatment effect, in other words the percentage of variance accounted for by the treatment. This is known as r2: r2 = t2 / (t2 + df). Note: r2 is calculated when there is a reported effect (in other words, when the null is rejected). The df is the same df from step 2.
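The pieces above can be collected into one short function. This is a sketch that implements the chapter's formulas directly (the function name and the returned dictionary keys are just illustrative choices):

```python
import math

def paired_t(differences, mu_d=0.0):
    """Paired-samples t statistic and effect sizes from difference scores.

    Implements the chapter's formulas: D-bar = sum(D)/n, sD = sqrt(SS/(n-1)),
    standard error = sD/sqrt(n), t = (D-bar - mu_D)/SE, d = D-bar/sD,
    and r2 = t^2/(t^2 + df).
    """
    n = len(differences)
    d_bar = sum(differences) / n                      # mean difference
    ss = sum((x - d_bar) ** 2 for x in differences)   # sum of squares
    s_d = math.sqrt(ss / (n - 1))                     # standard deviation of D
    se = s_d / math.sqrt(n)                           # standard error of D-bar
    t_stat = (d_bar - mu_d) / se
    df = n - 1
    return {"mean": d_bar, "sd": s_d, "se": se, "t": t_stat, "df": df,
            "cohens_d": d_bar / s_d,
            "r2": t_stat ** 2 / (t_stat ** 2 + df)}

# Using the improvement scores from Table 1:
result = paired_t([3, 0, 6, 2, 2])
print(round(result["t"], 2), result["df"])  # 2.65 4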

As we can see, once we calculate our difference scores from our raw measurements, everything else is exactly the same. Let’s see an example.

### Example: Increasing Satisfaction at Work

Workers at a local company have been complaining that working conditions have gotten very poor, hours are too long, and they don’t feel supported by the management. The company hires a consultant to come in and help fix the situation before it gets so bad that the employees start to quit. The consultant first assesses the level of job satisfaction of 49 employees as part of focus groups used to identify specific changes that might help. The company institutes some of these changes, and six months later the consultant returns to measure job satisfaction again. Knowing that some interventions miss the mark and can actually make things worse, the consultant tests for a difference in either direction (i.e., an increase or a decrease in average job satisfaction) at the α = 0.05 level of significance.
#### Step 1: State the Hypotheses
In this case, we are hoping that the changes we made will improve employee satisfaction, and, because we based the changes on employee recommendations, we have good reason to believe that they will. However, we will take a conservative approach and use a two-tailed alternative hypothesis.
Thus, we state our null and alternative hypotheses as
H0: There is no change in average job satisfaction H0: μD = 0
HA: There is a change in average job satisfaction HA: μD ≠ 0
#### Step 2: Find the Critical Value
Our critical values will once again be based on our level of significance, which we know is α = 0.05, the directionality of our test, which is two-tailed, and our degrees of freedom. For our dependent-samples t-test, the degrees of freedom are still given as df = n – 1. For this problem, we have 49 people, so our degrees of freedom are 48.  Our table does not have 48, so we go with the closest lower value (40). Going to our t-table, we find that the critical value is t* = 2.021. As shown in Figure 1, the cut off or critical value helps with decision making in step 4.
Figure 1. Critical region for two-tailed t-test at α = 0.05
#### Step 3: Calculate the Test Statistic
Now that the criteria are set, it is time to calculate the test statistic. The data obtained by the consultant found that the difference scores from time 1 to time 2 had a mean of MD or D̅ = 2.96 and a standard deviation of sD = 2.85. Using this information, plus the size of the sample (n = 49), we first calculate the standard error:
Plugging in the values, we get sD̅ = 2.85/√49 = 2.85/7 = 0.41
Now, we can put that value, sD̅ = 0.41, along with our sample mean (D̅ = 2.96), into the formula for t and calculate the test statistic:
t = 2.96/0.41 = 7.22
Notice that, because the null hypothesis value of a dependent samples t-test is always 0, we can simply divide our obtained sample mean by the standard error.
#### Step 4: Make a Decision
We have obtained a test statistic of t = 7.22 that we can compare to our previously established critical value of t* = 2.021. 7.22 is larger than 2.021, so t > t* and we reject the null hypothesis:
Reject H0. Based on the sample data from 49 workers, we can say that the intervention statistically significantly improved job satisfaction (D̅ = 2.96) among the workers, t(48) = 7.22, p < 0.05.
Because this result was statistically significant, we will want to calculate Cohen’s d as an effect size using the same format as we did for the last t-test:
where MD or D̅ = 2.96 and the standard deviation sD = 2.85. Plugging in the values, we get d = 2.96/2.85 = 1.04, which is a large effect size. We could also calculate r2 for effect size:
where t2 = 7.22 × 7.22 = 52.13 and df = 48. Plugging in, r2 = 52.13/(52.13 + 48) = .52. This can be interpreted as 52% of the variance in worker job satisfaction being due to the changes the company made.
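The arithmetic in this example can be reproduced from the summary statistics alone. One small caveat, noted in the comments: working without intermediate rounding gives t = 7.27 rather than the 7.22 obtained above from the rounded standard error, which does not change the decision:

```python
import math

# Consultant's summary data from the worked example.
n, mean_d, sd_d = 49, 2.96, 2.85

se = sd_d / math.sqrt(n)   # 2.85 / 7 = 0.41 (rounded)
t_obt = mean_d / se        # null value mu_D = 0 drops out of the numerator
# t comes out 7.27 here; the chapter's 7.22 uses the SE rounded to 0.41 first.
cohens_d = mean_d / sd_d
r2 = t_obt ** 2 / (t_obt ** 2 + (n - 1))
print(round(se, 2), round(t_obt, 2), round(cohens_d, 2), round(r2, 2))
```

Either way, t far exceeds the critical value of 2.021, so the conclusion is identical.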

Hopefully the above example made it clear that running a dependent samples t-test to look for differences before and after some treatment works exactly the same way as a regular 1-sample t-test does from chapter 11 (which was just a small change in how z-tests were performed in chapter 10). At this point, this process should feel familiar, and we will continue to make small adjustments to this familiar process as we encounter new types of data to test new types of research questions.

#### Confidence Intervals

Last chapter, CI = X̅ ± t(sX̅), but now the mean is the mean difference (D̅ or MD) and s/√n becomes sD̅. Our adjusted CI formula for the paired or dependent t-test:

CI = D̅ ± t(sD̅)

Note: We still calculate an upper bound and a lower bound value, and t is still the critical value t*. The CI formula is very similar, just using the notation for the standard error of the mean differences. The CI is still notated as CI = (LB, UB).

### Example with Confidence Interval Hypothesis Testing: Bad Press

Let’s say that a bank wants to make sure that their new commercial will make them look good to the public, so they recruit 7 people to view the commercial as a focus group. The focus group members fill out a short questionnaire about how they view the company, then watch the commercial and fill out the same questionnaire a second time. The bank really wants to find significant results, so they test for a change at α = 0.05. However, they use a 2-tailed test since they know that past commercials have not gone over well with the public, and they want to make sure the new one does not backfire. They decide to test their hypothesis using a confidence interval to see just how spread out the opinions are. As we will see, confidence intervals work the same way as they did before, just like with the test statistic.

#### Step 1: State the Hypotheses

As always, we start with hypotheses, and with a confidence interval hypothesis test, we must use a 2-tailed test.

H0: There is no change in how people view the bank H0: μD = 0

HA: There is a change in how people view the bank HA: μD ≠ 0

#### Step 2: Find the Critical Values

Just like with our regular hypothesis testing procedure, we will need critical values from the appropriate level of significance and degrees of freedom in order to form our confidence interval. Because we have 7 participants, our degrees of freedom are df = 6. From our t-table, we find that the critical value corresponding to this df at this level of significance is t* = 2.447.

#### Step 3: Calculate the Confidence Interval

The data collected before (time 1) and after (time 2) the participants viewed the commercial are presented in Table 2. In order to build our confidence interval, we will first have to calculate the mean and standard deviation of the difference scores, which are also in Table 2. As a reminder, the difference scores (D) are calculated as Time 2 – Time 1.

| Time 1 | Time 2 | D  |
|--------|--------|----|
| 3      | 2      | -1 |
| 3      | 6      | 3  |
| 5      | 3      | -2 |
| 8      | 4      | -4 |
| 3      | 9      | 6  |
| 1      | 2      | 1  |
| 4      | 5      | 1  |

Table 2. Opinions of the bank before and after viewing the commercial.

The mean of the difference scores is: D̅ = ΣD/n = 4/7 = 0.57

The standard deviation will be solved by first using the Sum of Squares Table:

| D     | D – D̅  | (D – D̅)2          |
|-------|--------|-------------------|
| -1    | -1.57  | 2.46              |
| 3     | 2.43   | 5.90              |
| -2    | -2.57  | 6.60              |
| -4    | -4.57  | 20.88             |
| 6     | 5.43   | 29.48             |
| 1     | 0.43   | 0.18              |
| 1     | 0.43   | 0.18              |
| Σ = 4 | Σ = 0  | Σ = 65.68 (our SS) |

sD = √(SS/df), where SS = 65.68 and df = n − 1 = 7 − 1 = 6

sD = √(65.68/6) = √10.94 = 3.308
Finally, we find the standard error (sD̅) by taking sD = 3.308 and n = 7.
sD̅ = 3.308/√7 = 1.25
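These hand calculations can be checked with a short script working straight from the raw difference scores. A sketch (working unrounded, so the standard deviation comes out 3.31 rather than the 3.308 obtained above from the rounded SS):

```python
import math

# Difference scores (Time 2 - Time 1) for the seven focus-group members.
d = [-1, 3, -2, -4, 6, 1, 1]
n = len(d)

mean_d = sum(d) / n                     # D-bar = 4/7
ss = sum((x - mean_d) ** 2 for x in d)  # sum of squares of deviations
sd = math.sqrt(ss / (n - 1))            # standard deviation of D
se = sd / math.sqrt(n)                  # standard error of D-bar
print(round(mean_d, 2), round(sd, 2), round(se, 2))  # 0.57 3.31 1.25
```

The small discrepancy in the standard deviation is pure rounding; it has no effect on the interval or the decision.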
We now have all the pieces needed to compute our confidence interval:
95% CI = D̅ ± t*(sD̅)
Upper Bound (UB) = 0.57 + 2.447(1.25) = 0.57 + 3.06 = 3.63
Lower Bound (LB) = 0.57 − 2.447(1.25) = 0.57 − 3.06 = −2.49
95% CI = (LB, UB) = (−2.49, 3.63)

#### Step 4: Make the Decision

Remember that the confidence interval represents a range of values that seem plausible or reasonable based on our observed data. The interval spans −2.49 to 3.63, which includes 0, our null hypothesis value. Because the null hypothesis value is in the interval, it is considered a reasonable value, and because it is a reasonable value, we have no evidence against it. We fail to reject the null hypothesis.

Fail to Reject H0. Based on our focus group of 7 people, we cannot say that the average change in opinion (D̅ = 0.57) was any better or worse after viewing the commercial, CI: (−2.49, 3.63).
It is optional to calculate effect size. Performing Cohen’s d = D̅/sD = 0.57/3.308 = 0.17 indicates a very small effect, which, given the very small sample size, suggests a possible Type II error. As before, we report the confidence interval to indicate how we performed the test.
### Assumptions of the paired t-test

Assumptions are conditions that must be met in order for our hypothesis testing conclusion to be valid. [Important: If the assumptions are not met then our hypothesis testing conclusion is not likely to be valid. Testing errors can still occur even if the assumptions for the test are met.]

Recall that inferential statistics allow us to make inferences (decisions, estimates, predictions) about a population based on data collected from a sample. Recall also that an inference about a population is true only if the sample studied is representative of the population. A statement about a population based on a biased sample is not likely to be true.

Assumption 1: Individuals in the sample were selected randomly and independently, so the sample is highly likely to be representative of the larger population.

•        Random sampling ensures that each member of the population is equally likely to be selected.

•        An independent sample is one which the selection of one member has no effect on the selection of any other.

Assumption 2: The distribution of sample differences (DSD) is normal, because we drew the samples from a population that was normally distributed.

• This assumption is very important because we are estimating probabilities using the t-table, which provides accurate estimates of probabilities for events distributed normally.

Assumption 3: Sampled populations have equal variances or have homogeneity of variance.

Advantages. Repeated measures designs reduce the probability of Type I errors compared with independent sample designs because repeated measures t-tests reduce the probability that we will get a statistically significant difference that is due to an extraneous variable that differed between groups by chance (i.e., due to some factor other than the one in which we are interested).

Repeated measure designs are also more powerful (sensitive) than independent sample designs because two scores from each person are compared so each person serves as his or her own control group (we analyze the difference between scores). A special type of repeated measures design is known as the matched pairs design. If we are designing a study and suspect that there are important factors that could differ between our groups even if we randomly select and assign subjects, then we may use this type of design.

Because the members of a matched pair are similar to each other, there is a greater likelihood of our statistical test finding an “effect” when one is present (power) in a repeated samples design as compared to a two-independent-samples design (in which subjects for the two groups are picked randomly and independently – not matched on any traits).

Disadvantages. Repeated measures t-tests are very sensitive to outside influences and treatment influences. Outside influences refers to factors outside of the experiment that may interfere with testing an individual across treatments/trials. Examples include the mood, health, or motivation of the individual participants. Think about it: if a participant tries really hard during the pretest but does not try very hard during the posttest, these differences can create problems later when analyzing the data.

Treatment influences refers to events that happen within the testing experience that interfere with how the data are collected. The three most common treatment influences are: 1. practice effects, 2. fatigue effects, and 3. order effects.

A practice effect is present when participants perform a task better in later conditions because they have had a chance to practice it. Another type is a fatigue effect, where participants perform a task worse in later conditions because they become tired or bored. Order effects refer to differences in research participants’ responses that result from the order (e.g., first, second, third) in which the experimental materials are presented to them.

Imagine, for example, that participants judge the guilt of an attractive defendant and then judge the guilt of an unattractive defendant. If they judge the unattractive defendant more harshly, this might be because of his unattractiveness. But it could be instead that they judge him more harshly because they are becoming bored or tired. In other words, the order of the conditions is a confounding variable. The attractive condition is always the first condition and the unattractive condition the second. Thus any difference between the conditions in terms of the dependent variable could be caused by the order of the conditions and not the independent variable itself.

There is a solution to the problem of order effects, however, that can be used in many situations. It is counterbalancing, which means testing different participants in different orders. For example, some participants would be tested in the attractive defendant condition followed by the unattractive defendant condition, and others would be tested in the unattractive condition followed by the attractive condition. With three conditions, there would be six different orders (ABC, ACB, BAC, BCA, CAB, and CBA), so some participants would be tested in each of the six orders. With counterbalancing, participants are assigned to orders randomly, using the techniques we have already discussed. Thus random assignment plays an important role in within-subjects designs just as in between-subjects designs. Here, instead of randomly assigning to conditions, they are randomly assigned to different orders of conditions. In fact, it can safely be said that if a study does not involve random assignment in one form or another, it is not an experiment.
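The counterbalancing scheme described above is easy to sketch in code. Here the condition labels and participant IDs are hypothetical, just to show the six orders and random assignment:

```python
import itertools
import random

# All possible orders of three conditions A, B, C for counterbalancing.
orders = [''.join(p) for p in itertools.permutations('ABC')]
print(orders)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# Randomly assign each (hypothetical) participant to one of the orders.
random.seed(0)  # fixed seed only so this sketch is reproducible
participants = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6']
assignments = {p: random.choice(orders) for p in participants}
```

In a real study you would typically balance the orders across participants (e.g., equal numbers per order) rather than sampling each independently, but the principle – random assignment to orders rather than to conditions – is the same.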

Because the repeated-measures design requires that each individual participate in more than one treatment, there is always the risk that exposure to the first treatment will cause a change in the participants that influences their scores in the second treatment in ways that have nothing to do with the intervention. For example, if students are given the same test before and after an intervention, improvement on the posttest might occur because the students got practice taking the test, not because the intervention was successful.

### Learning Objectives

Having read this chapter, a student should be able to:

• identify when it is appropriate to calculate a paired or dependent t-test
• perform a hypothesis test using the paired or dependent t-test
• compute and interpret effect size for a paired or dependent t-test
• list the assumptions for running a paired or dependent t-test

### Exercises – Ch. 12

1. What is the difference between a 1-sample t-test and a dependent-samples t-test? How are they alike?
2. Name 3 research questions that could be addressed using a dependent-samples t-test.
3. What are difference scores and why do we calculate them?
4. Why is the null hypothesis for a dependent-samples t-test always μD = 0?
5. A researcher is interested in testing whether explaining the processes of statistics helps increase trust in computer algorithms. He wants to test for a difference at the α = 0.05 level and knows that some people may trust the algorithms less after the training, so he uses a two-tailed test. He gathers pre-post data from 35 people and finds that the average difference score is 12.10 with a standard deviation of s = 17.39. Conduct a hypothesis test to answer the research question.
6. Decide whether you would reject or fail to reject the null hypothesis in the following situations:
1. X̄D = 3.50, s = 1.10, n = 12, α = 0.05, two-tailed test
2. 95% CI = (0.20, 1.85)
3. t = 2.98, t* = -2.36, one-tailed test to the left
4. 90% CI = (-1.12, 4.36)
7. Calculate difference scores for the following data:
| Time 1 | Time 2 | XD |
|:------:|:------:|:--:|
| 61 | 83 | |
| 75 | 89 | |
| 91 | 98 | |
| 83 | 92 | |
| 74 | 80 | |
| 82 | 88 | |
| 98 | 98 | |
| 82 | 77 | |
| 69 | 88 | |
| 76 | 79 | |
| 91 | 91 | |
| 70 | 80 | |

8. You want to know if an employee’s opinion about an organization is the same as the opinion of that employee’s boss. You collect data from 18 employee-supervisor pairs and code the difference scores so that positive scores indicate that the employee has a higher opinion and negative scores indicate that the boss has a higher opinion (meaning that difference scores of 0 indicate no difference and complete agreement). You find that the mean difference score is X̄D = -3.15 with a standard deviation of sD = 1.97. Test this hypothesis at the α = 0.01 level.

9. Construct confidence intervals from a mean = 1.25, standard error of 0.45, and df = 10 at the 90%, 95%, and 99% confidence level. Describe what happens as confidence changes and whether to reject H0.

10. A professor wants to see how much students learn over the course of a semester. A pre-test is given before the class begins to see what students know ahead of time, and the same test is given at the end of the semester to see what students know at the end. The data are below. Test for an improvement at the α = 0.05 level. Did scores increase? How much did scores increase?

| Pretest | Posttest | XD |
|:-------:|:--------:|:--:|
| 90 | 8 | |
| 60 | 66 | |
| 95 | 99 | |
| 93 | 91 | |
| 95 | 100 | |
| 67 | 64 | |
| 89 | 91 | |
| 90 | 95 | |
| 94 | 95 | |
| 83 | 89 | |
| 75 | 82 | |
| 87 | 92 | |
| 82 | 83 | |
| 82 | 85 | |
| 88 | 93 | |
| 66 | 69 | |
| 90 | 90 | |
| 93 | 100 | |
| 86 | 95 | |
| 91 | 96 | |

### Answers to Odd-Numbered Exercises – Ch. 12

1. A 1-sample t-test uses raw scores to compare an average to a specific value. A dependent-samples t-test uses two raw scores from each person to calculate difference scores and test for an average difference score that is equal to zero. The calculations, steps, and interpretation are exactly the same for each.

3. Difference scores indicate change or discrepancy relative to a single person or pair of people. We calculate them to eliminate individual differences in our study of change or agreement.
5. Step 1: H0: μD = 0, “The average change in trust of algorithms is 0”; HA: μD ≠ 0, “People’s opinions of how much they trust algorithms change.”
Step 2: Two-tailed test, df = 34, t* = 2.032.
Step 3: X̄D = 12.10, standard error = 2.94, t = 4.12.
Step 4: t > t*, Reject H0. Based on opinions from 35 people, we can conclude that people trust algorithms more (X̄D = 12.10) after learning statistics, t(34) = 4.12, p < .05. Since the result is significant, we need an effect size: Cohen’s d = 0.70, which is a moderate-to-large effect.
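As a check on Step 3 and the effect size, the arithmetic can be reproduced in a few lines of Python; all numbers come directly from the exercise, and nothing else is assumed:

```python
import math

# Exercise 5: n = 35 people, mean difference 12.10, standard deviation 17.39.
n, mean_d, sd = 35, 12.10, 17.39

se = sd / math.sqrt(n)  # standard error of the mean difference score
t = mean_d / se         # obtained t statistic
d = mean_d / sd         # Cohen's d for a dependent-samples t-test

print(f"{se:.2f} {t:.2f} {d:.2f}")  # 2.94 4.12 0.70
```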

7. See the last column of the table below.

| Time 1 | Time 2 | D or XD |
|:------:|:------:|:-------:|
| 61 | 83 | 22 |
| 75 | 89 | 14 |
| 91 | 98 | 7 |
| 83 | 92 | 9 |
| 74 | 80 | 6 |
| 82 | 88 | 6 |
| 98 | 98 | 0 |
| 82 | 77 | -5 |
| 69 | 88 | 19 |
| 76 | 79 | 3 |
| 91 | 91 | 0 |
| 70 | 80 | 10 |
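Each difference score is just Time 2 minus Time 1, so the column can be verified with a short Python snippet:

```python
time1 = [61, 75, 91, 83, 74, 82, 98, 82, 69, 76, 91, 70]
time2 = [83, 89, 98, 92, 80, 88, 98, 77, 88, 79, 91, 80]

# Difference score for each person: Time 2 minus Time 1.
diffs = [t2 - t1 for t1, t2 in zip(time1, time2)]
print(diffs)  # [22, 14, 7, 9, 6, 6, 0, -5, 19, 3, 0, 10]
```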

9. At the 90% confidence level, t* = 1.812 and CI = (0.43, 2.07) so we reject H0. At the 95% confidence level, t* = 2.228 and CI = (0.25, 2.25) so we reject H0. At the 99% confidence level, t* = 3.169 and CI = (-0.18, 2.68) so we fail to reject H0. As the confidence level goes up, our interval gets wider (which is why we have higher confidence), and eventually we do not reject the null hypothesis because the interval is so wide that it contains 0.
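These intervals can be reproduced in Python. This is a sketch that assumes SciPy is available to look up the t critical values; the mean, standard error, and df come from the exercise:

```python
from scipy import stats

mean, se, df = 1.25, 0.45, 10

for conf in (0.90, 0.95, 0.99):
    # Two-tailed critical value: put half of (1 - conf) in each tail.
    t_star = stats.t.ppf(1 - (1 - conf) / 2, df)
    lower, upper = mean - t_star * se, mean + t_star * se
    print(f"{conf:.0%}: t* = {t_star:.3f}, CI = ({lower:.2f}, {upper:.2f})")
```

Running this reproduces the three intervals above and shows directly why the 99% interval, being the widest, is the one that captures 0.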