UNDERSTANDING STATISTICS IN THE BEHAVIORAL SCIENCES 10TH EDITION BY PAGANO – TEST BANK
CHAPTER 6
Correlation
LEARNING OBJECTIVES
After completing Chapter 6, students should be able to:
- Define, recognize graphs of, and distinguish between the following: linear and curvilinear relationships, positive and negative relationships, direct and inverse relationships, perfect and imperfect relationships.
- Specify the equation of a straight line, and understand the concepts of slope and intercept.
- Define scatter plot, correlation coefficient, and Pearson r.
- Compute the value of Pearson r, and state the assumptions underlying Pearson r.
- Define the coefficient of determination (r^{2}); specify and explain an important use of r^{2}.
- List three correlation coefficients other than Pearson r, and specify the factors that determine which correlation coefficient to use; specify the effects on correlation of range and of an extreme score.
- Compute the value of Spearman rho (r_{s}) and specify the scaling of the variables appropriate for its use.
- Explain why correlation does not imply causation.
- Understand the illustrative examples, do the practice problems and understand the solutions.
DETAILED CHAPTER SUMMARY
- Relationships
- Linear Relationships.
- Scatter plots. A scatter plot is a graph of paired X (one variable score) and Y (another variable score) values. By visually examining the graph one can get a good idea of the nature of the relationship between the two variables (i.e., linear or not).
- Definitions.
- Linear relationship. A linear relationship between two variables is one that can be accurately represented by a straight line.
- Curvilinear relationship. When a curved line fits a set of points better than a straight line, the relationship is called a curvilinear association or relationship.
- Straight Line Equation.
- General equation.
Y = bX + a
where a = the Y intercept and b = the slope of the line.
- Slope of the straight line equation (b). The slope tells us how much the Y score changes for each unit change in the X score. The slope is a constant value. In equation form:
b = slope = (Y2 – Y1)/(X2 – X1) .
- Y intercept (a). The Y intercept is the value of Y where the line intersects the Y axis. It is the value of Y when X = 0.
- Positive relationships. This indicates that there is a direct relationship between the variables. Higher values of X are associated with higher values of Y and vice versa.
- Negative relationships. This exists when there is an inverse relationship between X and Y. Low values of X are associated with high values of Y and vice versa.
- Perfect relationship. This occurs when all the pairs of points fall on a straight line.
- Imperfect relationships. This is when a positive or negative relationship exists but not all of the points fall on the line.
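As a minimal sketch of the slope and intercept definitions above, the following Python function recovers b and a from two points on a line. The function name `line_through` and the sample points are illustrative assumptions, not taken from the textbook.

```python
def line_through(p1, p2):
    """Return (b, a) for the line Y = bX + a passing through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("vertical line: the slope is undefined")
    b = (y2 - y1) / (x2 - x1)  # change in Y per unit change in X
    a = y1 - b * x1            # value of Y when X = 0
    return b, a


# Hypothetical points (1, 5) and (3, 9)
b, a = line_through((1, 5), (3, 9))
print(b, a)  # 2.0 3.0
```

A positive b corresponds to a positive (direct) relationship and a negative b to a negative (inverse) relationship.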
- Correlation Concepts
- Definition. Correlation is a measure of the direction and degree of relationship that exists between two variables.
- Correlation coefficient. Expresses quantitatively the magnitude and direction of the correlation.
- Range. Can range from +1 to -1.
- Sign. The sign of the coefficient tells us whether the relationship is positive or negative.
- Magnitude. The coefficient ranges from +1 to -1. Plus 1 is a perfect positive correlation, and minus 1 expresses a perfect negative relationship. A zero value of the correlation coefficient means there is no relationship between the two variables. Imperfect relationships vary between 0 and 1. They will be plus or minus depending on the direction of the relationship.
III. Pearson r
- Definition.
Pearson r is a measure of the extent to which paired scores occupy the same position within their own distributions. Standard scores allow us to examine the relative positions of variables independent of the units of measure.
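One way to see why standard scores make r independent of the units of measure is to compute r directly from z scores. The sketch below assumes the common z-score form r = Σz_{X}z_{Y}/(N – 1), with z scores based on the sample standard deviation. The data and function names are made up for illustration.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)  # stdev divides by N - 1
    return [(v - m) / s for v in values]


def pearson_r_from_z(xs, ys):
    """Pearson r as the sum of z-score products divided by N - 1."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


# Hypothetical heights and weights; heights are then re-expressed in cm
heights_in = [60, 62, 65, 70]
weights_lb = [110, 120, 140, 150]
heights_cm = [h * 2.54 for h in heights_in]

r_inches = pearson_r_from_z(heights_in, weights_lb)
r_cm = pearson_r_from_z(heights_cm, weights_lb)
print(round(r_inches, 4) == round(r_cm, 4))  # True: changing units leaves r unchanged
```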
- Calculating r.
- Computational formula from raw scores:
r = [ΣXY – (ΣX)(ΣY)/N] / \sqrt{[ΣX^{2} – (ΣX)^{2}/N][ΣY^{2} – (ΣY)^{2}/N]}
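The raw-score computation can be sketched in Python as follows. The sketch assumes the standard raw-score form of Pearson r, in which the numerator is ΣXY – (ΣX)(ΣY)/N and the denominator is the square root of [ΣX^{2} – (ΣX)^{2}/N][ΣY^{2} – (ΣY)^{2}/N]. The function name and the small data set are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson r from raw scores using the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    if denominator == 0:
        raise ValueError("r is undefined when X or Y has zero variability")
    return numerator / denominator


# Hypothetical data: sums are ΣX = 15, ΣY = 20, ΣXY = 66, ΣX² = 55, ΣY² = 86
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 2))  # 0.77
```

Working from the five sums mirrors the hand calculation: the numerator is 66 – 60 = 6 and the denominator is the square root of 10 × 6.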
- Additional Interpretation for Pearson r
- Variability of Y. Pearson r can also be interpreted in terms of the variability of Y accounted for by X.
- r = 0. Where r = 0, knowledge of X does not help us predict Y. The best prediction of Y when r = 0 is the mean of Y, \bar{Y}.
- Deviation of Y_{i}. The distance between a given score Y_{i} and the mean of the Y scores is divisible into two parts.
Deviation of Y_{i} = Error in prediction + deviation of Y_{i} accounted for by X
(Y_{i} – \bar{Y}) = (Y_{i} – Y’) + (Y’ – \bar{Y})
- Total variability.
Σ(Y_{i} – \bar{Y})^{2} = Σ(Y_{i} – Y’)^{2} + Σ(Y’ – \bar{Y})^{2}
As the relationship between X and Y gets stronger, the prediction error gets smaller, causing Σ(Y_{i} – Y’)^{2} to decrease and Σ(Y’ – \bar{Y})^{2} to increase.
- New definition of r. Pearson r equals the square root of the proportion of the total variability of Y accounted for by X.
- Explained Variability. The explained variability = r^{2}. For example, if r = .7 then .49 or 49% of the variability of Y is accounted for by X. This is called the explained variability. If X is causal with respect to Y, r^{2} is also a measure of the size of the effect.
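The partition of the variability of Y can be checked numerically. The sketch below is an illustration under stated assumptions: the predicted scores Y’ come from the least-squares line of Y on X (covered in Chapter 7), and the data and function name are hypothetical.

```python
from statistics import mean


def variability_partition(xs, ys):
    """Return (total, error, explained) sums of squares for Y predicted from X."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Least-squares slope and intercept for predicting Y from X
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = mean_y - b * mean_x
    predicted = [b * x + a for x in xs]

    total = sum((y - mean_y) ** 2 for y in ys)                       # Σ(Y - mean)²
    error = sum((y - yp) ** 2 for y, yp in zip(ys, predicted))       # Σ(Y - Y')²
    explained = sum((yp - mean_y) ** 2 for yp in predicted)          # Σ(Y' - mean)²
    return total, error, explained


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
total, error, explained = variability_partition(x, y)
r_squared = explained / total
print(round(total, 2), round(error, 2), round(explained, 2), round(r_squared, 2))
# 6 2.4 3.6 0.6
```

Here 60% of the variability of Y is accounted for by X, so r is the square root of 0.60, or about 0.77.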
- Other correlation coefficients besides r
- Eta (η). This is used for curvilinear relationships where Pearson r would underestimate the degree of relationship.
- Biserial correlation coefficient. Used when one variable is measured on an interval scale and the other variable is dichotomous.
- Phi (φ) coefficient. Used when both variables are dichotomous.
- Spearman rank order correlation coefficient (r_{s}), also called rho. Used when one or both variables are of ordinal scaling.
- Computational equation.
r_{s} = 1 – 6ΣD_{i}^{2} / (N^{3} – N)
where D_{i} = difference between ith pair of ranks
N = number of pairs of ranks
- Uses. When the data are not of either interval or ratio scaling but are of ordinal scaling, r_{s} can be used.
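A short sketch of the rank-difference computation follows. It assumes the standard formula r_{s} = 1 – 6ΣD_{i}^{2}/(N^{3} – N) and that the scores are already expressed as ranks with no ties. The function name and the ranks are hypothetical.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    # D_i is the difference between the i-th pair of ranks
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Hypothetical ranks from two raters: ΣD² = 4, N = 5
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0, matching the range of Pearson r.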
- Correlation and Causation
- Causation. Correlation between X and Y does not prove causation.
- Explanations for correlation between X and Y.
- Correlation may be spurious
- X causes Y
- Y causes X
- Third variable causes the correlation between X and Y
- Role of experimentation. To establish that one variable is the cause of another, an experiment must be conducted by systematically varying only the causal variable and then measuring the effect on the other variable.
TEACHING SUGGESTIONS AND COMMENTS
This is a long chapter. However, it is usually well understood by students without your having to work too hard. Nevertheless, I suggest you stick to the examples in the textbook and follow the chapter material in your lectures. Students find calculating r from the computational equation given on p. 133 a little daunting, but if you go over a problem with them in class (the hypothetical problem introduced for this purpose on p. 134, which uses a small N and simple numbers, seems to do the job reasonably well) and they work a couple of homework problems, they get used to the equation and it no longer seems difficult, just a lot of work. This is where the SPSS example at the end of the chapter illustrates how convenient SPSS is for doing calculations. I suggest going over the SPSS example and making this point.
I think the most difficult material in the chapter is the section "A second interpretation for Pearson r," given on p. 137. For this material, I recommend that your lecture follow the textbook closely, using a visual (transparency, etc.) of Figure 6.9 as you lecture. There are two main points in that section. The first is to understand the derivation and, from that, to understand that r can be defined as the square root of the proportion of the variability of Y that is accounted for by X. This leads to the second point, namely that it is r^{2}, not r, that indicates how important X might be as a causal factor in determining Y. To emphasize this point, I recommend showing Table 6.6 and pointing out how little of the variability of Y a correlation of 0.50 accounts for (r^{2} = 0.25, or 25%). This is important because a major part of science is to determine causality, and the uninitiated can be misled by using r instead of r^{2} to gauge the importance of a given variable.
I do not include the section on the Spearman rank order correlation coefficient rho (r_{s}), p. 141, in the course. There is not enough time for it, and in my opinion, it is not used in practice frequently enough to give it priority time. I cover everything else in the chapter.
DISCUSSION QUESTIONS
- It is sometimes said that the higher the correlation between two variables, the more likely the relationship is causal. Do you think this is correct? Discuss.
- If two variables are measured on different scales with different units, e.g., the number of crimes committed and phases of the moon, how is it possible to derive a number, like a correlation coefficient, that expresses the relationship between the two variables? How is this done to produce the Pearson r correlation coefficient? Explain.
- John has noticed that people seem happier in the summer than in the winter and concludes that this is because most people take their vacations in the summer. Is John justified in drawing this conclusion based on this reason? Discuss.
- Assume you are an educational psychologist and you believe that there is a relationship between socioeconomic factors and academic performance. You are planning to conduct a study to investigate this belief. If such a relationship really does exist, to demonstrate the relationship would you be better off to collect data only on students from the wealthiest families or on students from all levels of the socioeconomic range? Discuss.
- Assume you are a wealthy philanthropist. You want to contribute to help the children of single parents do better in elementary school. You enlist the aid of an educator who advises you to fund an organization that will provide tutors to help the children learn. Pilot work has shown a correlation of 0.30 between tutoring and increased performance in elementary school. Would you follow the advice of the educator? Discuss.
TEST QUESTIONS
Multiple Choice
- Correlation and regression differ in that _________.
- correlation is primarily concerned with the size and direction of relationships
- regression is primarily used for prediction
- both a and b are true
- neither a nor b are true
- A scatter plot _________.
- has to do with electron scatter
- is a graph of paired X and Y values
- must be linear
- is a frequency graph of X values
- If a relationship is linear, _________.
- the relation can be most accurately represented by a straight line
- all the points fall on a curved line
- the relationship is best represented by a curved line
- all the points must fall on a straight line
- In the equation Y = bX + a, X and Y are _________.
- constants
- statistics
- population parameters
- variables
- In the equation Y = bX + a, b is _________.
- a constant
- the slope of the line
- the Y axis intercept
- a and b
- a variable
- In the equation Y = bX + a, a is _________.
- a constant giving the value of the Y axis intercept
- a constant giving the value of the slope of the line
- a variable relating X to Y
- a variable relating Y to X
- In a positive relationship, _________.
- as X increases, Y increases
- as X decreases, Y decreases
- a and b
- as X increases, Y decreases
- In a negative relationship, _________.
- as X increases, Y increases
- as X decreases, Y decreases
- a and b
- as X increases, Y decreases
- In a positive relationship, _________.
- b is negative
- b is positive
- a must be positive
- a must be negative
- In a negative relationship, _________.
- b is positive
- b can be either positive or negative
- a must be negative
- b is negative
- In a perfect relationship, _________.
- all the points fall on the line
- none of the points fall on the line
- some of the points fall on the line
- the points form an ellipse around the line
- In an imperfect relationship, _________.
- all the points fall on the line
- a relationship exists, but all of the points do not fall on the line
- no relationship exists
- a relationship exists, but none of the points can fall on the line
- A relationship can be _________.
- perfect
- imperfect
- nonexistent
- a, b and c
- The closer the points on a scatter diagram fall to the regression line, the _________ between the scores.
- a. higher the correlation
- lower the correlation
- correlation doesn’t change
- need more information
- Which of the following is(are) correct interpretation(s) of correlation? Correlation _________.
- indicates the degree of the relationship between two variables
- indicates a causal relationship between two variables
- is useful in deciding which variables to manipulate in an experimental study
- a and b
- e. a and c
- a, b and c
- The lowest degree of correlation shown below is _________.
- 0.75
- -0.33
- -0.25
- d. 0.15
- The correlation coefficient between heights from the ground of two people on the opposite ends of a seesaw would be _________.
- 1.0
- 0
- c. -1.0
- cannot tell without further information
- If the correlation between variables X and Y is 0.95, which of the following is true?
- X is a cause of Y
- Y is a cause of X
- low scores on X are accompanied by high scores on Y
- d. high scores on X are accompanied by high scores on Y
- a and d
- Y can be most accurately predicted from X if the correlation between X and Y is _________.
- 0.80
- 0.00
- 0.45
- d. -0.98
- Which Pearson correlation coefficient shows the strongest relationship between two variables?
- a. -0.80
- 0.00
- 0.75
- 0.20
- 0.03
- Knowing nothing more than that IQ and memory scores are correlated 0.84, you could validly conclude that _________.
- good memory causes high IQ
- high IQ causes good memory
- neither good memory nor high IQ cause each other
- a third variable causes both good memory and high IQ
- e. none of the above
- When deciding which measure of correlation to employ with a specific set of data, you should consider _________.
- whether the relationship is linear or nonlinear
- type of scale of measurement for each variable
- c. a and b
- none of the above
- The proportion of variance accounted for by a correlation between two variables is determined by _________.
- Y^{2}
- b. r^{2}
- r
- b
- Which of the following statements is true?
- Correlation implies causation.
- b. Causation implies correlation.
- neither a nor b
- both a and b
- A correlation between college entrance exam grades and scholastic achievement was found to be -1.08. On the basis of this you would tell the university that _________.
- the entrance exam is a good predictor of success
- b. they should hire a new statistician
- the exam is a poor predictor of success
- students who do best on this exam will make the worst students
- students at this school are underachieving
- It is possible to compute a coefficient of correlation if one is given _________.
- a single score
- b. two sets of measurements on the same individuals
- 50 scores of a clerical aptitude test
- all of the above
- none of the above
- After several studies, Professor Smith concludes that there is a zero correlation between body weight and bad tempers. This means that _________.
- heavy people tend to have bad tempers
- skinny people tend to have bad tempers
- no one has a bad temper
- everyone has a bad temper
- e. a person with a bad temper may be heavy or skinny
- Which of the following statements concerning Pearson r is not true?
- r = 0.00 represents the absence of a relationship.
- b. The relationship between the two variables must be nonlinear.
- r = 0.76 has the same predictive power as r = -0.76.
- r = 1.00 represents a perfect relationship.
- All of the above are true statements.
- If the correlation between two variables is -1.00 and the score of a given individual is 2.20 standard deviations above the mean on one of the variables, we would predict a score on the second variable of _________.
- a. 2.20 standard deviations below the mean
- 2.20 standard deviations above the mean
- more than 2.20 standard deviations above the mean
- more than 2.20 standard deviations below the mean
- Which of the following is (are) not correct interpretations of Pearson r?
- a. ratio of the variability of Y to the variability of X
- measure of extent to which paired scores occupy the same or opposite positions within their own distributions
- difference between the variability of Y and the variability of X
- square root of the proportion of the total variability of Y accounted for by X
- a and c
- Which of the following is (are) not correlation coefficients?
- Pearson r
- eta
- rho
- phi
- e. they all are correlation coefficients
- Rho is used _________.
- when both variables are dichotomous
- when both variables are of interval or ratio scaling
- c. when one or both variables are only of ordinal scaling
- when the data is nonlinear
- When a correlation exists, lowering the range of either of the variables will _________.
- raise the correlation
- b. lower the correlation
- not change the correlation
- produce a causal relationship
- A traffic safety officer conducted an experiment to determine whether there is a correlation between people’s ages and driving speeds. Six individuals were randomly sampled and the following data were collected.
Age | 20 | 25 | 45 | 46 | 60 | 65 |
Speed (mph) | 60 | 47 | 55 | 38 | 45 | 35 |
The value of Pearson r equals _________.
- -0.82
- -0.70
- -0.63
- +0.70
- A traffic safety officer conducted an experiment to determine whether there is a correlation between people’s ages and driving speeds. Six individuals were randomly sampled and the following data were collected.
Age Y | 20 | 25 | 45 | 46 | 60 | 65 |
Speed X (mph) | 60 | 47 | 55 | 38 | 45 | 35 |
The proportion of variability of Y accounted for by X is _________.
- 0.49
- 0.67
- 0.40
- -0.49
- A researcher wanted to know if the order in which runners finish a race is correlated with their weight. She conducts an experiment and the data are given below.
Finishing order | 1 | 2 | 3 | 4 | 5 | 6 |
Weight (lbs) | 110 | 114 | 112 | 108 | 116 | 113 |
What is the appropriate correlation coefficient for these data?
- r
- b. rho
- phi
- biserial
- A researcher wanted to know if the order in which runners finish a race is correlated with their weight. She conducts an experiment and the data are given below.
Finishing order | 1 | 2 | 3 | 4 | 5 | 6 |
Weight (lbs) | 110 | 114 | 112 | 108 | 116 | 113 |
The correlation for these data equals _________.
- a. 0.31
- 0.32
- 0.41
- 0.45
- If N is small, an extreme score _________.
- won’t affect r_{obt} unduly
- should be thrown out
- might have a large effect on r_{obt}
- has no effect on the value of r_{obt}
- Which of the following values of r represents the strongest degree of relationship between two variables?
- 0.55
- 0.00
- 0.78
- -0.80
- What is the slope for the points X1 = 30, Y1 = 50 and X2 = 25, Y2 = 40?
- 2.00
- 0.50
- -2.00
- -0.50
- In order for the correlation coefficient to be negative, which of the following must be true?
- ΣXY > (ΣX)(ΣY)/N
- ΣXY < (ΣX)(ΣY)/N
- ΣXY = (ΣX)(ΣY)/N
- ΣXY must be zero
- If two variables are ratio scaled and the relationship is linear, what type of correlation coefficient is most appropriate?
- Pearson r
- Spearman rho
- eta
- phi
- Correlation implies causation.
- true
- false
- Causation implies correlation.
- true
- false
- Pearson r can be properly used on which of the following type(s) of relationships?
- linear
- curvilinear
- exponential
- all of the above
- If one takes a sample of pairs of points over a narrow range of X or Y scores, what effect might this have on the value of r?
- inflate r
- have no effect on r
- reduce r
- cannot be determined
- You have conducted a brilliant study which correlates IQ score with income and find a value of r = 0.75. At the end of the study you find out all the IQ scores were scored 10 points too high. What will the value of r be with the corrected data?
- r will be increased
- r will be decreased
- r will remain the same
- cannot be determined
- If z_{X} equals z_{Y} for each pair of points, r will equal _______.
- 0.00
- -1.00
- 1.00
- 0.50
- If one calculates r for raw scores, and then calculates r on the z scores of the same data, the value of r will _______.
- stay the same
- decrease
- increase
- equal 1.00
- You have noticed that as people eat more ice cream they also have darker suntans. From this observation, you conclude _______.
- eating ice cream causes people to tan darker
- when one’s skin tans it causes an urge to eat ice cream
- the results were spurious
- perhaps a third variable is responsible for the correlation
- all of the above are possible
- If 49% of the total variability of Y is accounted for by X, what is the value of r?
- 0.49
- 0.51
- 0.70
- 0.30
- What is the value of r for the following relationship between height and weight?
Height | 60 | 64 | 65 | 68 |
Weight | 103 | 122 | 137 | 132 |
- 0.87
- 0.76
- 0.93
- 0.56
- For the following X and Y scores, how much of the variability of Y is accounted for by knowledge of X? Assume a linear relationship.
X | 20 | 15 | 6 | 10 |
Y | 6 | 5 | 4 | 0 |
- 68%
- 34%
- 58%
- 27%
- What is the value of the Spearman rank order correlation coefficient (rho) for the following pairs of ranks?
- 0.40
- 0.50
- 0.60
- 0.70
- In order to properly use rho, the variables must be of at least _______ scaling.
- nominal
- ordinal
- interval
- ratio
- Pearson r is _______.
- a measure of the extent to which paired scores occupy the same or opposite positions within their own distributions.
- the square root of the proportion of the variability of Y that is accounted for by X.
- used when both variables are of interval or ratio scaling.
- All of the above are true.
- A correlation of r = 0.60 exists between a set of X and Y scores. If a constant of 10 is added to each score of both distributions, the value of r will _______.
- remain the same
- will increase
- decrease
- be less meaningful
- b and d
- If a correlation is perfect, _________.
- all the points must fall on a straight line
- all the points must fall on a curved line
- most of the points must fall on the line, but some can miss it
- all the points must fall on a straight or curved line
True/False
- For a linear relationship to exist, all the points must fall on a straight line.
- In a positive relationship as X increases, Y increases.
- The slope of the line reveals whether the relationship is positive or negative.
- The farther away the points on a scatter diagram fall from the regression line, the lower the correlation.
- Correlation implies causation.
- Causation implies correlation.
- Given a -1.00 correlation coefficient, a raw score of 32 on one measure must be accompanied by a score of -32 on the corresponding second measure.
- As the value of r increases, the proportion of variability of Y that is accounted for by X decreases.
- r^{2} is called the coefficient of determination.
- Correlation deals with the relationship between two variables.
- A correlation coefficient expresses quantitatively the degree of relationship between two variables.
- In a perfect positive correlation, each individual obtains the same z score on each variable.
- The range of a correlation coefficient is 0 to +1.
- The use of z scores allows comparisons between variables measured on different scales and units.
- Pearson r requires that the data be of interval or ratio scaling.
- If the relationship is imperfect, the value of the correlation coefficient must be negative.
- Rho is used where one or both variables are at least of interval scaling.
- A scatter plot is used to help determine if the relationship is linear or curvilinear.
- Spearman rho really derives from Pearson r.
- In order to compute r, we must first convert each score to its z score and then do our calculations with the z scores.
- Assuming a correlation exists, as the range of one of the variables decreases, r increases.
- r decreases as N decreases.
- The easiest way to determine if a relationship is linear is to calculate the regression line.
- In a linear relationship all the points must fall on a straight line.
- In a perfect linear relationship all the points must fall on a straight line.
- The slope of a line is a measure of its rate of change.
- In a straight line the slope approaches zero as the line comes near the point X, Y.
- In an inverse relationship as one variable gets larger the other variable gets smaller.
- A correlation coefficient expresses the direction but not the magnitude of a relationship.
- Both Pearson r and Spearman rho can range from -1.00 to +1.00.
- The value of r obtained by calculating the correlation between X and Y is the same as the correlation between Y and X.
- If scores are z scores and if r equals 1, then z_{X} will always equal z_{Y}.
- One reason for calculating r from z scores is to make r independent of units and scaling.
- The coefficient of determination equals the proportion of variability accounted for by the relationship between the variables.
- The coefficient of determination equals.
- Since r is so widely used, it is appropriate to calculate r for nonlinear data.
- The formula for rho is actually just the formula for Pearson’s r simplified to apply to lower order scaling.
- If the value of rho for a set of ordinal data equaled 0.68, the value of r for the same data would be 0.68.
- If the value of r on ratio scaled raw data were 0.87 and the pairs of numbers were converted to ordinal data, and r calculated for the ordinal data, r would equal 0.87.
- Restricting the range of either X or Y will generally lower the correlation between the variables.
- If one calculates r for a set of data and r equals 0.84, one can be certain that the relationship between the variables is not spurious.
- The correlation between two variables when N = 2 will always be perfect.
- The correlation coefficient when N = 2 is meaningless.
- If r = -1.00, the relationship is imperfect.
- If one calculates r for a set of numbers and then adds a constant to each value of one of the variables, the correlation will change.
- If the standard deviation of one of the variables equals zero, r cannot be calculated.
- The correlation coefficient r is a descriptive statistic.
Short Answer
- Define biserial coefficient.
- Define coefficient of determination.
- Define correlation.
- Define correlation coefficient.
- Define curvilinear relationship.
- Define direct relationship.
- Define imperfect relationship.
- Define inverse relationship.
- Define negative relationship.
- Define Pearson r.
- Define perfect relationship.
- Define phi coefficient.
- Define positive relationship.
- Define scatter plot.
- Define slope.
- Define Spearman rho.
- Define variability accounted for by X.
- Define Y intercept.
- Give two definitions of Pearson r.
- Why is r^{2} a useful statistic?
- Why are z scores used, instead of raw scores, as the basis for determining Pearson r? Explain.
- If N is small, can an extreme score cause problems in interpreting the size of the relationship? Explain.
- When two variables are correlated, there are four possible explanations of the relationship. What are they?
- What factors influence the choice of whether to use a particular correlation coefficient?
- Using the following table, construct a scatter plot of the pairs of values for height and weight. Calculate the correlation between height and weight.
- The phone rings and the President is calling you. He says, “I hear you are studying correlation. Can you please tell me how much of the variability of my popularity rating is explained by the inflation rate?” He kindly supplies you with the following data:
Inflation rate | 10 | 12 | 8 | 6 | 14 |
Popularity rating | 70 | 75 | 85 | 84 | 48 |
Please furnish the President with the information he desires.
- An ornithologist wants to know how much of a relationship there is between the weight gain of the female flicker during the spring and the number of eggs laid. What is the Pearson r for the data the ornithologist gathered?
Bird | 1 | 2 | 3 | 4 | 5 | 6 |
Weight Gain (oz.) | 2 | 6 | 6 | 3 | 10 | 7 |
Number of Baby Flickers | 0 | 1 | 3 | 3 | 6 | 5 |
- An ornithologist wants to know how much of a relationship there is between the weight gain of the female flicker during the spring and the number of eggs laid. What is the Pearson r for the data the ornithologist gathered?
Bird | 1 | 2 | 3 | 4 | 5 | 6 |
Weight Gain (oz.) | 2 | 6 | 6 | 3 | 10 | 7 |
Number of Baby Flickers | 0 | 1 | 3 | 3 | 6 | 5 |
How much of the variability in the number of eggs is accounted for by the amount of weight gained?
- A teacher was interested in knowing if leadership ability was correlated with attractiveness in third graders. The teacher ranked a group of students for leadership ability and had another teacher rank the same children on attractiveness.
- What level of scaling is involved in this problem?
- What correlation coefficient is appropriate for this type of data?
- Using the following data and table as an aid, what is the value of r_{s} for these data?
- Calculate the Pearson r for the following set of data:
X | 1 | 2 | 3 | 4 |
Y | 10 | 13 | 14 | 16 |
- Using the following set of data:
X | 1 | 2 | 3 | 4 |
Y | 16 | 14 | 13 | 10 |
- Calculate the value of Pearson r.
- What is the difference between the data of problem 30 and problem 31? How does it relate to Pearson r?
- Given the following set of data for six subjects:
Subject No. | 1 | 2 | 3 | 4 | 5 | 6 |
X | 6 | 9 | 7 | 7 | 5 | 15 |
Y | 9 | 7 | 10 | 8 | 6 | 17 |
- Construct a scatter plot of the data.
- Calculate Pearson r for the first 5 subjects.
- Next, add the data of subject 6, and recalculate Pearson r for all six subjects.
- Explain the difference between the values of r for part b and part c.
- Given the following set of data for 10 subjects:
Subject No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
X | 5 | 8 | 6 | 10 | 9 | 7 | 6 | 7 | 9 | 10 |
Y | 7 | 9 | 7 | 12 | 8 | 8 | 9 | 10 | 11 | 10 |
- Construct a scatter plot of the data.
- Calculate Pearson r.
- Remove the data for subjects 1, 3, 4, and 10. Recalculate r.
- Explain the difference in the values of r obtained for part b and part c.
- A study is conducted to determine the reliability of two judges in assessing musical performance. The judges are asked to rate 8 musical contestants on a twenty-point scale. The higher the rating, the greater the assessed musical talent. The following data are obtained:
Contestant No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
Judge A | 18 | 16 | 12 | 10 | 17 | 15 | 13 | 14 |
Judge B | 16 | 13 | 14 | 7 | 18 | 11 | 9 | 17 |
Assuming the data are only of ordinal scaling, use the appropriate correlation coefficient to assess how alike the judges are in their ratings.
CHAPTER 7
Linear Regression
LEARNING OBJECTIVES
After completing Chapter 7, students should be able to:
- Define regression, regression line, and regression constant.
- Specify the relationship between strength of relationship and prediction accuracy.
- Construct the least-squares regression line for predicting Y given X; specify what the least-squares regression line minimizes; and specify the convention for assigning X and Y to the data variables.
- Explain what is meant by standard error of estimate; state the relationship between errors in prediction and the magnitude of s_{Y|X}; define homoscedasticity and explain its use.
- Specify the condition(s) that must be met to use linear regression.
- Specify the relationship between regression constants and Pearson r.
- Explain the use of multiple variables and their relationship to prediction accuracy.
- Compute R^{2} for two variables; specify what R^{2} stands for and what it measures.
- Understand the illustrative examples, do the practice problems and understand the solutions.
DETAILED CHAPTER SUMMARY
- Introduction.
- Linear regression. This topic deals with predicting scores of one distribution using information known about scores of a second distribution. For example, one might predict your height from your weight, given the relationship between height and weight found in a sample of other people.
- Correlation. This refers to the magnitude and direction of the relationship between two variables.
- Least-Squares Regression Line for Prediction.
- Least-squares criterion. In an imperfect relationship no single straight line will hit all the points. We pick the line that will minimize the total errors of prediction, i.e., construct the one line that minimizes Σ (Y – Y’)^{2} where Y’ is the predicted value of Y for any value of X.
- Constructing the regression line of Y on X.
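- Equation. Y’ = b_{Y}X + a_{Y}
where b_{Y} = [Σ XY – (Σ X)(Σ Y)/N] / [Σ X^{2} – (Σ X)^{2}/N] is the slope and a_{Y} = \bar{Y} – b_{Y}\bar{X} is the Y intercept of the line that minimizes errors in predicting Y.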
- Use of regression equation. For a given value of X, simply plug that value into the equation and solve for Y’ using the regression constants b_{Y} and a_{Y}. Note that it is customary to label the variable we are predicting as the Y variable, and the variable we are predicting from as the X variable.
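As a hedged illustration of constructing the regression line of Y on X, the sketch below assumes the usual computational formulas, b_{Y} = [Σ XY – (Σ X)(Σ Y)/N] / [Σ X^{2} – (Σ X)^{2}/N] and a_{Y} = \bar{Y} – b_{Y}\bar{X}. It applies them to the exercise and percent body fat scores that appear in this chapter's test questions. The function names are illustrative and are not taken from the textbook.

```python
def regression_constants(x, y):
    """Return (b_y, a_y) for the least-squares line Y' = b_y * X + a_y."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: sum of cross products divided by the sum of squares of X.
    b_y = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: places the line through the point (mean of X, mean of Y).
    a_y = sum_y / n - b_y * sum_x / n
    return b_y, a_y


def predict(x_value, b_y, a_y):
    """Plug a value of X into the regression equation and return Y'."""
    return b_y * x_value + a_y


exercise = [10, 18, 26, 33, 44]  # minutes of daily exercise (X)
fat = [30, 25, 18, 17, 14]       # percent body fat (Y)

b_y, a_y = regression_constants(exercise, fat)
print(f"Y' = {b_y:.3f}X + {a_y:.3f}")
print(f"Predicted % fat for 20 minutes: {predict(20, b_y, a_y):.2f}")
```

Because the constants come from a sample in which X runs from 10 to 44 minutes, predictions are best limited to values of X in that range.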
- Prediction Errors. When relationships between X and Y variables are imperfect, there will be prediction errors.
- Standard error of estimate (s_{Y|X}). Quantifying the magnitude of the error involves computing the standard error of estimate, symbolized s_{Y|X}. The standard error of estimate is much like the standard deviation.
- Definition. Gives a measure of the average deviation of the prediction errors about the regression line.
- Equation for standard error of estimate. s_{Y|X} = \sqrt{Σ (Y – Y’)^{2} / (N – 2)}
- Interpretation. The larger the value of s_{Y|X}, the less confidence one has in the prediction of Y given X. The smaller the value of s_{Y|X}, the more likely the prediction will be accurate. If one constructed pairs of lines parallel to the regression line at distances of ±1s_{Y|X}, ±2s_{Y|X}, and ±3s_{Y|X}, one would find that about 68%, 95%, and 99% of the scores fall between the lines, respectively.
- Other errors. One must be careful of sources of errors in making predictions. There are two major considerations in making predictions.
- Linearity. The original relationship needs to be linear for accurate prediction using linear regression.
- Prediction in the range. Generally one uses a sample to generate the data for calculating the regression constants (bY and aY). Predictions of Y should be based on values of X within the range of the sample upon which the constants are based.
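The standard error of estimate described above can be sketched in a few lines. This is a minimal example that assumes the N – 2 denominator, s_{Y|X} = \sqrt{Σ (Y – Y’)^{2} / (N – 2)}. It reuses the exercise and percent body fat scores from this chapter's test questions. The function name is illustrative.

```python
import math


def standard_error_of_estimate(x, y):
    """Return s_{Y|X} for the least-squares regression of Y on X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b_y = sp / ss_x                # slope of the regression line
    a_y = mean_y - b_y * mean_x    # Y intercept
    # Sum of squared prediction errors, sum of (Y - Y')^2.
    ss_error = sum((yi - (b_y * xi + a_y)) ** 2 for xi, yi in zip(x, y))
    # Assumption: N - 2 in the denominator, as is usual for simple regression.
    return math.sqrt(ss_error / (n - 2))


exercise = [10, 18, 26, 33, 44]  # minutes of daily exercise (X)
fat = [30, 25, 18, 17, 14]       # percent body fat (Y)

s_yx = standard_error_of_estimate(exercise, fat)
print(f"s_Y|X = {s_yx:.2f}")
```

Dividing by N instead of N – 2 would give a smaller value, so it is worth stating which denominator is in use when comparing results.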
- Regression Constants and Pearson r
- Regression coefficient. b_{Y} = r(s_{Y}/s_{X})
- Regression constant a_{Y}. Found in the usual way: a_{Y} = \bar{Y} – b_{Y}\bar{X}
- Slope of regression line for z scores. Equals r.
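The two relationships in this section can be checked numerically. The sketch below is an illustration only. It uses the exercise and percent body fat scores from this chapter's test questions, computes standard deviations with N – 1 in the denominator, and uses helper names that are hypothetical. It compares b_{Y} computed directly with r(s_{Y}/s_{X}), and compares the slope of the regression line for the z scores with r.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation (N - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def sum_of_products(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def pearson_r(x, y):
    return sum_of_products(x, y) / math.sqrt(
        sum_of_products(x, x) * sum_of_products(y, y)
    )


def slope(x, y):
    """Least-squares slope for predicting y from x."""
    return sum_of_products(x, y) / sum_of_products(x, x)


x = [10, 18, 26, 33, 44]  # minutes of daily exercise
y = [30, 25, 18, 17, 14]  # percent body fat

r = pearson_r(x, y)
b_direct = slope(x, y)
b_from_r = r * std_dev(y) / std_dev(x)

# Convert both variables to z scores and fit the regression line again.
z_x = [(v - mean(x)) / std_dev(x) for v in x]
z_y = [(v - mean(y)) / std_dev(y) for v in y]
slope_z = slope(z_x, z_y)

print(f"r = {r:.4f}, b_Y = {b_direct:.4f}, r(s_Y/s_X) = {b_from_r:.4f}")
print(f"slope for z scores = {slope_z:.4f}")
```

The z scores have equal standard deviations (both equal 1), which is why the slope for the z scores reduces to r.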
- Multiple Regression
- Extension of simple regression. Multiple regression is an extension of simple regression (single predictor) to situations that involve two or more predictor variables.
- Prediction accuracy. Using more than one predictor variable often, but not always, increases the accuracy of prediction.
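- Equation for two predictor variables. Y’ = b_{1}X_{1} + b_{2}X_{2} + a, where b_{1} and b_{2} are the coefficients for the predictor variables X_{1} and X_{2}, and a is the prediction constant.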
- Multiple coefficient of determination, R^{2}. R^{2} = multiple coefficient of determination = squared multiple correlation.
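- Equation of R^{2} for two predictor variables. R^{2} = (r_{YX_{1}}^{2} + r_{YX_{2}}^{2} – 2r_{YX_{1}}r_{YX_{2}}r_{X_{1}X_{2}}) / (1 – r_{X_{1}X_{2}}^{2})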
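For two predictor variables, R^{2} can be computed from the three pairwise correlations. The sketch below assumes the standard formula R^{2} = (r_{YX_{1}}^{2} + r_{YX_{2}}^{2} – 2r_{YX_{1}}r_{YX_{2}}r_{X_{1}X_{2}}) / (1 – r_{X_{1}X_{2}}^{2}). The correlation values are hypothetical, chosen only for illustration, and the function name is not from the textbook.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """Squared multiple correlation for predicting Y from X1 and X2.

    r_y1: correlation between Y and X1
    r_y2: correlation between Y and X2
    r_12: correlation between the two predictors X1 and X2
    """
    if abs(r_12) >= 1:
        # Perfectly correlated predictors leave the formula undefined.
        raise ValueError("predictors must not be perfectly correlated")
    numerator = r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12
    return numerator / (1 - r_12 ** 2)


# Hypothetical correlations, for illustration only.
r2_single = 0.6 ** 2                                        # X1 alone
r2_uncorrelated = r_squared_two_predictors(0.6, 0.5, 0.0)   # X1 and X2 unrelated
r2_overlapping = r_squared_two_predictors(0.6, 0.5, 0.5)    # X1 and X2 overlap

print(f"r^2 with X1 alone:             {r2_single:.3f}")
print(f"R^2, predictors uncorrelated:  {r2_uncorrelated:.3f}")
print(f"R^2, predictors correlated .5: {r2_overlapping:.3f}")
```

In this hypothetical case the gain from the second predictor is smaller when the two predictors are correlated with each other, because part of what the second predictor contributes is already carried by the first.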
TEACHING SUGGESTIONS AND COMMENTS
The regression chapter is about at the same difficulty level as the correlation chapter. It makes sense to teach regression directly after correlation, since both rely on the same data and relationship. Since students are now familiar with the computational equation for calculating r, they are not as disturbed by the equation for computing the slope coefficient b_{Y} as they were initially with r. However, it still requires a fair amount of work, and accuracy can be a problem. Computing Pearson r and b_{Y} are good examples of why on exams I prefer to present computational questions in computational form, and leave enough room right under the question for students to show all their work in the provided space. Then it is possible to trace their work and give them the partial credit it deserves, rather than all-or-none grading as happens with multiple choice questions. Of course, if the class is large and there is not adequate TA grading help, it may not be possible to do this.
The difficulty level of the material in this chapter is such that I recommend you use the same examples as are given in the textbook. I also recommend that you make slides, or overheads, or download the set of overheads that are available on the web for many of the figures and tables that are in this chapter. Using these visual presentations when lecturing on the material helps a lot for the material in this chapter.
The section titled, Relation between regression constants and Pearson r, p. 172-174 can be omitted, if time is short. This section is included to make the theoretical point that both Pearson r and the b_{Y} coefficient are slopes of regression lines, Pearson r being the slope of the least-squares regression line when the raw scores are expressed as z scores. In fact, you could define Pearson r in this way. I think of this as an important theoretical point, but one that is not as important as the other material and therefore can be omitted if time doesn’t permit. This section also is included to show the quantitative relationship between r and b_{Y}. I like including this relationship in my lectures, but I believe it is of lesser importance than the rest of the material. Aside from these two topics, I usually lecture on the rest of the material in the chapter. The chapter seems to work well, so I recommend you follow it in your lecture(s).
DISCUSSION QUESTIONS
- Define regression line. What does the regression line emphasized in the textbook minimize? Is this the only regression line that it might be desirable to construct? What other one might be desirable? Suggest some situations that might be appropriate for each and discuss.
- If there is no relationship between the X and Y variables and we desire to predict Y given X using a least-squares criterion, it is best to predict \bar{Y} for every Y score. Is this correct? If so, explain why. (Hint: one of the properties of the mean might be helpful here)
- Using two predictor variables, under what two conditions would R^{2} = r^{2}? If either or both of these conditions did exist, what would be the gain in prediction accuracy by using the second predictor variable? Explain.
- Using the least-squares regression line for predicting Y given X minimizes the error of prediction for each score. Is this true? Explain.
- A friend that thinks a lot about statistics asserts that, “the closer the points in the scatter plot are to the least-squares regression line, the higher the correlation.” Is your friend correct? Discuss.
- Explain the convention that is used to assign X and Y to the data variables.
TEST QUESTIONS
Multiple Choice Questions
- The primary reason we use a scatter plot in linear regression is _________.
- to determine if the relationship is linear or curvilinear
- to determine the direction of the relationship
- to compute the magnitude of the relationship
- to determine the slope of the least squares regression line
- When the relation between X and Y is imperfect, the prediction of Y given X is _________.
- perfect
- always equal to Y
- impossible to determine
- approximate
- The regression equation most often used in psychology minimizes _________.
- Σ (Y – Y’)
- Σ (Y – Y’)^{2}
- Σ (Y – X)^{2}
- none of the above
- The regression of Y on X _________.
- predicts X given Y
- predicts X’ given X
- predicts Y given X
- predicts Y given Y’
- For regression purposes,
- X is assigned to the variable being predicted
- Y is assigned to the variable being predicted
- It doesn’t matter whether X or Y is assigned to the variable being predicted
- none of the above
- If the correlation between two sets of scores is 0 and one had to predict the value of Y for any given value of X, the best prediction of Y would be _________.
a.
- 0
- During the past 5 years there has been an inflationary trend. Listed below is the average cost of a gallon of milk for each year.
1981 | 1982 | 1983 | 1984 | 1985 |
$1.10 | $1.23 | $1.30 | $1.50 | $1.65 |
Assuming a linear relationship exists, and that the relationship continues unchanged through 1986, what would you predict for the average cost of a gallon of milk in 1986?
- $1.77
- $1.72
- $1.70
- $1.83
- A researcher collects data on the relationship between the amount of daily exercise an individual gets and the percent body fat of the individual. The following scores are recorded.
Individual | 1 | 2 | 3 | 4 | 5 |
Exercise (min) | 10 | 18 | 26 | 33 | 44 |
% Fat | 30 | 25 | 18 | 17 | 14 |
Assuming a linear relationship holds, the least squares regression line for predicting % fat from the amount of exercise an individual gets is _________.
- Y’ = 0.476X + 33.272
- Y’ = 1.931X + 66.363
- Y’ = -0.476X + 33.272
- Y’ = -0.432X + 32.856
- A researcher collects data on the relationship between the amount of daily exercise an individual gets and the percent body fat of the individual. The following scores are recorded.
Individual | 1 | 2 | 3 | 4 | 5 |
Exercise (min) | 10 | 18 | 26 | 33 | 44 |
% Fat | 30 | 25 | 18 | 17 | 14 |
Based on the above data, if an individual exercises 20 minutes daily, his predicted % body fat would be _________.
- 21.63
- 27.74
- 27.88
- 23.75
- A researcher collects data on the relationship between the amount of daily exercise an individual gets and the percent body fat of the individual. The following scores are recorded.
Individual | 1 | 2 | 3 | 4 | 5 |
Exercise (min) | 10 | 18 | 26 | 33 | 44 |
% Fat | 30 | 25 | 18 | 17 | 14 |
The least squares regression line for predicting the amount of exercise from % fat is _________.
- X’ = -1.931Y + 66.363
- X’ = -0.476Y + 33.272
- X’ = 1.931Y + 66.363
- X’ = -1.905Y + 62.325
- A researcher collects data on the relationship between the amount of daily exercise an individual gets and the percent body fat of the individual. The following scores are recorded.
Individual | 1 | 2 | 3 | 4 | 5 |
Exercise (min) | 10 | 18 | 26 | 33 | 44 |
% Fat | 30 | 25 | 18 | 17 | 14 |
If an individual has 22% fat, his predicted amount of daily exercise is _________.
- 22.80
- 23.88
- 24.76
- 20.22
- A researcher collects data on the relationship between the amount of daily exercise an individual gets and the percent body fat of the individual. The following scores are recorded.
Individual | 1 | 2 | 3 | 4 | 5 |
Exercise (min) | 10 | 18 | 26 | 33 | 44 |
% Fat | 30 | 25 | 18 | 17 | 14 |
The value for the standard error of estimate in predicting % fat from daily exercise is _________.
- 3.35
- 4.32
- 2.14
- 1.66
- none of the above
- The assumption of homoscedasticity is that _________.
- the range of the Y scores is the same as the X scores
- the X and Y distributions have the same mean values
- the variability of Y doesn’t change over the X scores
- the variability of the X and Y distributions is the same
- You go to a carnival and a sideshow performer wants to bet you $100 that he can guess your exact weight just from knowing your height. It turns out that there is the following relationship between height and weight.
Height (in) | 60.0 | 62.0 | 63.0 | 66.5 | 73.5 | 84.0 |
Weight (lbs) | 99 | 107 | 111 | 125 | 153 | 195 |
Should you accept the performer’s bet? Explain.
- yes
- need more information
- no
- yes, if he measures my height in centimeters
- If r = 0.4582, sY = 3.4383, and sX = 5.2165, the value of bY = _________.
- 0.695
- 0.458
- 0.302
- 1 – 0.458
- none of the above
- In multiple regression, if the second predictor variable correlates highly with the predicted variable, then it is quite likely that _________.
- R^{2} = 1.00
- R^{2} > r^{2}
- R^{2} = r^{2}
- R^{2} < r^{2}
- If the relationship between X and Y is perfect:
- r = b_{Y}
- r ≠ b_{Y}
- prediction is approximate
- a and c
- all of the above
- When predicting Y, adding a second predictor variable to the first predictor variable X, will _______.
- always increase prediction accuracy
- increase prediction accuracy depending on the relationship between the second predictor variable and X
- increase prediction accuracy depending on the relationship between the second predictor variable and Y
- b and c
- The higher the standard error of estimate is,
- the more accurate the prediction is likely to be
- the less accurate the prediction is likely to be
- the less confidence we have in the accuracy of the prediction
- the more confidence we have in the accuracy of the prediction
- a and d
- b and c
- If s_{Y|X} = 0.0, the relationship between the variables is _________.
- perfect
- imperfect
- curvilinear
- unknown
- Σ (Y – Y’) equals _________.
- 0
- 1
- cannot be determined from information given
- who cares
- Σ (Y – Y’)^{2} represents _________.
- the standard deviation
- the variance
- the standard error of estimate
- the total error of prediction
- In a particular relationship N = 80. How many points would you expect on the average to find within ±1s_{Y|X} of the regression line?
- 40
- 80
- 54
- 0
- What would you predict for the value of Y for the point where the value of X is \bar{X}?
- cannot be determined from information given
- 0
- 1
- If the value of s_{Y|X} = 4.00 for relationship A and s_{Y|X} = 5.25 for relationship B, in which relationship would you have the most confidence in a particular prediction?
- A
- B
- it makes no difference
- cannot be determined from information given
- If b_{Y} is negative, higher values of X are associated with _________.
- lower values of X’
- higher values of Y
- higher values of (Y – Y’)
- lower values of Y
- Which of the following statement(s) is (are) an important consideration(s) in applying linear regression techniques?
- the relationship should be linear
- both variables must be measured in the same units
- predictions for Y should be within the range of the X variable in the sample
- a and c
- In the regression equation Y’ = X, the Y-intercept is _________.
- 0
- 1
- If the value for a_{Y} is negative, the relationship between X and Y is _________.
- positive
- negative
- inverse
- cannot be determined from information given
- If b_{Y} = 0, the regression line is _________.
- horizontal
- vertical
- undefined
- at a 45° angle to the X axis
- The least-squares regression line minimizes _________.
- s
- Σ (Y – )^{2}
- Σ (Y – Y’)^{2}
- b and d
- The points (0,5) and (5,10) fall on the regression line for a perfect positive linear relationship. What is the regression equation for this relationship?
- Y’ = X + 5
- Y’ = 5X
- Y’ = 5X + 10
- cannot be determined from information given.
- For the following points what would you predict to be the value of Y’ when X = 19? Assume a linear relationship.
X | 6 | 12 | 30 | 40 |
Y | 10 | 14 | 20 | 27 |
- 16.35
- 24.69
- 22.00
- 17.75
- If N = 8, Σ X = 160, Σ X^{2} = 4656, Σ Y = 79, Σ Y^{2} = 1309, and Σ XY = 2430, what is the value of b_{Y}?
- 0.9217
- -1.8010
- 0.5838
- 0.7922
- If X and Y are transformed into z scores, and the slope of the regression line of the z scores is -0.80, what is the value of the correlation coefficient?
- -0.80
- 0.80
- 0.40
- -0.40
- If the regression equation for a set of data is Y’ = 2.650X + 11.250 then the value of Y’ for X = 33 is _________.
- 87.45
- 371.25
- 98.70
- 76.20
- If \bar{X} = 57.2, \bar{Y} = 84.6, and b_{Y} = 0.37, the value of a_{Y} = _________.
- 141.80
- -25.90
- 63.44
- 27.40
- If s_{Y} = s_{X} = 1 and the value of b_{Y} = 0.6, what will the value of r be?
- 0.36
- 0.60
- 1.00
- 0.00
- When using more than one predictor variable, _________ tells us the proportion of variance accounted for by the predictor variables.
- r
- SSX
- SSY
- R^{2}
- Which of the following statements is(are) false?
- b_{Y} is the slope of the line for minimizing errors in predicting Y.
- a_{Y} is the Y axis intercept for minimizing errors in predicting Y.
- s_{Y|X} is the standard error of estimate for predicting Y given X.
- All of the above statements are true.
- R^{2} is the multiple coefficient of nondetermination.
- The regression coefficient b_{Y} and the correlation coefficient r, _________.
- necessarily increase in magnitude as the strength of relationship increases
- are both slopes of straight lines
- are not related
- will equal each other when the variability of the X and Y distributions is equal
- b and d
- When predicting Y given X, _________.
- the prediction is valid only within the range of X
- the variability of the Y values over the range of the X values should be the same
- the representativeness of the sample used to derive the regression line is an important consideration
- a, b, and c
- a and c
- When predicting Y from two variables relative to using only one variable, _________.
- prediction accuracy always increases
- prediction accuracy is dependent on the relationship between the second variable and the Y variable
- increase in prediction accuracy depends on the correlation between the two predictor variables
- b and c
- There is ________ between s_{Y|X} and r.
- a direct relationship
- an inverse relationship
- no relationship
- animosity
- The regression coefficient for predicting Y given X is symbolized by _______
- b_{Y}
- a_{Y}
- b_{X}
- a_{X}
- The regression constant for predicting Y given X is symbolized by _________.
- b_{Y}
- a_{Y}
- b_{X}
- a_{X}
- The symbol for the standard error of estimate when predicting Y given X is _________.
- r_{X|Y}
- s_{X|Y}
- r_{Y|X}
- s_{Y|X}
True/False
- The total error in prediction equals Σ (Y – Y’).
- When doing regression, it is customary to assign X to the predicted variable.
- An imperfect relationship generally yields exact prediction.
- When the relationship is perfect, the regression of Y on X is the same as the regression of X on Y.
- Properly speaking, we should limit our predictions to the range of the base data.
- The least squares regression line ensures the maximum number of direct hits.
- To do linear regression, there must be paired scores on two variables.
- If the standard deviations of the X and Y distributions are equal, then r = bY.
- If sX = sY then r = bY.
- The higher the r value, the lower the standard error of estimate.
- Multiple regression uses more than one predictor variable.
- Multiple regression always results in greater prediction accuracy than simple regression.
- If the correlation between two variables is 1.00, the standard error of estimate equals 0.
- Pearson r is the slope of the least squares regression line when the scores are plotted as z scores.
- When there are two predictor variables, R^{2} is the simple sum of r^{2} for the relationship of the first predictor variable and Y and r^{2} for the relationship of the second predictor variable and Y.
- For regression purposes, it is customary to assign Y to the predicted variable.
- For regression purposes, it is customary to assign Y to the variable we are predicting from.
- For regression purposes, it is customary to assign X to the variable we are predicting from.
- In regression analysis we are only concerned with perfect as opposed to imperfect relationships.
- If we minimize Σ (Y – Y’)^{2}, we will minimize the total error of prediction.
- The value a_{Y} is the X axis intercept for minimizing errors in Y.
- Generally, one can use the same regression equation for predicting Y given X as for X given Y.
- If the relationship between two variables is perfect the standard error of estimate equals 0.
- If the standard error of estimate for relationship 1 equals 5.26 and for relationship 2 it equals 8.01 then we can reasonably infer that relationship 2 is less perfect than relationship 1.
- It is impossible to have a negative value for the standard error of estimate.
- In general one is less confident in predictions of Y when the value of X used for the prediction is outside the range of the original data used to construct the regression line.
- If the regression line is parallel to the X axis then the slope of the regression line equals 0.
- The regression line will always go through the point (\bar{X}, \bar{Y}).
- If X and Y are plotted as standard (z) scores, then r equals the slope of the resulting regression line.
- If s_{Y} = s_{X}, then r = b_{Y}.
- Using a second predictor variable always increases the accuracy of prediction.
Short Answer
- Define homoscedasticity.
- Define least-squares regression line.
- Define multiple coefficient of determination.
- Define multiple correlation.
- Define regression.
- Define regression constant.
- Define regression line.
- Define regression of Y on X.
- Define standard error of estimate.
- Why is it important to know the standard error of estimate for a set of paired scores?
- Why does the least squares regression line minimize Σ (Y – Y’)^{2}, rather than Σ (Y – Y’)?
- The least squares regression line is the prediction line that results in the most direct “hits.” Is this true? Explain.
- In multiple regression, will use of a second predictor variable always increase the accuracy of prediction? Explain.
- If there is no relationship between the X and Y variables and we desire to predict Y given X using a least-squares criterion, it is best to predict \bar{Y} for every Y score. Is this correct? If so, explain why. (Hint: one of the properties of the mean might be helpful here)
- A friend that thinks a lot about statistics asserts that, “the closer the points in the scatter plot are to the least-squares regression line, the higher the correlation.” Is your friend correct? Discuss.
- When doing regression, what is the convention used for assigning X and Y to the data?
- X represents aptitude test scores and Y represents grade point average in college. If the least-squares regression line for the relationship between these two variables is Y’ = .005X + 1.2, what GPA would you predict for people who scored each of the following scores on the aptitude test?
- 159
- 300
- 500
- 550
- Draw a graph of aptitude test score versus grade point average and construct the regression line for the line Y’ = .005X + 1.2.
- A professor wanted to predict final exam scores from midterm exam scores. He used data from several different professors teaching the same class. He obtained the following data:
What are the values for each of the following?
- Σ X.
- Σ X^{2}.
- N.
- Σ Y.
- Σ Y^{2}.
- (Σ X)^{2}.
- (Σ Y)^{2}.
- Σ XY.
- bY.
- aY.
- If the professor’s class score on the midterm were 77.4, what score would you predict the class would receive on the final exam?
- What is the value of s_{Y|X}?
- A hospital administrator wanted to predict the number of patients her hospital would admit in 1990. The following data were obtained from past records:
Year | 1960 | 1965 | 1970 | 1975 | 1980 |
Number of Admissions | 812 | 983 | 1127 | 904 | 1768 |
- What would the best prediction be for the number of admissions expected in 1990?
- What serious caution should the administrator be aware of when making her prediction?
- A psychologist wanted to use a locus of control test to predict scores on a depression scale. The following data were summarized for the relationship between the locus of control and depression scale:
Σ X = 62, Σ X^{2} = 1022, Σ Y = 70, Σ Y^{2} = 1234, Σ XY = 1107, N = 4
- What is the value of bY?
- What is the value of aY?
- What would the psychologist predict for the score on the depression scale if a client scored an 18 on the locus of control scale?
- Consider the following set of data points:
X | 2 | 4 | 8 | 14 | 20 | 23 | 25 |
Y | 2 | 6 | 14 | 20 | 12 | 9 | 7 |
- Construct a scatter plot of the points.
- Is it appropriate to use a least-squares linear regression line to predict Y from X in this case? Why or why not?
- Consider the following set of points for variable X and variable Y:
X | 21 | 29 | 33 | 40 | 50 |
Y | 34 | 36 | 42 | 45 | 58 |
Assume the relationship is linear in answering the following questions.
- What is the value of the regression constant bX for predicting X given Y?
- What is the value of the regression constant aX for predicting X given Y?
- What value of X would you predict for a value of Y = 37?
- What value of X would you predict for a value of Y = 55?
- What is the linear regression equation for predicting Y given X for the following pairs of scores:
X | 9 | 15 | 25 | 27 | 42 | 50 | 30 |
Y | 14 | 11 | 5 | 5 | 0 | –8 | 1 |
- Given the following data, what are the values of b_{Y} and a_{Y}?
s_{X} = 12, s_{Y} = 14, Σ X = 57, Σ Y = 83, N = 10, r = .843