Unit 2: Two-Variable Relationships
Teaching Tips
• Regression is a complex topic that we cover fairly quickly; we'll cover more of it in the next unit. But now is probably a good time to give you some organizing principles. I like to teach this by presenting it in the same order I would analyze data:
• When it comes to interpretation, the slope usually carries quite a bit of meaning while the intercept often does not. It is not unusual for the intercept to be meaningless: many variables, height for example, never take on the value 0.
• Interpreting the slope in the context of the data collected, without implying a causal connection between x and y, is hard. You want to be very careful not to use the algebraic interpretation of slope: as x goes up by 1 unit, y changes by b units. One reason for this caution is that quite often in a data set we do not get to observe x changing for any given unit. So if y represents sons' heights and x represents fathers' heights, interpreting the slope to mean that as a father's height increases by 1 inch the average son's height goes up by b inches is nonsensical, since fathers do not grow.
• For most observational studies (a good example is the fathers' and sons' heights referred to in the last bullet), a good interpretation of the slope is to compare how y values differ for different x values. For example, if the slope between father and son heights were 0.5, we could correctly interpret it to mean "Sons whose fathers' heights differ by 1 inch differ in average height by 0.5 inches."
• The regression of y on x is not the same as that of x on y. However, the correlation between y and x is the same as between x and y.
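For instructors who want to see this asymmetry numerically, here is a minimal sketch in Python with NumPy, using simulated (hypothetical) father/son-style data; the variable names and parameter values are illustrative, not from any real data set:

```python
import numpy as np

# Simulated (hypothetical) paired data.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 0.5 * x + rng.normal(0, 5, 200)

b_yx = np.polyfit(x, y, 1)[0]   # slope of the regression of y on x
b_xy = np.polyfit(y, x, 1)[0]   # slope of the regression of x on y

r_xy = np.corrcoef(x, y)[0, 1]  # correlation of x with y
r_yx = np.corrcoef(y, x)[0, 1]  # correlation of y with x

print(b_yx, b_xy)               # two different slopes
print(np.isclose(r_xy, r_yx))   # True: correlation is symmetric
```

The two regressions minimize vertical versus horizontal distances to the line, so their slopes differ, while the correlation coefficient treats the two variables symmetrically.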
• There are two parts to the correlation coefficient, and each provides different information: the sign tells the direction of the trend and the magnitude tells the strength.
• When using the regression line for prediction, the R-square value is important. For application of the model to real-world contexts, the interpretation of the slope is important as it describes the relationship between the variables.
• In introductory courses, we plot residuals versus x, but in more advanced courses (and many software packages) residuals are plotted against the predicted values of the y variable. The two methods are equivalent when there is only one predictor variable. In advanced courses we do "multiple regression" in which there are several predictors, and in that context it makes more sense to plot the residuals against the predicted values.
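The equivalence of the two residual plots for one predictor can be checked directly: with a single x, the predicted values are an exact linear function of x, so the two horizontal axes are just rescalings of each other. A quick sketch with simulated (hypothetical) data:

```python
import numpy as np

# Simulated single-predictor data (hypothetical values).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 + 3 * x + rng.normal(0, 1, 100)

b, a = np.polyfit(x, y, 1)       # fitted slope and intercept
predicted = a + b * x
residuals = y - predicted

# predicted = a + b*x is an exact linear function of x, so plotting
# residuals against predicted is a rescaled version of plotting them
# against x: the correlation between x and predicted is exactly 1.
r = np.corrcoef(x, predicted)[0, 1]
print(np.isclose(abs(r), 1.0))   # True
```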
• Outliers might or might not be influential points, but when looking for influential points, they are a good place to start. Influential points are points that, if moved or removed, would result in a drastically different slope. The word “drastically” is not a technical term. See the Activity tab for some examples of this.
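Besides the examples on the Activity tab, a quick simulation shows how a single high-leverage point can drag the slope; the data and the added point below are hypothetical:

```python
import numpy as np

# Simulated (hypothetical) data with true slope 2.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1 + 2 * x + rng.normal(0, 1, 30)

slope_before = np.polyfit(x, y, 1)[0]

# Add one point far from the cloud in x (high leverage) and far
# below the trend in y: a candidate influential point.
x_out = np.append(x, 50.0)
y_out = np.append(y, 0.0)
slope_after = np.polyfit(x_out, y_out, 1)[0]

print(slope_before, slope_after)  # the slope changes drastically
```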
• Correlation appears in several guises: as a measure of the strength of a linear trend, as an ingredient of the regression slope, and (squared) as R-square.
Student Misconceptions and Confusions
• Students confuse the interpretation of the regression equation with that of the linear equation in algebra. In algebra, the x-values can change. In regression, the x-values are data values for the explanatory variable of interest and are fixed.
• Students will want to describe the predicted value as if it were an exact, deterministic value, rather than the average that it is.
• Sometimes students will equate a steep slope with a high value of the correlation coefficient. This is an easy mistake to make, because the slope does depend directly on the correlation coefficient. However, the ratio of the standard deviation of y to that of x plays an equal role (the slope is b = r·(sy/sx)), so one should not think "steep slope == high r".
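The factorization of the slope into r times the ratio of standard deviations is easy to verify numerically; this sketch uses simulated (hypothetical) data:

```python
import numpy as np

# Simulated (hypothetical) data.
rng = np.random.default_rng(3)
x = rng.normal(0, 1, 500)
y = 0.8 * x + rng.normal(0, 2, 500)

b = np.polyfit(x, y, 1)[0]       # least-squares slope
r = np.corrcoef(x, y)[0, 1]      # correlation coefficient
s_x = np.std(x, ddof=1)          # sample standard deviation of x
s_y = np.std(y, ddof=1)          # sample standard deviation of y

# The slope factors as b = r * (s_y / s_x): a steep slope can come
# from a large s_y/s_x ratio just as easily as from a large r.
print(np.isclose(b, r * s_y / s_x))  # True
```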
• High correlation does not mean that the linear model is good, and low correlation does not mean that the linear model is inappropriate. The correlation coefficient measures the data's proximity to a straight line, but it does not measure the appropriateness of the linear model.
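A classic demonstration: data that are exactly quadratic, with no noise at all, can still have a very high correlation. The particular range of x below is an arbitrary choice for illustration:

```python
import numpy as np

# Exactly quadratic data: clearly curved, a line is the wrong model.
x = np.linspace(1, 10, 50)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(r > 0.95)   # True: high r despite an obviously curved pattern
```

A residual plot would reveal the curvature immediately, which is why residual plots, not r alone, should be used to judge whether a linear model is appropriate.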
• As mentioned in the main concepts, correlation does not imply causation. Just because two variables are correlated (have an association), it does not mean that the independent (or explanatory) variable causes the dependent (or response) variable to behave the way it does. A controlled experiment is necessary to conclude causation.
• Beware of extrapolation! Many physical phenomena are linear over a short range of values, but fail to be linear over a greater range. The moral of this story is that just because a trend appears to be linear for the data you observed does not mean it will be linear for data beyond this range.
• You might notice some confusion over the use of the term "linear models" as you read statistics texts. Statisticians refer to "linear models" as any model that is linear with respect to the parameters of the model. But most introductory texts use "linear model" to mean linear with respect to the data. This means that for a statistician, y = a + b*x + c*x^2 is a linear model, but y = a*exp(b*x) is not. If this paragraph confuses you, ignore it.
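For the curious, the statistician's sense of "linear" can be made concrete: a quadratic in x is still linear in the parameters, so it can be fit by ordinary linear least squares on the columns 1, x, and x^2. A sketch with simulated (hypothetical) data:

```python
import numpy as np

# Simulated data from y = 1 + 2*x + 3*x^2 plus noise (hypothetical
# true parameters chosen for illustration).
rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x + 3 * x ** 2 + rng.normal(0, 0.5, 100)

# Linear in the parameters (a, b, c), so plain least squares works.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # estimates close to the true parameters 1, 2, 3
```

By contrast, y = a*exp(b*x) cannot be written as a linear combination of known columns with unknown coefficients, so it requires nonlinear fitting (or a transformation).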
Resources
• Line of averages
• Give your students practice reading and interpreting computer printouts from regressions. This has appeared on the AP exam.
• Some experiments that made good linear relationships: http://courses.ncssm.edu/math/Stat_inst01/PDFS/helilinear.pdf
• Recommend sections of The Statistical Sleuth for background reading.