Unit 2: Two-Variable Relationships
Teaching Tips
• Regression is a complex topic that we cover fairly quickly. We'll cover more of it in the next unit, but now is a good time to give you some organizing principles. I like to teach regression in the same order I would analyze data:
   1. Plot y versus x and give a verbal description in the context of the data. In particular, we want to know whether the relationship looks linear.
   2. If it looks linear, compute the correlation to quantify the strength of the relationship.
   3. If it looks linear, compute the regression line as a means of making predictions about future observations, of quantifying the rate of "change" between x and y, or of understanding how "typical" y-values differ for different values of x.
   4. Look at the residuals to see whether the relationship really is linear. You already checked this with the scatterplot, but the residual plot gives a "sharper" picture: it focuses your vision and might reveal non-linearity that you couldn't see before. Also check for influential outliers that might affect your interpretation of the slope.
   5. If you still think the data are linear, write an interpretation of the intercept (if applicable) and the slope, taking into account any influential observations.
   6. Assuming the data are linear, calculate r-squared. This functions as a bit of "currency" for consumers of your regression. If your r-squared is "good", they'll buy your explanation. If not, they might think twice. For example, I might have confirmed to everyone's satisfaction that there is a linear relationship between the amount of protein in a city's diet and the average price of steak in that city. But if my r-squared is very low, there is so much variability that my regression line might be of little practical use.
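The numerical parts of this workflow can be sketched in a few lines of Python. This is only an illustration, not part of the course materials: the data values are made up, NumPy is an assumed tool (the course itself uses Fathom), and the plotting steps are left to whatever software you prefer.

```python
import numpy as np

# Made-up illustrative data (not from a real study):
# x = fathers' heights in inches, y = sons' heights in inches.
x = np.array([64.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0])
y = np.array([66.5, 67.0, 68.5, 68.0, 69.5, 69.0, 70.5, 70.0, 72.0])

# The scatterplot of y versus x comes first; draw it in your plotting tool.

# Quantify the strength of the linear relationship with the correlation.
r = np.corrcoef(x, y)[0, 1]

# Fit the least-squares regression line y-hat = intercept + slope * x.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, 1)

# Residuals = observed y minus predicted y. Plot these against x
# to look for curvature or influential points.
predicted = intercept + slope * x
residuals = y - predicted

# r-squared: the fraction of the variation in y accounted for by the line.
ss_residual = np.sum(residuals ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_residual / ss_total

print(f"r = {r:.3f}")
print(f"y-hat = {intercept:.2f} + {slope:.3f} * x")
print(f"r-squared = {r_squared:.3f}")
```

For a simple linear regression, the `r_squared` computed from the residuals agrees with the square of `r`, which is a handy check for students to run.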
• Interpretation of the linear model is key. Emphasize that the regression line is a line of averages and does not specify individual outcomes.
• When it comes to interpretation, the slope often carries quite a bit of meaning and the intercept does not. It is not unusual for the intercept to be meaningless: many variables, height for example, never take on the value 0.
• Interpreting the slope in the context of the data collected, without implying a causal connection between x and y, is hard. Be very careful about using the algebraic interpretation of slope: "as x goes up by 1 unit, y changes by b units." One reason for caution is that quite often in a data set we do not get to observe x changing for any given observational unit. If y represents sons' heights and x represents fathers' heights, then interpreting the slope to mean "as a father's height increases by 1 inch, the average son's height goes up by b inches" is nonsensical, since fathers do not grow.
• For most observational studies (the fathers' and sons' heights in the previous bullet are a good example), a good interpretation of the slope compares how y-values differ for different x-values. For example, if the slope between father and son heights were 0.5, we could correctly interpret it to mean "Sons whose fathers' heights differ by 1 inch differ in average height by 0.5 inches."
• The regression of y on x is not the same as that of x on y. However, the correlation between y and x is the same as the correlation between x and y.
• There are two parts to the correlation coefficient, and each provides different information: the sign tells the direction of the trend, and the magnitude (absolute value) tells the strength.
• When using the regression line for prediction, the r-squared value is important. When applying the model to real-world contexts, the interpretation of the slope is important because it describes the relationship between the variables.
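The point above about regressing y on x versus x on y is easy to demonstrate numerically. The sketch below uses made-up height data and NumPy (both are assumptions for illustration, not course materials). If the two regressions described the same line, the x-on-y slope would be the reciprocal of the y-on-x slope; it is not, even though the correlation is identical either way.

```python
import numpy as np

# Made-up illustrative data: fathers' (x) and sons' (y) heights in inches.
x = np.array([64.0, 66.0, 67.0, 68.0, 69.0, 70.0, 71.0, 72.0, 74.0])
y = np.array([66.5, 67.0, 68.5, 68.0, 69.5, 69.0, 70.5, 70.0, 72.0])

# Regression of y on x: predict sons' heights from fathers' heights.
slope_y_on_x, intercept_y_on_x = np.polyfit(x, y, 1)

# Regression of x on y: predict fathers' heights from sons' heights.
slope_x_on_y, intercept_x_on_y = np.polyfit(y, x, 1)

# Correlation does not care which variable is listed first.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

print(f"slope of y on x:               {slope_y_on_x:.3f}")
print(f"slope of x on y:               {slope_x_on_y:.3f}")
print(f"reciprocal of y-on-x slope:    {1.0 / slope_y_on_x:.3f}")
print(f"correlation (x, y) and (y, x): {r_xy:.3f}, {r_yx:.3f}")
```

The two slopes coincide with each other's reciprocals only when the correlation is exactly +1 or -1; in general their product equals r-squared.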
• In introductory courses, we plot residuals versus x, but in more advanced courses (and many software packages) residuals are plotted against the predicted values of the y variable. The two methods are equivalent when there is only one predictor variable, because the predicted values are then a linear function of x. In advanced courses we do "multiple regression", in which there are several predictors, and in that context it makes more sense to plot the residuals against the predicted values.
• Outliers might or might not be influential points, but when looking for influential points, they are a good place to start. Influential points are points that, if moved or removed, would result in a drastically different slope. The word "drastically" is not a technical term. See the Activity tab for some examples of this.
• Correlation appears in several guises. First, it provides a quick, numerical summary of the "strength" of a linear relationship. In this context, correlation only makes sense if the relationship is indeed linear. Second, the slope of the regression line is proportional to the correlation coefficient: slope = r * (SD of y) / (SD of x). Third, the square of the correlation, called "r-squared", measures the "fit" of the regression line to the data. The standard phrase is: r-squared measures the percent of variation in y explained by the variable x. This sentence, while precise, has no meaning to most people and will need to be carefully explained. (See the demonstration.) However, r-squared is easy to use: if it is low (near 0), there is still a lot of unexplained variability in the data. If it is close to 1, the regression line does a good job of fitting the data. I like to explain r-squared in the context of prediction: a high r-squared means that if you tell me the value of x, my prediction of y will be pretty close to what we actually observe. But if r-squared is low, my prediction might be pretty far off.
• Some software packages allow you to force the intercept to be 0.
This is almost always a bad idea, even if you think the intercept must be 0.
Student Misconceptions and Confusions
• Students confuse the interpretation of the regression equation with that of the linear equation in algebra. In algebra, the x-values can change. In regression, the x-values are data values for the explanatory variable of interest and are fixed.
• Students will want to describe the predicted value as if it were an exact, deterministic value, rather than the average that it is.
• Sometimes students will equate a steep slope with a high value of the correlation coefficient. This is an easy mistake to make, because the slope does depend directly on the correlation coefficient. However, the ratio of the standard deviation of y to the standard deviation of x plays an equal role, so one should not think "steep slope == high r".
• High correlation does not mean that the linear model is good, and low correlation does not mean that the linear model is inappropriate. The correlation coefficient measures the data's proximity to a straight line, but it does not measure the appropriateness of the linear model.
• As mentioned in the main concepts, correlation does not imply causation. Just because two variables are correlated (have an association), it does not mean that the independent (or explanatory) variable causes the dependent (or response) variable to behave the way it does. A controlled experiment is necessary to conclude causation.
• Beware of extrapolation! Many physical phenomena are linear over a short range of values but fail to be linear over a greater range. The moral of this story is that just because a trend appears to be linear for the data you observed does not mean it will be linear for data beyond this range.
• You might notice some confusion over the use of the term "linear models" as you read Statistics texts. Statisticians use "linear model" to mean any model that is linear with respect to the parameters of the model.
But most introductory texts use "linear model" to mean linear with respect to the data. This means that for a statistician, y = a + b*x + c*x^2 is a linear model, but y = a*exp(b*x) is not. If this paragraph confuses you, ignore it.
Resources
• Line of averages
• Give your students practice reading and interpreting computer printouts from regressions. This has appeared on the AP exam.
• Some experiments that made good linear relationships: http://courses.ncssm.edu/math/Stat_inst01/PDFS/helilinear.pdf
• Recommend sections of The Statistical Sleuth for background reading.
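Returning to the note on "linear models" above: one way to see why a statistician calls y = a + b*x + c*x^2 linear is that it can be fit with ordinary least squares, using one column of known values per parameter. The sketch below is a minimal illustration under stated assumptions: the data are generated (noise-free) from hypothetical parameter values a = 1, b = 2, c = 0.5, and NumPy is an assumed tool.

```python
import numpy as np

# Hypothetical, noise-free data generated from y = 1 + 2*x + 0.5*x^2.
x = np.linspace(-3.0, 3.0, 13)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# Design matrix: one column per parameter (a, b, c). The curve bends
# in x, but y is a plain weighted sum of these three columns, which is
# what "linear in the parameters" means.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Ordinary least squares recovers the three weights.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coef

print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")

# By contrast, y = a*exp(b*x) cannot be written as a weighted sum of
# fixed columns, because the unknown b sits inside the exponential.
```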