Unit 14: Regression Revisited

Data Analysis & Activity

Activity 1

This activity will (hopefully) reinforce the idea of slope and intercept as statistics, each varying from sample to sample. We will use Beth Chance's regression sampling applet:

http://statweb.calpoly.edu/chance/applets/regcoeff/regcoeff.html

The applet allows you to select the population slope and intercept, which in turn determine the population regression line (in yellow). You can also choose a mean and standard deviation for the x-values, and finally the population standard deviation for the responses about the regression line. For now, let's all be consistent:

• Set the population slope to 1.5 and the population intercept to 2. Keep all other values the same.

• Click the Set Population button to create your new population of data. (Note: If you would like to see more of the graph, you can change the window frame using the gray boxes along the four sides of the graph.)

At the bottom of the page, you should see the equation y = 1.50x+2. That is the population equation.

We will now sample from the population displayed on the graph (the blue dots).

• Hit the Draw Samples button once. The applet randomly selects a sample of points (in red; n = 80 is the default). Then the applet calculates the least squares regression line for those n points and graphs that line in red. The equation of the line appears at the bottom of the page. Is it exactly the same as our population equation? Do the graphs line up exactly? Why not?

• Hit the Draw Samples button a few more times, just to see how the samples --and, hence, the resulting least squares regression lines-- differ from sample to sample. This is an illustration of sampling variability.

• Change the "num samples" from 1 to 100, then click Draw Samples. The applet will superimpose all 100 sample least squares lines onto the graph (the "wave" of red) and launch a window with dot plots of the sampling distributions of the slope and intercept. Focus on the slopes: what do you see? Is the center of the dot plot reasonably close to 1.5? Do you notice a shape forming?

• Before you close the dot plot window, write down the standard deviation of the slopes; you will compare against it later.

Now let's see how varying the other "parameters" of the applet changes things.

• Hit the Reset button.

• Change the value of sigma from .45 (the default) to 2.45, and click Set Population. What happened to the population graph? Remember, sigma is the standard deviation of the y-values about the regression line.

• Once again, take 100 samples of size n = 80 and look at the sampling distribution of the slopes. What happened to their spread? (That is, did the standard deviation of the slopes increase or decrease, compared to the value you noted earlier?) Is this what you would expect to happen for a larger value of sigma?

• Change sigma back to .45 (the default) and click Set Population. For our next illustration, change the sample size from 80 to 20. Again, take 100 samples and look at the resulting slopes. What happened to the spread of the slopes this time? Is this what you would expect to happen for a smaller sample size?

• Finally, change the sample size back to 80 (the default). How do the x-values play a role? To find out, change "x std" (the standard deviation of the x-values) from 1.84 to 4.84, then take 100 samples again. What happened to the spread of the slopes? Does this result surprise you?


Assuming the simulations went according to plan, we should have found three patterns among the variety of lines provided by the variety of random samples:

1) The larger the standard deviation of the responses about the line, the more widely our estimates of the slope will vary.

2) The variability of the sampling distribution of the slopes is larger for smaller sample sizes.

3) Slopes across different samples are less variable when the x-values are more variable.
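
If the applet is unavailable, the three patterns can be checked with a short simulation. The sketch below is not the applet's code: it draws each sample directly from the model (rather than from a fixed population of blue dots), assumes normally distributed x-values and errors, and uses an x mean of 10, which is a made-up value since the activity does not state the applet's default. The function names ls_slope and slope_sd are ours.

```python
import random
import statistics

def ls_slope(xs, ys):
    """Least squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

def slope_sd(sigma, n, x_std, num_samples=500,
             slope=1.5, intercept=2.0, x_mean=10.0, seed=1):
    """Standard deviation of the sample slopes across num_samples samples.

    x_mean=10.0 is an assumed value; it does not affect the spread of slopes.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(num_samples):
        xs = [rng.gauss(x_mean, x_std) for _ in range(n)]
        ys = [intercept + slope * x + rng.gauss(0, sigma) for x in xs]
        slopes.append(ls_slope(xs, ys))
    return statistics.stdev(slopes)

# The four settings used in the activity.
baseline = slope_sd(sigma=0.45, n=80, x_std=1.84)
bigger_sigma = slope_sd(sigma=2.45, n=80, x_std=1.84)
smaller_n = slope_sd(sigma=0.45, n=20, x_std=1.84)
wider_x = slope_sd(sigma=0.45, n=80, x_std=4.84)

print(f"baseline (sigma=.45, n=80, x std=1.84): {baseline:.4f}")
print(f"sigma raised to 2.45:                   {bigger_sigma:.4f}")
print(f"sample size cut to 20:                  {smaller_n:.4f}")
print(f"x std raised to 4.84:                   {wider_x:.4f}")
```

All three patterns follow from one formula: for a given set of x-values, the standard deviation of the sample slope is sigma / (s_x * sqrt(n - 1)), where s_x is the sample standard deviation of the x-values. A larger sigma inflates the numerator; a larger n or a larger s_x inflates the denominator.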


Activity 2

This analysis project condenses an in-class activity created by Mary Mortlock and Matt Carlton. For the full version of the activity (and other similar activities), check out http://statweb.calpoly.edu/carltonm/food/index.html.

Recently, Mary had each of her students measure his/her hand span and then try to grab as many Tootsie Pops as possible from a large bowl. The goal: predict the number of Tootsie Pops a person can grab, based upon his/her hand span. You can use the data from Mary's class: http://schematyc.stat.ucla.edu/unit_14/tootsiepops.txt (a tab-delimited text file). Or, collect your own data! Please discuss your findings on the discussion board (if your instructor is using one, of course).


Part I: Descriptive Statistics (review of previous material)


(a) We want to use hand span to predict the number of Tootsie Pops a person can pick up. Which is the explanatory variable, and which is the response variable?
(b) Create a scatter plot of the data. Describe all the features you see.
(c) Compute and interpret r and r^2 for this data.
(d) Compute the least squares regression line for this data. What are the meanings of the intercept and the slope in this context? Do they make sense?
(e) Make a residual plot. Identify and discuss any outliers or influential points.
(f) Predict the number of Tootsie Pops picked up by someone with a hand span of 22 cm and someone with a hand span of 27 cm. Which prediction do you feel is more reliable, and why?
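
For anyone working outside Fathom, the computations in (c), (d), and (f), plus the residuals needed for (e), can be sketched in a few lines of Python. The numbers below are placeholder values invented for illustration; they are not Mary's class data, so nothing printed here (including the range check) says anything about the real answers. Substitute the two columns from the tab-delimited file. The function name fit_line is ours.

```python
import math
import statistics

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least squares line of ys on xs."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Placeholder values invented for illustration only; NOT Mary's class data.
hand_span_cm = [18.5, 19.0, 20.5, 21.0, 22.0, 23.5, 24.0]
pops_grabbed = [14, 13, 17, 18, 19, 23, 22]

slope, intercept, r = fit_line(hand_span_cm, pops_grabbed)

# Residual = observed response minus the response predicted by the line.
residuals = [y - (intercept + slope * x)
             for x, y in zip(hand_span_cm, pops_grabbed)]

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"predicted pops = {intercept:.2f} + {slope:.2f} * hand span")
for span in (22, 27):
    # Flag whether the requested x-value lies inside the observed x-range.
    inside = min(hand_span_cm) <= span <= max(hand_span_cm)
    print(f"span {span} cm: predicted {intercept + slope * span:.1f}, "
          f"within observed range: {inside}")
```

Plotting residuals against hand_span_cm gives the residual plot for (e).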


Part II: Statistical Inference


We now want to determine whether there exists a true linear relationship between hand span and the number of Tootsie Pops a person can pick up.
(g) What is the relevant parameter?
(h) State the appropriate null and alternative hypotheses.
(i) What conditions must be satisfied to validly conduct this hypothesis test? (Be aware that you can't check all of the assumptions using what you've learned so far.)
(j) Look at the residual plot again. Does this plot indicate one or more of our conditions is satisfied or violated?
(k) Create a normal probability plot of the residuals. Does this plot indicate one of our conditions is satisfied or violated?
(l) Test your hypotheses at the 5% significance level. Be sure you include your test statistic (with d.f.) and the p-value of your test.
(m) What is your conclusion? Be specific and be sure your conclusion is in context of the problem.
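
The test statistic and p-value in (l) can be computed by hand as a check on software output. The sketch below assumes the usual setup: a null hypothesis that the population slope is zero, a two-sided alternative, and a t statistic with n - 2 degrees of freedom. To stay within the Python standard library it obtains the p-value by numerically integrating the t density, which is our choice for this sketch; a statistics package would normally do this step. The data are placeholder values invented for illustration, not Mary's class data, and the function names are ours.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (area under the t density from 0 to |t|).

    The area is approximated with Simpson's rule (steps must be even).
    """
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * (total * h / 3))

def slope_t_test(xs, ys):
    """Return (slope, standard error, t, df, p) for H0: population slope = 0."""
    n = len(xs)
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # estimate of sigma about the line
    se = s / math.sqrt(sxx)        # standard error of the slope
    t = slope / se
    df = n - 2
    return slope, se, t, df, two_sided_p(t, df)

# Placeholder values invented for illustration only; NOT Mary's class data.
hand_span_cm = [18.5, 19.0, 20.5, 21.0, 22.0, 23.5, 24.0]
pops_grabbed = [14, 13, 17, 18, 19, 23, 22]

slope, se, t, df, p = slope_t_test(hand_span_cm, pops_grabbed)
print(f"slope = {slope:.3f}, SE = {se:.3f}, t = {t:.2f}, df = {df}, p = {p:.4f}")
```

The standard error here is the sample version of the quantity explored in Activity 1: the estimated sigma divided by the square root of the sum of squared deviations of the x-values.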