Unit 12: Comparing Two Populations
Data Analysis & Activity
Students have a hard time understanding whether or not to treat a dataset as "paired." In this activity, we're going to see why "pairing" is important, and why it's wrong to treat paired data as two independent samples.
1) Download the acid rain data set and load it into Fathom.
2) This data set includes pH measurements from rain samples at 32 locations in a particular county. A substance at room temperature is
The first variable, "lastyear", consists of measurements taken from water samples from 32 locations last year. County officials are concerned that the rainwater is becoming more acidic, which would indicate a problem with pollution. The second variable, "thisyear", consists of measurements taken from the same sites as in "lastyear". Our goal is to determine whether the mean pH level has fallen since last year.
As a first step, we'll consider the variables separately (unpaired). Make an appropriate graphical summary. (a) Describe the graph. (b) What would you conclude?
3) Find a 95% confidence interval for the difference in the mean pH level for the current year and for last year. What do you conclude about the difference of the means?
4) Perform a hypothesis test with a 5% significance level to see whether the mean pH level is lower this year.
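Steps 3 and 4 can also be sketched outside Fathom. The Python sketch below treats the two years as independent samples, which is the unpaired analysis asked for here. The data are simulated stand-ins, not the acid rain file; the function name `unpaired_summary`, the seed, and the simulation settings are all assumptions made for illustration. It also uses a normal critical value in place of the t value, so its interval is slightly narrower than a t-based one.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Hypothetical stand-in data (NOT the acid rain file): 32 sites, each with
# its own baseline pH, measured in two years with a small drop in year two.
random.seed(12)
site = [random.gauss(5.0, 0.5) for _ in range(32)]
lastyear = [s + random.gauss(0.0, 0.1) for s in site]
thisyear = [s - 0.1 + random.gauss(0.0, 0.1) for s in site]

def unpaired_summary(x, y, level=0.95):
    """Treat x and y as two independent samples (ignores any pairing)."""
    diff = mean(x) - mean(y)
    # Standard error of a difference of two independent sample means.
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    # Normal approximation to the critical value (about 1.96 for 95%).
    k = NormalDist().inv_cdf(0.5 + level / 2)
    z = diff / se
    # One-sided p-value for the alternative "mean of x is lower than mean of y".
    p_lower = NormalDist().cdf(z)
    return diff, se, (diff - k * se, diff + k * se), p_lower

diff, se, ci, p = unpaired_summary(thisyear, lastyear)
print(f"difference in means: {diff:.3f}, SE: {se:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), one-sided p-value: {p:.3f}")
```

To try it on the real data, replace the two simulated lists with the "lastyear" and "thisyear" columns.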
Now we'll "pair" the data. The reason for pairing is that the measurements in each year come from the same sites. So, for example, the first value in "lastyear" is taken from the same site as the first value in "thisyear". Because of this, we have good reason to suspect that the two samples are not independent, which violates an assumption of the two-sample procedures used above.
5) Create a new variable called "diff". In Fathom, click twice on the collection to open the inspector. (If the inspector is already open, skip this step.) On the inspector, click on the "New Attribute" field and name the new attribute "diff". Click twice on the "formula" field that corresponds to this new attribute. Type "thisyear-lastyear" ; the new variable will now have the value of this year's value minus last year's.
6) Make an appropriate graphical summary of the difference between the two years. Describe the graph and state your (preliminary) conclusion.
7) Find a 95% confidence interval for the mean of the differences. What do you conclude?
8) Perform a hypothesis test with a 5% significance level on the mean of the differences to see if the pH level is lower this year.
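The paired analysis in steps 5 through 8 can be sketched in Python as well. The sketch below works on the site-by-site differences, mirroring the "diff" attribute. The data are simulated stand-ins rather than the acid rain file, and the function name `paired_summary`, the seed, and the simulation settings are assumptions for illustration. A normal critical value stands in for the t value with n-1 degrees of freedom.

```python
import math
import random
from statistics import NormalDist, mean, stdev

# Hypothetical stand-in data (NOT the acid rain file): 32 sites, each with
# its own baseline pH, measured in two years with a small drop in year two.
random.seed(12)
site = [random.gauss(5.0, 0.5) for _ in range(32)]
lastyear = [s + random.gauss(0.0, 0.1) for s in site]
thisyear = [s - 0.1 + random.gauss(0.0, 0.1) for s in site]

def paired_summary(x, y, level=0.95):
    """One-sample analysis of the within-pair differences x[i] - y[i]."""
    d = [xi - yi for xi, yi in zip(x, y)]   # plays the role of "diff"
    dbar = mean(d)
    # Standard error of the mean difference uses the SD of the differences.
    se = stdev(d) / math.sqrt(len(d))
    # Normal approximation to the critical value (about 1.96 for 95%).
    k = NormalDist().inv_cdf(0.5 + level / 2)
    z = dbar / se
    # One-sided p-value for the alternative "mean difference is below zero".
    p_lower = NormalDist().cdf(z)
    return dbar, se, (dbar - k * se, dbar + k * se), p_lower

dbar, se, ci, p = paired_summary(thisyear, lastyear)
print(f"mean of differences: {dbar:.3f}, SE: {se:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), one-sided p-value: {p:.3f}")
```

The point estimate is the same as the difference of the two yearly means; only the standard error changes, because it is computed from the spread of the differences rather than from the spread of each year separately.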
Why did we get different conclusions?
9) To see this, make a scatterplot of "thisyear" against "lastyear". What's the correlation?
10) Remember that the width of a confidence interval depends on the standard error of the estimator; the width of a confidence interval is 2*K*SE, where K is the critical value for the chosen confidence level. Let Xbar represent the average of this year, and Ybar represent the average of last year. Then Var(Xbar-Ybar) = sigma2x/n + sigma2y/n - 2*rho*sigmax*sigmay/n, where rho is the correlation between X and Y. (The last term is 2*Cov(Xbar, Ybar); with n paired observations, Cov(Xbar, Ybar) = rho*sigmax*sigmay/n, so it carries the same 1/n factor as the other two terms.) (To refresh your memory, review the data collection section of Unit 7.)
For this dataset the correlation between X and Y is about 0.9. The standard error is the square root of Var(Xbar-Ybar). You can see that if we were to ignore the fact that X and Y are correlated, which is equivalent to setting rho=0, we would get a wider confidence interval. So when X and Y are positively correlated, ignoring the pairing gives a confidence interval that's too wide, and you might miss an interesting difference in means. On the other hand, if X and Y are negatively correlated, you'll get a confidence interval that's too narrow, and might mistakenly think there's a difference when there's really not! This illustrates what can happen when you do an "unpaired" test with paired data.
If Xbar represents the average of thisyear, and Ybar the average of lastyear, then the standard error is SD(Xbar-Ybar). Calculate this assuming (a) Xbar and Ybar are independent and (b) assuming the true correlation is the same as the sample correlation that you calculated in (9).
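As a rough illustration of how much the correlation term matters, the sketch below evaluates SD(Xbar-Ybar) = sqrt((sigmax^2 + sigmay^2 - 2*rho*sigmax*sigmay)/n) for rho=0 and rho=0.9. The standard deviations of 0.5 are made-up values chosen for easy arithmetic, not estimates from the acid rain data, and the function name `se_diff_of_means` is an assumption for illustration.

```python
import math

def se_diff_of_means(sd_x, sd_y, n, rho=0.0):
    """SD of Xbar - Ybar for n pairs with correlation rho between X and Y."""
    variance = (sd_x ** 2 + sd_y ** 2 - 2 * rho * sd_x * sd_y) / n
    return math.sqrt(variance)

# Hypothetical inputs: equal SDs of 0.5 pH units and n = 32 sites.
sd_x, sd_y, n = 0.5, 0.5, 32

se_indep = se_diff_of_means(sd_x, sd_y, n)             # (a) rho = 0
se_paired = se_diff_of_means(sd_x, sd_y, n, rho=0.9)   # (b) rho = 0.9

print(f"SE assuming independence: {se_indep:.4f}")   # 0.1250
print(f"SE with rho = 0.9:        {se_paired:.4f}")  # 0.0395
print(f"ratio: {se_paired / se_indep:.3f}")          # 0.316
```

With equal standard deviations the ratio is sqrt(1 - rho), so a correlation of 0.9 shrinks the standard error, and hence the interval width, to roughly a third of the value obtained by assuming independence.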
The next data set, on Major League Baseball attendance, comes from the "Datasets and Stories" article "Data Management, Exploratory Data Analysis, and Regression Analysis with 1969-2000 Major League Baseball Attendance" in the Journal of Statistics Education (Cochran 2002, www.amstat.org/publications/jse/v10n2/datasets.cochran.html).
However, you should access the data here: http://schematyc.stat.ucla.edu/unit_12/mlbdata.txt
This file has several potential response variables. Your goal is to see how the American and National leagues compare on these variables. Some questions to consider:
a) What "research" question does a particular hypothesis test help answer?
b) Are the assumptions supported for a particular hypothesis test? Is a hypothesis test even needed? Does comparing means make sense?
c) Note that we have ALL data from 1969 to 2000. Is a statistical test meaningful in this context? To which population are we making inferences? What must we assume to make this test meaningful?