• Exploratory Data Analysis
• Comparing Groups
- As mentioned in the previous units, confidence intervals and
hypothesis tests help students use data to answer the simple
investigative questions they may have developed about a particular
population. Refer students back to the first unit of the course.
- In order to evaluate whether assumptions hold, students need to
look at the shape, center, and spread of the sample distribution.
- Comparing groups has been a theme of the course since the
first week. The methods presented in this chapter will finally help
students to more formally answer the questions they may have asked in
Unit 1: “Is the observed difference between two groups due to chance?”
or “Is the observed difference between groups large enough to be real?”
- We can now make inferences about the difference between
two populations using two samples, one from each population. As the
previous units discussed, it is important to reiterate that we need
only one sample from each population to make inferences.
- Again, two-sample intervals and tests still rely on
sampling distributions and the Central Limit Theorem (if assumptions
are met), so it is important to show students how these concepts all
fit together.
- If the Central Limit Theorem applies, students will need
to recall normal models and z-scores. Remember: their samples may not
be normal, but the sampling distribution of the difference between
their sample means or proportions is approximately normal under the
Central Limit Theorem.
- It is very important to draw students back to the informal
inference done earlier using simulations. Show them the same example
using both methods so they can see that both achieve the same goal;
the formal method is just a shortcut/approximation that may be faster
to carry out.
- If students can understand the intuition behind
simulation-based inference, then formal hypothesis testing is just a
shortcut we can use when assumptions are met that does the same thing
with less work.
- A p-value is just how likely we are to get the observed
statistic, or something more extreme, if we assume a certain
model/hypothesis is true. This is the case whether we use the normal
model (and look at the shaded region beyond the observed value) or use
a model built from many simulations (and count the simulated
statistics at or beyond the observed value).
- In this course, we assume a sampling distribution follows
a model with some hypothesized mean. When dealing with two samples, we
use the sampling distribution of the difference between sample
statistics from two populations, and the hypothesized mean is typically
zero (no difference between the populations).
- But no matter what, we assume some “chance” model that is
plausible for our estimate.
- It is plausible that our estimate comes from this model,
but how plausible given the mean of that model? If it’s not very
plausible, then we should look for a new model.
- The p-value is just a conditional probability; given the
null hypothesis, what is the probability of getting the observed
statistic (or something more extreme)?
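
The notes above on simulation versus the normal model can be shown to
students side by side. The sketch below is one possible illustration,
not a prescribed course activity: the data are made up, the function
names are hypothetical, the simulation shuffles group labels (a
permutation approach, which is only one way to build a "chance"
model), and the normal-model version uses an unpooled standard error.
Both functions answer the same question: if there were truly no
difference between the groups, how often would we see a difference in
sample means at least as extreme as the one observed?

```python
import math
import random
from statistics import NormalDist, fmean, variance


def normal_model_p_value(a, b):
    """Two-sided p-value from the normal model (hypothesized difference = 0)."""
    diff = fmean(a) - fmean(b)
    # Unpooled standard error of the difference in sample means (an assumption
    # of this sketch; the course may use a different formula).
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = diff / se
    # Area in both tails beyond the observed z-score (the "shaded region").
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulation_p_value(a, b, reps=5000, seed=1):
    """Two-sided p-value from shuffling group labels many times."""
    rng = random.Random(seed)
    observed = fmean(a) - fmean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # pretend group membership is due to chance alone
        sim = fmean(pooled[:len(a)]) - fmean(pooled[len(a):])
        # Count simulated differences at or beyond the observed one.
        if abs(sim) >= abs(observed):
            count += 1
    return count / reps


# Made-up data for illustration only: two groups of 40 scores each.
data_rng = random.Random(42)
group_a = [data_rng.gauss(70, 10) for _ in range(40)]
group_b = [data_rng.gauss(75, 10) for _ in range(40)]

p_normal = normal_model_p_value(group_a, group_b)
p_sim = simulation_p_value(group_a, group_b)
print(f"normal-model p-value: {p_normal:.3f}")
print(f"simulation p-value:   {p_sim:.3f}")
```

When the assumptions are reasonable, the two p-values should land
close to each other, which supports the point that the formal test is
a shortcut for what the simulation does by brute force.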