Unit 6: Probability Essentials

Data Analysis & Activity

Activity 1

In this activity, we'll explore conditional probability through data and look at multiple ways of displaying categorical information. Our situation: the search for terrorists in the United States.

We will start with the hypothetical data suggested in a recent article (pdf) by John Allen Paulos. (Note: you should read the article after this Activity.)

• Suppose that 1,000 of the 300,000,000 residents of the United States are involved in terrorist activity.

• Suppose further that the government's terror detection system is "99% accurate" in two senses: 99% of all terrorists are flagged as terrorists, and 99% of all non-terrorists are correctly identified as non-terrorists.

Our goal is to answer the following question: If our terror detection system flags someone as a terrorist, what is the probability he really is a terrorist?

Let's first try to display the data we have in a table. We might begin this way:

| | Yes | No | Total |
|---|---|---|---|
| Terrorist? | | | |
| Flagged? | | | |
| Total | | | 300,000,000 |

Will this table format work? No: in this table, we cannot cross-classify individuals (e.g., where do we fill in the number of flagged terrorists?). In the language of probability, we must have mutually exclusive rows in our table. The events "Terrorist" and "Flagged" are not mutually exclusive. Students often make this mistake in building a contingency table, especially if the two category labels are similar, like "has a Visa card" and "has a MasterCard."

Instead, we have to format the table as below:

| | Terrorist | Non-terrorist | Total |
|---|---|---|---|
| Flagged | | | |
| Not Flagged | | | |
| Total | | | |

Use the hypothetical numbers from above to fill in all 9 cells of the table. I'll get you started: the total number of terrorists is 1,000, and the grand total number of people is 300,000,000. You can find two of the interior cells using the 99% accuracy rates, then the remaining cells are deduced by simple subtraction. Think you have the answer? Compare your numbers to the table below:

| | Terrorist | Non-terrorist | Total |
|---|---|---|---|
| Flagged | 990 | 2,999,990 | 3,000,980 |
| Not Flagged | 10 | 296,999,010 | 296,999,020 |
| Total | 1,000 | 299,999,000 | 300,000,000 |

We can now answer lots of questions about this hypothetical terror detection system. For example, the probability someone is really a terrorist, given that he was flagged, is just 990/3,000,980 = 0.033%. Scary, eh?

It might be clear to some of you why this percent is so low, but we'll explore this a little later in the Activity.

To comfort you, let's compute the probability someone is not a terrorist, given he is not flagged: 296,999,010/296,999,020 = 99.9999966%. Whew!
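If you'd like to check the table by machine, here is a minimal Python sketch of the fill-in steps described above. It uses only the hypothetical numbers from this Activity; the variable names are my own and are not part of the original exercise.

```python
# Hypothetical inputs from the Activity (not real-world data).
population = 300_000_000
terrorists = 1_000
flag_rate_pct = 99    # percent of terrorists who are flagged
clear_rate_pct = 99   # percent of non-terrorists who are not flagged

non_terrorists = population - terrorists

# Two interior cells come straight from the 99% accuracy rates.
# Integer arithmetic keeps the counts exact.
flagged_terrorists = terrorists * flag_rate_pct // 100
unflagged_non_terrorists = non_terrorists * clear_rate_pct // 100

# The remaining cells follow by subtraction and addition.
unflagged_terrorists = terrorists - flagged_terrorists
flagged_non_terrorists = non_terrorists - unflagged_non_terrorists
total_flagged = flagged_terrorists + flagged_non_terrorists
total_unflagged = unflagged_terrorists + unflagged_non_terrorists

# Conditional probabilities: divide a cell by its row total.
p_terrorist_given_flagged = flagged_terrorists / total_flagged
p_innocent_given_unflagged = unflagged_non_terrorists / total_unflagged

print(f"P(T | F)   = {p_terrorist_given_flagged:.3%}")
print(f"P(Tc | Fc) = {p_innocent_given_unflagged:.7%}")
```

Changing `population`, `terrorists`, or either rate at the top lets students see how every cell of the table responds.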

How do these numbers relate to the standard conditional probability formula? Let T and F stand for the events "Terrorist" and "Flagged" respectively. We want P(T|F), which supposedly equals P(T and F)/P(F). We still need the table to find either of these quantities.

P(T and F) is the probability someone is a terrorist and is flagged, equaling 990/300,000,000 according to the table. Likewise, P(F) is the probability someone is flagged, which the table shows to be 3,000,980/300,000,000. Take the ratio, and the denominators of 300,000,000 cancel.
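The cancellation can be confirmed with exact fractions. This is just a small sketch using Python's standard `fractions` module and the counts from the table; the variable names are my own.

```python
from fractions import Fraction

population = 300_000_000
p_t_and_f = Fraction(990, population)    # P(T and F), read from the table
p_f = Fraction(3_000_980, population)    # P(F), read from the table

# Dividing the two probabilities cancels the common denominator.
p_t_given_f = p_t_and_f / p_f

print(p_t_given_f == Fraction(990, 3_000_980))
print(float(p_t_given_f))
```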

A contingency table is the ideal way to display the relationship between two categorical variables, especially when we have whole numbers (rather than just percents). It takes some time to assemble the entire table, but it allows students to easily find conditional probabilities once the table is complete.

When we only have percentages, and some of those percentages are conditional probabilities, a tree diagram is ideal. In our terrorism problem, let's say we didn't know the population size, but only that P(T) = t; i.e., 100t% of all U.S. residents are involved in terrorist activity. Since that information is not conditional, Terrorist versus Non-terrorist (T and Tc) can form the first branches of our tree. You should draw the diagram yourself; one branch has probability t, so what's the other branch's probability?

From each of these primary branches, draw two secondary branches: Flagged and Not Flagged (F and Fc). Persist with "99% accuracy" for now: P(F|T) = 99%, and which other conditional probability equals 99%? What conditional probabilities are on the remaining two secondary branches?

With your tree diagram complete, you can again find P(T|F). Use the formula P(T|F) = P(T and F)/P(F) as your inspiration, and follow the branches.

Here's what you should get: the numerator has only one term, since only one "node" corresponds to T and F; its probability is t * 0.99. The denominator has two terms, since there are two "nodes" corresponding to event F; their collective probability is t * 0.99 + (1-t) * 0.01.

And so, P(T|F) = 0.99t/[0.99t + 0.01(1-t)]. If you plug in t = 1000/300,000,000, you'll get our answer from earlier.
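To verify that the tree diagram and the contingency table agree, you can code the tree formula directly. The sketch below assumes the same 99% accuracy in both senses; the function name `p_terrorist_given_flagged` is my own invention.

```python
def p_terrorist_given_flagged(t, accuracy=0.99):
    """P(T|F) from the tree diagram, with P(F|T) = P(Fc|Tc) = accuracy."""
    flagged_and_terrorist = t * accuracy              # the single T-and-F path
    flagged_and_innocent = (1 - t) * (1 - accuracy)   # the Tc-and-F path
    return flagged_and_terrorist / (flagged_and_terrorist + flagged_and_innocent)

t = 1_000 / 300_000_000
from_tree = p_terrorist_given_flagged(t)
from_table = 990 / 3_000_980

print(f"tree: {from_tree:.6%}   table: {from_table:.6%}")
```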

Two pedagogical notes, before this Activity really gets interesting. First, students master tree diagrams faster than you'd expect, especially if you lead them through the fundamentals a few times (unconditional probabilities on the first branches, multiply probabilities along paths, etc.). Second, you might recognize our fractional answer above as a form of Bayes' Formula, and you're right! Bayes' Formula is not on the AP syllabus, but your students should still be able to answer conditional probability problems like this terrorism question. (They're expected to use a tree diagram or construct a contingency table.)

Digging deeper
Why is P(T|F) so low? Or, equivalently, why is the "false positive" rate – which equals 1-P(T|F) in our scenario – so high?

We found that P(T|F) = 0.99t/[0.99t + 0.01(1-t)], where t equals the proportion of all U.S. residents involved in terrorist activity. Make a graph of this probability as a function of t (what is the domain of t?). What do you observe?

With the correct graph, you'll see that the "true positive" rate is low when t is extremely low, but that rate improves dramatically even for modest values of t. Using the Trace tool on your calculator (or basic algebra), find the lowest value of t for which the "true positive" rate is at least 90%. (You should get t = 8.33%.)
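For the "basic algebra" route, set 0.99t/[0.99t + 0.01(1-t)] equal to the target rate r and solve for t, which gives t = 0.01r/[0.99(1-r) + 0.01r]. Because P(T|F) increases with t, this is the smallest value of t that works. Here is a sketch of that calculation in Python, with function names of my own choosing and the accuracy left as a parameter.

```python
def p_terrorist_given_flagged(t, accuracy=0.99):
    """P(T|F) with P(T) = t and P(F|T) = P(Fc|Tc) = accuracy."""
    return accuracy * t / (accuracy * t + (1 - accuracy) * (1 - t))

def min_base_rate(target, accuracy=0.99):
    """Smallest t for which P(T|F) reaches the target rate."""
    miss = 1 - accuracy
    return target * miss / (accuracy * (1 - target) + target * miss)

t_min = min_base_rate(0.90)
print(f"t = {t_min:.2%}")
print(f"check: P(T|F) at that t = {p_terrorist_given_flagged(t_min):.4f}")
```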

What's going on mathematically? If almost nobody is a terrorist, then 99% of all terrorists is a tiny number relative to 1% of everyone else. So, the pool of "flagged" individuals consists almost entirely of innocent people who were flagged by mistake. In real life, the solution is obvious: anti-terrorist agencies use a set of criteria to narrow down the field of "likely" terrorists.

The same issue arises in medical testing, where you might have heard the term "false positive" before. If every American were screened for AIDS or some other rare disease, most positive tests would be false, and panic would ensue. Instead, the government does not mandate so-called "mass screening"; only those in high-risk groups are encouraged to get tested. Furthermore, those who receive a positive test result are encouraged to get re-tested.

Let's explore one last parameter: the "accuracy rate" of our detection system. For simplicity, we'll stick with the same parameter for both terrorists and non-terrorists: P(F|T) = P(Fc|Tc) = p. Then the conditional probability of a correct flag equals P(T|F) = p*t/[p*t + (1-p)*(1-t)].

So we can graph this function; let's set t = 10% to begin. Graph 0.10p/[0.10p + 0.90(1-p)], and you shouldn't be surprised at the result: our ability to flag the terrorists increases monotonically with p. How does the graph change if t = 0.01%? Plot both graphs on the same axes. Notice that when t is really small, we need p to be extremely large to have even a moderate "true positive" rate. We saw this with the original (hypothetical) data: even with 99% "accuracy," the proportion of terrorists among all flagged individuals was just 0.033%.
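If a graphing tool isn't handy, a table of values tells the same story. The sketch below evaluates the formula at a few accuracy rates that I picked for illustration; the function name is also my own.

```python
def p_terrorist_given_flagged(t, p):
    """P(T|F) when P(T) = t and P(F|T) = P(Fc|Tc) = p."""
    return p * t / (p * t + (1 - p) * (1 - t))

accuracies = [0.50, 0.90, 0.99, 0.999, 0.9999]

print("      p     t=10%   t=0.01%")
for p in accuracies:
    common = p_terrorist_given_flagged(0.10, p)     # t = 10%
    rare = p_terrorist_given_flagged(0.0001, p)     # t = 0.01%
    print(f"{p:7.4f}  {common:8.4f}  {rare:8.4f}")
```

Two features are worth pointing out to students. At p = 0.50 the flag carries no information, so P(T|F) is just t. And when t = 0.01%, p has to reach 99.99% before a flagged person is as likely as not to be a terrorist.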

Finally, if your calculator or computer can graph in 3D, explore what happens when P(F|T) and P(Fc|Tc) are not the same. Even if you can't make the graphs, at least work out how to adjust the formula. If P(F|T) = p1 and P(Fc|Tc) = p2, what is P(T|F)?
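Once you have worked out your own formula, one way to check it is to follow the tree diagram in code. The sketch below is my own arrangement, not part of the original Activity: the first branches carry t and 1-t, and the flagged branches carry p1 and 1-p2.

```python
def p_terrorist_given_flagged(t, p1, p2):
    """P(T|F) with P(T) = t, P(F|T) = p1, and P(Fc|Tc) = p2."""
    true_flags = t * p1                # terrorists who are flagged
    false_flags = (1 - t) * (1 - p2)   # non-terrorists who are flagged
    return true_flags / (true_flags + false_flags)

t = 1_000 / 300_000_000
same = p_terrorist_given_flagged(t, 0.99, 0.99)
better_p1 = p_terrorist_given_flagged(t, 0.999, 0.99)
better_p2 = p_terrorist_given_flagged(t, 0.99, 0.999)

print(f"p1 = 0.99,  p2 = 0.99:  {same:.4%}")
print(f"p1 = 0.999, p2 = 0.99:  {better_p1:.4%}")
print(f"p1 = 0.99,  p2 = 0.999: {better_p2:.4%}")
```

Comparing the last two lines shows which of the two accuracy rates matters more when t is tiny.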

Activity 2

1. Suppose that the UCLA Health Center administrators took a random sample of 635 UCLA students and asked them about various health and lifestyle habits. One question that they asked was, “Within the past month, did you drink alcohol often (more than 3 times a week), sometimes (1 or 2 times a week), or never?” The results are in the following two-way table.

| | Drink often | Drink sometimes | Never drink | Total |
|---|---|---|---|---|
| Live off campus | 50 | 100 | 47 | 197 |
| Live on campus | 111 | 221 | 106 | 438 |
| Total | 161 | 321 | 153 | 635 |

Are students who live off campus more likely to drink than those who live on campus? In other words, is drinking independent of housing choices? Support your answer with the data from the table. Note that to answer this, we're assuming that the proportions you calculate based on the table are very close to the proportions in the population. As it turns out, this is a good assumption for these data.

2. A middle school teacher was curious about the role of parent involvement (attending parent/teacher conferences) and student achievement in Pre-Algebra. She collected data in the following table.

| | Active parent participation | Some parent participation | No parent participation | Total |
|---|---|---|---|---|
| Above grade level | 8 | 3 | 1 | 12 |
| At grade level | 13 | 12 | 3 | 28 |
| Below grade level | 2 | 12 | 8 | 22 |
| Total | 23 | 27 | 12 | 62 |

Are students whose parents actively participate in school conferences more successful in Pre-Algebra than those whose parents do not take an active role? Again, to answer, we must assume that the proportions represented in this table are similar to those in the population. Another way around this is to assume that we will be randomly selecting from the parents and students represented in this table, and understand that the questions we ask are with respect to just these children.

3. Suppose that Proposition ZZ asks voters for additional funding to equip school libraries with high-speed internet connections. A sample of 96 registered voters was asked their party affiliation and whether or not they support Proposition ZZ. Results are displayed in the following table. Do we have evidence that party affiliation and position on Proposition ZZ are independent?

| | Democrat | Republican | Other | Total |
|---|---|---|---|---|
| Favor Prop ZZ | 24 | 18 | 30 | 72 |
| Oppose Prop ZZ | 8 | 6 | 10 | 24 |
| Total | 32 | 24 | 40 | 96 |
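All three questions come down to comparing conditional proportions across groups: if the distribution within each row is the same, the two variables are independent in the table. As a minimal sketch of that comparison, here is the drinking table from question 1 in Python. The helper name `row_proportions` and the dictionary layout are my own choices, and the other two tables can be passed in the same way.

```python
def row_proportions(table):
    """Divide each count by its row total, giving P(column | row)."""
    return {
        row: [count / sum(counts) for count in counts]
        for row, counts in table.items()
    }

# Counts from question 1, in the order: often, sometimes, never.
drinking_by_housing = {
    "Live off campus": [50, 100, 47],
    "Live on campus": [111, 221, 106],
}

proportions = row_proportions(drinking_by_housing)

print(f"{'':<16} {'often':>7} {'sometimes':>10} {'never':>7}")
for housing, (often, sometimes, never) in proportions.items():
    print(f"{housing:<16} {often:7.3f} {sometimes:10.3f} {never:7.3f}")
```

Have students decide for themselves how close two rows must be before they are willing to call the variables independent.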