Exploring & Describing Data



Marketers are often interested in knowing demographics about particular regions. One of the most important things you can know about an area, if you are intending to sell something to people who live there, is what sort of incomes people make.

Open the Fathom file called CA_includesBeverlyHills6200.ftm. This is an excerpt (of just 500 people) from data collected as part of the 2000 U.S. census on residents in Beverly Hills and surrounding areas. We're interested in knowing what sort of incomes people in this area make. Open the inspector and flip through some cases, paying attention to the income variable. What do you see?
One thing you'll notice, so unsurprising that you might not have thought to mention it, is that not everyone has the same income. This is a fundamental feature of variables in Statistics (and something that differentiates them from variables in Mathematics): variables vary. There's no single "right" value. So instead we have to find ways of summarizing this variability.
You'll also, no doubt, notice that there's a surprising number of 0's, and you might think of many reasons why this is so. Perhaps these people lied or refused to answer the census agents. Perhaps they are children. Perhaps they are illegal immigrants. Perhaps it has something to do with the way the data were recorded. Some of these questions can be answered by examining other variables. But some questions can only be answered by reference to things outside the data set itself.
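If you'd like to see how "examining other variables" might work outside of Fathom, here is a minimal Python sketch. It checks whether the zero incomes belong mostly to children. The records are made up for illustration (they are not the census excerpt), and the cutoff of 18 for "adult" is our own assumption.

```python
# Hypothetical (age, income) records, invented for illustration only.
# They are NOT taken from the census file.
people = [
    (4, 0), (9, 0), (16, 0), (23, 18000), (31, 0),
    (38, 52000), (45, 97000), (52, 0), (67, 24000), (71, 0),
]

def zero_income_by_group(records, adult_age=18):
    """Count zero-income cases separately for children and adults.

    adult_age is an assumed cutoff, not something defined by the census.
    """
    counts = {"child": 0, "adult": 0}
    for age, income in records:
        if income == 0:
            group = "child" if age < adult_age else "adult"
            counts[group] += 1
    return counts

counts = zero_income_by_group(people)
print(counts)
```

In this toy example, age accounts for only some of the zeros. The rest would need another variable, or information from outside the data set, to explain.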
This is, in part, what we mean when we say that data are "numbers in context." The numbers you see were influenced by scientific, logistical, bureaucratic, and social factors. It is important for statisticians to understand this context as much as possible.

This context determines to a large extent how we'll go about summarizing the data. For example, if I'm selling cars, I might wonder what percent of the residents can afford a Porsche (or a Toyota). Or I might want to understand how likely it is that residents will have extra income to donate to my charity. I might want to compare incomes of different age groups. Or I might be interested in understanding if there are gender differences in income.
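Each of these questions points to a different calculation. As a rough sketch in Python (again with invented records, an arbitrary threshold of 100,000, and a hypothetical two-level group label), the "what percent can afford it" question becomes a proportion, and the group-comparison question becomes one summary per group:

```python
from statistics import median

# Hypothetical (group, income) records, invented for illustration only.
records = [
    ("F", 0), ("F", 30000), ("F", 45000), ("F", 120000),
    ("M", 0), ("M", 25000), ("M", 60000), ("M", 150000),
]

def share_at_least(incomes, threshold):
    """Proportion of incomes at or above the threshold."""
    incomes = list(incomes)
    return sum(1 for x in incomes if x >= threshold) / len(incomes)

def median_by_group(rows):
    """Median income within each group label."""
    groups = {}
    for label, income in rows:
        groups.setdefault(label, []).append(income)
    return {label: median(values) for label, values in groups.items()}

# The threshold is an arbitrary stand-in for "can afford the car".
share = share_at_least((income for _, income in records), 100000)
medians = median_by_group(records)
print(share, medians)
```

Notice that the question you ask decides the summary you compute: the proportion ignores everything except which side of the threshold a person falls on, while the medians ignore everything except the middle of each group.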

All of these questions are difficult to answer simply by flipping through the 500 cases in this data set and getting a general impression. We need a better plan of attack. Your readings will suggest two very broad categories of summaries: graphical and numerical.
Each summary technique will emphasize certain aspects or characteristics of a variable, but will discard others.
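To make this concrete, here is a small Python sketch with two made-up sets of incomes (in thousands of dollars). The bin width of 25 is an arbitrary choice. The numerical summaries (mean and median) cannot tell the two sets apart, while the bin counts, which are what a histogram would draw, show that they look nothing alike.

```python
from collections import Counter
from statistics import mean, median

# Two hypothetical income lists (thousands of dollars), invented for illustration.
a = [0, 0, 50, 100, 100]
b = [40, 45, 50, 55, 60]

def bin_counts(values, width):
    """Count values per bin; each key is the bin's lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

for name, data in (("a", a), ("b", b)):
    # Same mean and median for both lists, but different bin counts.
    print(name, mean(data), median(data), bin_counts(data, 25))
```

The mean and median keep the "center" and discard the shape. The bin counts keep the shape but discard the exact values within each bin.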