Exploring & Describing Data


 Graphical Techniques

Graphical techniques all center around different ways of picturing/displaying the distribution. One popular technique, the boxplot, is usually not taught until after teaching some numerical techniques since it requires a little more background.
Graphical techniques include the dot plot, the histogram, ntigram, and the stem-and-leaf plot. There is nothing to prevent you from making up your own. Florence Nightingale, now remembered more for her nursing than her statistical savvy, invented one she called the coxcomb graph because it was as colorful as a rooster's tail.
The plots above all provide a way of picturing the frequencies of a frequency distribution. You might be tempted to get caught up in the details of how to craft these plots. But in usual practice, a computer or calculator will do this. Far more important is learning how to interpret these plots and learning when to use them. (Of course, some practice in making these plots "by hand" might assist with learning to interpret them.)

By far the most common technique is the histogram. Making a histogram requires dividing up your data into "bins" and then counting the number of observed values that fall within each bin. For example, with our income variable we might define one bin to be $0 to $2000. We could then count how many observations fell in that range.
There is no reason that the bins need all be the same width (although they almost always are, particularly if computer drawn). You also have to decide what to do with observations that fall on the borders. (If we have an income of $1000, do we put it in the 0-1000 bin or in the next bin to the right?)
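To make the border question concrete, here is a small sketch in Python with NumPy. The incomes and bin edges are made up for illustration (they are not the course data), and the two conventions shown are just two common choices, not the rule any particular package is guaranteed to follow.

```python
import numpy as np

# Hypothetical incomes in dollars (invented for this example).
incomes = np.array([250, 1000, 1000, 1800, 2000, 2600, 3999, 4000, 5200, 6000])

# Three equal-width bins with edges at 0, 2000, 4000 and 6000.
edges = np.array([0, 2000, 4000, 6000])

# Convention A: np.histogram puts a border value in the bin to its right.
# Each bin is [left, right), except the last bin, which also includes its right edge.
left_closed_counts, _ = np.histogram(incomes, bins=edges)

# Convention B: put a border value in the bin to its left, so each bin is (left, right].
# np.digitize with right=True returns i such that edges[i-1] < x <= edges[i].
bin_index = np.digitize(incomes, edges, right=True)
right_closed_counts = np.bincount(bin_index, minlength=len(edges))[1:]

print("border goes right:", left_closed_counts)   # counts per bin: 4, 3, 3
print("border goes left: ", right_closed_counts)  # counts per bin: 5, 3, 2
```

The incomes of exactly 2000 and 4000 move from one bin to its neighbor when the convention changes, so the same data give slightly different bar heights. Either choice is fine as long as you apply it consistently.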
There are two features of histograms to pay attention to. One is much more important than the other:
1) the y-axis. The height of each bar can represent counts (how many observations fell into the bin), frequency, or "relative frequency" as it is sometimes called (what percent or fraction of the total number of observations fell into the bin), or density. Density means the relative frequency of observations per x-unit. Think of density in terms of population: the density of a neighborhood is the number of people per square foot, for example. In a histogram, the density is the fraction of total observations per whatever x-unit you are using.
Fathom makes it easy to switch units in the histogram. You'll notice the shape of the histogram doesn't change, but the labels do. The density units here (which aren't displayed on the graph) are "percent per dollar," written as a fraction: a height of 0.01 would mean 1% of the observations per dollar.
Density histograms have a very important practical use: the area of the bars represents the percent of observations falling within that bin. So in my histogram of the income variable:
the height of the first bar is about 0.000025 density units
the width is 2000 dollars (the bin extends from 0 to 2000).
the area of this bar is 0.000025*2000 = 0.05
and so 5% of the people in the sample have incomes between 0 (inclusive) and 2000 dollars.
2) the bin-width. You can change the shape of the histogram, sometimes quite drastically, by changing the width of the bins. If the bins are wide, you lose detail, and if they are too narrow you get too much detail. There's no rule that says what the width should be. Most computer programs make a smart guess. But you should always alter the widths a bit to get a sense of the range of possible shapes.
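Both features can be checked numerically. The sketch below uses Python with NumPy and 20 made-up incomes (not the course data), chosen so that exactly one income falls in the 0 to 2000 bin and the arithmetic matches the 0.000025 * 2000 = 0.05 calculation above. The helper name summarize is hypothetical, introduced only for this illustration.

```python
import numpy as np

# Hypothetical sample of 20 incomes in dollars (invented for this example).
incomes = np.array([
    1500, 2500, 3000, 3500, 4200, 4800, 5100, 5500, 6300, 6800,
    7200, 7900, 8400, 9100, 9600, 10500, 11200, 12800, 14500, 17900,
])

def summarize(values, edges):
    """Return counts, relative frequencies, densities and bin widths.

    Assumes every value lies inside the range covered by the edges.
    """
    counts, edges = np.histogram(values, bins=edges)
    rel_freq = counts / counts.sum()   # fraction of all observations in each bin
    widths = np.diff(edges)            # bin widths in x-units (dollars here)
    density = rel_freq / widths        # fraction of observations per dollar
    return counts, rel_freq, density, widths

# Bins of width 2000, from 0 to 18000.
narrow_edges = np.arange(0, 18001, 2000)
counts, rel_freq, density, widths = summarize(incomes, narrow_edges)

# Area of the first bar = height * width = fraction of observations in that bin.
first_bar_area = density[0] * widths[0]
print("first bar height:", density[0])
print("first bar area:  ", first_bar_area)

# The bar areas of a density histogram add up to 1 (100% of the observations).
total_area = np.sum(density * widths)

# NumPy's own density option should give the same bar heights.
numpy_density, _ = np.histogram(incomes, bins=narrow_edges, density=True)

# Same data, much wider bins: only three bars, so most of the detail is gone.
wide_edges = np.array([0, 6000, 12000, 18000])
wide_counts, _, wide_density, _ = summarize(incomes, wide_edges)
print("counts, width 2000:", counts)
print("counts, width 6000:", wide_counts)
```

With bins of width 2000 the counts rise to a peak and then trail off to the right. With bins of width 6000 the same data collapse into three bars, which is the loss of detail described in 2). Switching the y-axis from counts to density rescales the bars without changing which observations fall in which bin.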
What to look for in a graphical summary of a distribution
Shape: does it have one hump or two, or more? Is it symmetric? Skewed? Are there gaps? Single isolated values?
Center: about where is the 'center' of the distribution? (More about how to define center later.)
Spread: how spread-out are the values? This is closely related to shape. Are the values clustered around a central value, or are there several "bunches"?