Descriptive Statistics
Y520
Robert S Michael

Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web site, be sure you can define, know when to use, calculate (with SPSS), and interpret the following:
I. Indicators of Central Tendency A.
II. Indicators of Dispersion A.
III. Graphic Presentation and Summarization A.
Sort raw data
Reduce raw data to categories
Cumulative frequencies & percentiles
IV. Exploratory Data Analysis A.
Box and whisker plot
Stem and leaf display
Displaying the Shape of the Distribution

Goal: Determine how closely the shape of the distribution approximates a Gaussian distribution. “Parametric” statistical tests, the kind we will study next, assume the data do indeed approximate a Gaussian distribution.
V. Indicators of a Gaussian distribution A.
Mean = Median = Mode
B. Skewness:

$$ b_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 $$

measures the asymmetry of the distribution. A value of zero indicates no skewness is present. The larger the absolute value, the more skewed the distribution. Negative skew indicates the tail of the distribution is to the left, with most of the scores clustering at the higher end of the scale. Positive skew indicates the scores cluster at the low end of the scale and the tail extends to the right.

C. Kurtosis:

$$ b_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3 $$

indicates the flatness of the distribution. Before the 3 is subtracted, the averaged fourth-power term equals 3 for a Gaussian distribution, so:
1. Mesokurtic: term = 3 (b_2 = 0)
2. Platykurtic: term < 3 (b_2 < 0)
3. Leptokurtic: term > 3 (b_2 > 0)
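The two moment formulas can be sketched in a few lines of Python. This is only an illustration: the function name `moment_skew_kurt` and the two small data sets are made up for the example, and the notes do not say which divisor is used for s, so the sketch assumes s is computed with a divisor of n.

```python
from math import sqrt

def moment_skew_kurt(values):
    """Return (b1, b2): moment skewness and kurtosis minus 3.

    Assumption: s is the standard deviation computed with divisor n.
    """
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((x - mean) ** 2 for x in values) / n)
    z = [(x - mean) / s for x in values]      # standardized scores
    b1 = sum(v ** 3 for v in z) / n           # average cubed z-score
    b2 = sum(v ** 4 for v in z) / n - 3       # average fourth power, minus 3
    return b1, b2

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric scores
right_tailed = [1, 2, 3, 4, 20]    # hypothetical scores with a long right tail

print(moment_skew_kurt(symmetric))
print(moment_skew_kurt(right_tailed))
```

The symmetric set gives a skewness of zero and a negative b_2 (flatter than a Gaussian), while the right-tailed set gives a positive skewness.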
Graphs 1. Ogive 2. Normal Probability Plots
Statistical Tests 1. Chi Square
VI. Resistant indicators

A. Central Tendency. In certain data sets some observed values lie far away from the clump of the data values. These “outliers” or “extreme” scores may be due to measurement errors or data recording errors, or may represent valid data points. Extreme scores unduly influence the mean and standard deviation. Suppose, for example, that the mean annual salary in this class is $59,000. Now, imagine that for some reason Bill Gates decides to join our class. When we include his, say, $10,000,000 annual salary, we are now all millionaires, for the class mean is now $x,xxx,xxx. The mean no longer describes a typical salary, for the mean is not resistant to extreme scores. Hence, use the median instead. The median is not influenced by the exact value of the largest score (or value) and thus is a more resistant measure of central tendency.

B. Dispersion. The range, clearly, is not resistant to the influence of extreme scores. Because each value in a distribution is included in the calculation of the variance and standard deviation, neither is resistant to extreme values. The interquartile range, because it is based on percentiles, is resistant to extreme scores. The lower quartile is the value such that 25 percent of all values fall below that value. The upper quartile is the value such that 25 percent of all values fall above it. The interquartile range is the difference between the upper and lower quartiles. In a large sample that approximates the Gaussian distribution, the interquartile range tends to be about 1.35 times the sample standard deviation.
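A small Python sketch shows both points. The seven salaries are hypothetical, chosen only so that their mean is $59,000; the class size is an assumption, not something stated in these notes. The last lines use the standard library's `NormalDist` to check the ratio of the interquartile range to the standard deviation for a Gaussian distribution.

```python
from statistics import NormalDist, mean, median

# Hypothetical class salaries (assumed values) with a mean of $59,000.
salaries = [52_000, 55_000, 58_000, 59_000, 60_000, 61_000, 68_000]
with_outlier = salaries + [10_000_000]   # one extreme salary joins the class

print("mean before/after:  ", mean(salaries), mean(with_outlier))
print("median before/after:", median(salaries), median(with_outlier))

# Interquartile range of a standard Gaussian, in standard deviation units.
std_normal = NormalDist(mu=0, sigma=1)
iqr_in_sd_units = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
print("Gaussian IQR / SD:", round(iqr_in_sd_units, 3))
```

In this made-up class the mean jumps from $59,000 to more than $1.3 million, while the median only moves from $59,000 to $59,500.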
Shape of the distribution. Resistant indicators of skewness and kurtosis also exist, such as the Yule-Kendall skewness statistic, defined as:

$$ \gamma_{YK} = \frac{x_{0.25} - 2x_{0.5} + x_{0.75}}{x_{0.75} - x_{0.25}} $$
Other resistant indicators based on all the quantiles, such as L-moments, also exist, but these are not included in an introductory discussion.
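As a rough illustration, the Yule-Kendall statistic can be computed from the three quartiles. The sketch below uses the ten P102 exam scores that appear in the worked example in these notes. The function name `yule_kendall` is made up for the example, and the quartiles come from Python's `statistics.quantiles` with its default method, which is an assumption: other programs use slightly different quartile rules and may give slightly different values.

```python
from statistics import quantiles

def yule_kendall(values):
    """Resistant skewness based on the quartiles x_0.25, x_0.5, x_0.75."""
    # Assumption: quartiles from the default ("exclusive") method.
    q1, q2, q3 = quantiles(values, n=4)
    return (q1 - 2 * q2 + q3) / (q3 - q1)

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]   # P102 exam scores
corrected = [57] + original[1:]                       # student A's score changed to 57

print(yule_kendall(original))
print(yule_kendall(corrected))
```

Both calls return the same negative value, because moving the lowest score further down does not change any of the quartiles.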
Calculation of Mean and Standard Deviation
Sample of 10 Scores from P102 Exam

Person    Score (x)    (x-M)      (x-M)²
A            67        -23.1      533.61
B            95          4.9       24.01
C            98          7.9       62.41
D            92          1.9        3.61
E            99          8.9       79.21
F            96          5.9       34.81
G            94          3.9       15.21
H            90         -0.1        0.01
I            95          4.9       24.01
J            75        -15.1      228.01

sum = 901                          sum ->                1,004.90
mean = 90.1                        variance ->             111.66
skewness = -1.66                   standard deviation ->    10.57
kurtosis = 1.78

Note that the mean is the arithmetic average,

$$ M = \frac{\sum x_i}{n}. $$

The column labelled (x-M) shows the amount by which each score deviates from the mean. This column will always sum to zero. The sum of the column labelled (x-M)² is known as the “sum of the squared deviations about the mean,” or just as the “sum of squares.” The variance is the sum of squares divided by n - 1,

$$ s^2 = \frac{\sum (x - M)^2}{n - 1}, $$

and the standard deviation is the square root of the variance,

$$ s = \sqrt{\frac{\sum (x - M)^2}{n - 1}}. $$
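The same calculation can be sketched in Python, following the column-by-column logic described above. The variable names are chosen for the illustration only. The results agree with the Sample Variance and Standard Deviation reported for the original data in the summary of descriptive statistics.

```python
from math import sqrt

# The ten P102 exam scores, keyed by person.
scores = {"A": 67, "B": 95, "C": 98, "D": 92, "E": 99,
          "F": 96, "G": 94, "H": 90, "I": 95, "J": 75}

n = len(scores)
mean = sum(scores.values()) / n

# Deviation of each score from the mean; these always sum to zero.
deviations = {person: x - mean for person, x in scores.items()}

# Sum of squared deviations about the mean ("sum of squares").
sum_of_squares = sum(d ** 2 for d in deviations.values())

variance = sum_of_squares / (n - 1)   # divide by n - 1, not n
std_dev = sqrt(variance)

print(f"mean = {mean:.2f}")
print(f"sum of squares = {sum_of_squares:.2f}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f}")
```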
To illustrate the impact of an extreme score, suppose the instructor realizes that the score of 67 for student A was entered by mistake and that student A actually earned a score of 57. Note the changes in the descriptive statistics when this single change is made.
Effect of an Extreme Score
Sample of 10 Scores from P102 Exam

Person    Score (x)    (x-M)      (x-M)²
A            57        -32.1     1030.41
B            95          5.9       34.81
C            98          8.9       79.21
D            92          2.9        8.41
E            99          9.9       98.01
F            96          6.9       47.61
G            94          4.9       24.01
H            90          0.9        0.81
I            95          5.9       34.81
J            75        -14.1      198.81

sum = 891                          sum ->                1,556.90
mean = 89.1                        variance ->             172.99
skewness = -2.03                   standard deviation ->    13.15
kurtosis = 3.85
Note the changes in the descriptive statistics presented below. The mean changes slightly (about one percent), as you would expect due to an extreme score, but the median remains unchanged. This illustrates the meaning of a “resistant” indicator. The standard deviation shows a 24 percent increase. Skewness and kurtosis also show large changes, suggesting the shape of the distribution departs even further from the Gaussian.

Statistic             Original Data    One Extreme Score
Mean                      90.10              89.10
Standard Error             3.34               4.16
Median                    94.50              94.50
Mode                      95.00              95.00
Standard Deviation        10.57              13.15
Sample Variance          111.66             172.99
Kurtosis                   1.78               3.85
Skewness                  -1.66              -2.03
Range                     32.00              42.00
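The comparison can be reproduced with a short Python sketch. The notes do not say which formulas produced the skewness and kurtosis figures, so this is an assumption: the sketch uses the sample-size-adjusted formulas found in many spreadsheet and statistics packages, which happen to match the reported values to two decimals. The function names are made up for the illustration.

```python
from math import sqrt
from statistics import mean, median, mode, stdev, variance

def adjusted_skewness(values):
    # Assumed formula: n / ((n-1)(n-2)) * sum of cubed z-scores.
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def adjusted_kurtosis(values):
    # Assumed formula: adjusted fourth-moment kurtosis, already minus 3.
    n, m, s = len(values), mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def describe(values):
    return {
        "Mean": mean(values),
        "Standard Error": stdev(values) / sqrt(len(values)),
        "Median": median(values),
        "Mode": mode(values),
        "Standard Deviation": stdev(values),
        "Sample Variance": variance(values),
        "Kurtosis": adjusted_kurtosis(values),
        "Skewness": adjusted_skewness(values),
        "Range": max(values) - min(values),
    }

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
corrected = [57] + original[1:]          # student A's score changed to 57

before, after = describe(original), describe(corrected)
for name in before:
    print(f"{name:20s}{before[name]:10.2f}{after[name]:10.2f}")
```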
Here is how skewness is calculated by hand for a different set of data:
1. List the raw scores in a column.
2. Subtract the mean from each raw score (aka deviations from the mean).
3. Raise each of these deviations from the mean to the third power and sum (aka sum of third moment deviations).
4. Calculate skewness, which is the sum of the deviations from the mean raised to the third power, divided by the number of cases minus 1 times the standard deviation raised to the third power.

                          y      (y - M)    (y - M)³
                        8.04       0.54        0.16
                        6.95      -0.55       -0.17
                        7.58       0.08        0.00
                        8.81       1.31        2.24
                        8.33       0.83        0.57
                        9.96       2.46       14.87
                        7.24      -0.26       -0.02
                        4.26      -3.24      -34.04
                       10.84       3.34       37.23
                        4.82      -2.68      -19.27
                        5.68      -1.82       -6.04
sum = Σy =             82.51       0.00       -4.46  = sum of cubed deviations
mean = (Σy)/n = M =     7.50
st dev = √var =         2.03
(n-1) × stdev³ =                              83.65
skewness = -4.46 / 83.65 =                  -0.0533
Calculating Skewness:
1. First, calculate the mean and standard deviation.
2. Subtract the mean from each raw score and cube (i.e., raise to the third power).
3. Sum the cubed deviations.
4. Multiply the number of scores minus 1 times the cubed standard deviation (i.e., raised to the third power).
5. Skewness = step 3 divided by step 4.
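The five steps above can be sketched in Python as follows, using the same eleven scores. The variable names are chosen for the illustration only.

```python
from math import sqrt

y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
n = len(y)

# Step 1: mean and standard deviation (divisor n - 1).
mean = sum(y) / n
std_dev = sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))

# Step 2: subtract the mean from each raw score and cube.
cubed_deviations = [(v - mean) ** 3 for v in y]

# Step 3: sum the cubed deviations.
sum_cubed = sum(cubed_deviations)

# Step 4: (number of scores minus 1) times the cubed standard deviation.
denominator = (n - 1) * std_dev ** 3

# Step 5: skewness is step 3 divided by step 4.
skewness = sum_cubed / denominator

print(f"mean = {mean:.2f}, st dev = {std_dev:.2f}")
print(f"sum of cubed deviations = {sum_cubed:.2f}")
print(f"skewness = {skewness:.4f}")
```

Because the sketch carries the unrounded standard deviation through step 4, its denominator is about 83.85 rather than 83.65, so the skewness differs from the hand calculation in the fourth decimal place.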
Keep in mind that if a distribution is positively skewed, the bulk of the values clump around the lower end of the scale with a few trailing off at the high end. Conversely, in a negatively skewed distribution, the bulk of the values clump around or near the high end of the scale with a few values trailing off at the low end. The following table summarizes the descriptive statistics for the P102 sample.
Table 1: Summary Statistics for P102 Exam Data

Statistic              Definition          Comment
sample size                                number of cases/individuals
mean                                       non-resistant measure of location
standard deviation                         non-resistant measure of dispersion
range                  x_max - x_min       non-resistant measure of scale
skewness                                   non-resistant measure of skewness
kurtosis                                   non-resistant measure of kurtosis
median                                     resistant measure of location
interquartile range    x_0.75 - x_0.25     resistant measure of dispersion
Yule-Kendall                               resistant measure of skewness
The equation for the Gaussian curve is

$$ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

where
y = The height of the curve at a given value of x.
σ = The standard deviation of the distribution.
π = A constant (pi) of approximately 3.1416.
x = A specific score within the distribution.
e = The base of the Napierian logarithms, approximately 2.71828.
µ = The mean of the distribution.
σ² = The variance of the distribution.
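As a quick check, the equation can be typed directly into Python and compared with the standard library's `NormalDist.pdf`. The function name `gaussian_height` is made up for this sketch, and the mean of 90.1 and standard deviation of 10.57 are the P102 values from earlier in these notes, used here only as sample inputs.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def gaussian_height(x, mu, sigma):
    """Height y of the Gaussian curve at score x."""
    return 1 / (sigma * sqrt(2 * pi)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Height at the mean of a standard Gaussian (mu = 0, sigma = 1).
print(gaussian_height(0, 0, 1))

# Height at x = 94.5 for a curve with the P102 mean and standard deviation,
# next to the library value for comparison.
print(gaussian_height(94.5, 90.1, 10.57), NormalDist(90.1, 10.57).pdf(94.5))
```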
Box Plots

Box plots are useful in visualizing distributions. Consider the following scattergram of per capita income for each of the 50 states (y axis) with charitable deductions (x axis) listed on 1998 itemized tax returns.

[Figure: scattergram of per capita income (y axis) against charitable deductions (x axis) for the 50 states, with a box plot for each variable]

An explanation of the box plot follows. The line or asterisk within the box is the median of the distribution. Fifty percent of the cases fall between the upper and lower hinges (the box boundaries). The upper hinge occurs at the 75th percentile, which is the third quartile and, in a Gaussian distribution, corresponds to a z-score of about 0.67. As discussed earlier, the median occurs at the 50th percentile, which is the second quartile and corresponds to a z-score of zero. The lower hinge occurs at the 25th percentile, which is the first quartile and corresponds to a z-score of about -0.67. The “whiskers” terminate at the largest and smallest values that are not considered to be outliers. The definitions for “outlier” and “extreme” scores may vary depending on the software program. A common definition for an outlier is any value more than 1.5 box-lengths above the upper hinge or below the lower hinge, and for an extreme score, any value more than 3 box-lengths above the upper hinge or below the lower hinge.
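The numbers behind a box plot can be sketched in Python. This is only an illustration under stated assumptions: the hinges are taken to be the quartiles returned by `statistics.quantiles` with its default method (programs differ in how they compute hinges), the function name `box_plot_summary` is made up, and the data are the ten P102 exam scores rather than the state income data.

```python
from statistics import quantiles

def box_plot_summary(values, outlier_mult=1.5, extreme_mult=3.0):
    """Hinges, whisker ends, outliers, and extreme scores for a box plot."""
    lower_hinge, med, upper_hinge = quantiles(values, n=4)  # assumed quartile rule
    box_length = upper_hinge - lower_hinge

    def beyond(mult):
        # Values more than `mult` box-lengths outside the hinges.
        low = lower_hinge - mult * box_length
        high = upper_hinge + mult * box_length
        return [x for x in values if x < low or x > high]

    extremes = beyond(extreme_mult)
    outliers = [x for x in beyond(outlier_mult) if x not in extremes]
    inside = [x for x in values if x not in outliers and x not in extremes]
    return {
        "lower_hinge": lower_hinge,
        "median": med,
        "upper_hinge": upper_hinge,
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "extremes": extremes,
    }

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]   # P102 exam scores
for name, value in box_plot_summary(scores).items():
    print(f"{name:12s} {value}")
```

With these scores the box runs from 86.25 to 96.5, the whiskers end at 75 and 99, and the score of 67 is flagged as an outlier but not as an extreme score.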
In the charitable giving example one of the states (that shall remain nameless) has a high per capita income (around $27,000) but gives only about $1,000 to charity. Notice that the circle for this pair of data points lies beyond the whisker of the “charitable giving” box.
Stem and Leaf

Another useful data display is known as the stem and leaf. This is a simple way of displaying the distribution of data without having to use computer graphics. The characteristic that makes the stem and leaf unique is that every value in the data set is displayed. The stem and leaf “plot” groups the values in a data set according to their all but least significant digits. These are written in ascending or descending order to the left side of a vertical bar and are known as the “stem.” The “leaves” are formed by writing the least significant digit to the right of the vertical bar, on the same line as the more significant digits with which it belongs.

The stem and leaf plot below shows the charitable giving for 100 individuals. We can see that the least amount one person gave was $1,082 while the most one person gave was $5,779. Further, we can see that in the $4,000 range, the following exact values were given: $4,018, $4,057, $4,073, $4,095 . . . $4,814. The stem and leaf will vary slightly in appearance depending on the specific software used. Some programs enable you to examine the leaves in detail, by reporting the number of cases, the spread, the value of the lower and upper hinges, etc.

    1*** | 082
    1*** | 303
    1*** |
    1*** | 785
    1*** | 870,976,985
    2*** | 012,040,116
    2*** | 212,242,256,296,308
    2*** | 448,482,511,511,530,560
    2*** | 609,632,686,718,740,785
    2*** | 806,829,833,871,885,899,951,963
    3*** | 001,010,015,028,030,088,164,170,171,178
    3*** | 225,229,237,277,310,358,385,392
    3*** | 413,414,439,450,450,502,519,594
    3*** | 615,633,638,654,682,738,761
    3*** | 813,813,820,834,860,872,897,914,918,955,994
    4*** | 018,057,073,095,154,192
    4*** | 238,271,342,377,379,387
    4*** | 425,426,494,545
    4*** |
    4*** | 814
    5*** | 009
    5*** | 273,379
    5*** | 501
    5*** | 779
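A display in this style can be built with a few lines of Python. The sketch below is only an illustration: the function name `stem_and_leaf` and its parameters are made up, and it assumes whole-dollar values, a three-digit leaf, and five lines per stem, as in the display above. The sample values are the nine smallest amounts from that display.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=3, lines_per_stem=5):
    """Return the lines of a stem and leaf display for whole-number values."""
    base = 10 ** leaf_digits            # 1000 for three-digit leaves
    width = base // lines_per_stem      # range of leaves covered by one line
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, base)
        rows[(stem, leaf // width)].append(leaf)

    first, last = min(rows), max(rows)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for sub in range(lines_per_stem):
            key = (stem, sub)
            if key < first or key > last:
                continue                # skip lines outside the data range
            leaves = ",".join(f"{leaf:0{leaf_digits}d}" for leaf in rows.get(key, []))
            lines.append(f"{stem}{'*' * leaf_digits} | {leaves}".rstrip())
    return lines

giving = [1082, 1303, 1785, 1870, 1976, 1985, 2012, 2040, 2116]
for line in stem_and_leaf(giving):
    print(line)
```

Lines with no values are still printed, so gaps in the distribution remain visible.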
Histogram

The range of values is divided into a finite set of class intervals known as “bins.” The number of values in each bin is then counted and divided by the sample size to obtain the relative frequency of occurrence. The frequency is plotted as vertical bars of varying height. Some programs allow the user to set the number of bins that appear. The frequencies can be divided by the bin width to obtain frequency densities that can be compared to probability densities from a theoretical distribution, such as the Gaussian distribution. For example, the Gaussian probability density function is superimposed on the frequency histogram of the charitable giving of 100 individuals.
[Figure: frequency histogram of charitable giving for 100 individuals with the Gaussian probability density function superimposed; x axis: Charitable Giving]
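The binning arithmetic can be sketched in Python. For brevity the sketch uses the ten P102 exam scores rather than the charitable giving data, and the bin start, bin width, and number of bins are arbitrary choices made for the illustration. The last column shows the Gaussian density at each bin midpoint, for comparison with the frequency density.

```python
from statistics import NormalDist, mean, stdev

def histogram(values, start, bin_width, n_bins):
    """Return (left edge, count, relative frequency, frequency density) per bin."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // bin_width)
        if 0 <= index < n_bins:          # values outside the bins are ignored
            counts[index] += 1
    n = len(values)
    rows = []
    for i, count in enumerate(counts):
        rel_freq = count / n
        rows.append((start + i * bin_width, count, rel_freq, rel_freq / bin_width))
    return rows

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]   # P102 exam scores
START, BIN_WIDTH, N_BINS = 60, 10, 4                # assumed bin layout
rows = histogram(scores, START, BIN_WIDTH, N_BINS)

curve = NormalDist(mean(scores), stdev(scores))
for left, count, rel_freq, density in rows:
    midpoint = left + BIN_WIDTH / 2
    print(f"{left:3d}-{left + BIN_WIDTH:<3d} {'#' * count:8s} "
          f"{rel_freq:.2f}  {density:.3f}  {curve.pdf(midpoint):.3f}")
```

The frequency densities multiplied by the bin width sum to one, which is what makes them comparable to a probability density.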