
I. Indicators of Central Tendency
   A. Mode
   B. Median
   C. Mean

II. Indicators of Dispersion
   A. Range
   B. Interquartile Range
   C. Variance
   D. Standard Deviation

III. Graphic Presentation and Summarization
   A. Sort raw data
   B. Frequency table
   C. Reduce raw data to categories
   D. Cumulative frequencies & percentiles
   E. Histograms

IV. Exploratory Data Analysis
   A. Box and whisker plot
   B. Stem and leaf display


Descriptive Statistics

Displaying the Shape of the Distribution

Goal: Determine how closely the shape of the distribution approximates a Gaussian distribution. “Parametric” statistical tests (the kind we will study next) assume that the data do indeed approximate a Gaussian distribution.

V. Indicators of a Gaussian distribution

   A. Mean = Median = Mode

   B. Skewness:

      $$ b_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3 $$

      measures the asymmetry of the distribution. A value of zero indicates no skewness is present. The larger the absolute value, the more skewed the distribution. Negative skew indicates the tail of the distribution is to the left, with most of the scores clustering at the higher end of the scale. Positive skew indicates the scores cluster at the low end of the scale and the tail extends to the right.

   C. Kurtosis:

      $$ b_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3 $$

      indicates the flatness of the distribution. The cutoffs below refer to the fourth-moment term before 3 is subtracted; in terms of $b_2$ itself, the corresponding cutoffs are zero, negative, and positive.

      1. Mesokurtic: = 3
      2. Platykurtic: < 3
      3. Leptokurtic: > 3

   D. Graphs
      1. Ogive
      2. Normal Probability Plots

   E. Statistical Tests
      1. Chi Square
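The skewness and kurtosis formulas in items B and C can be sketched in a few lines of Python. This is only an illustration: it assumes that s is the sample standard deviation (computed with n − 1), and the two small data sets are hypothetical, not taken from the course data.

```python
from statistics import mean, stdev


def skewness_b1(values):
    """b1 = (1/n) * sum(((x - mean) / s) ** 3); s is assumed to be the sample SD."""
    m, s, n = mean(values), stdev(values), len(values)
    return sum(((x - m) / s) ** 3 for x in values) / n


def kurtosis_b2(values):
    """b2 = (1/n) * sum(((x - mean) / s) ** 4) - 3; s is assumed to be the sample SD."""
    m, s, n = mean(values), stdev(values), len(values)
    return sum(((x - m) / s) ** 4 for x in values) / n - 3


symmetric = [1, 2, 3, 4, 5]          # hypothetical, perfectly symmetric scores
right_tailed = [1, 1, 2, 2, 3, 12]   # hypothetical scores with a long right tail

print("skewness, symmetric:   ", skewness_b1(symmetric))
print("skewness, right-tailed:", skewness_b1(right_tailed))
print("kurtosis, symmetric:   ", kurtosis_b2(symmetric))
```

The symmetric set should give a skewness of essentially zero, the right-tailed set a positive value, and the flat symmetric set a negative b2 (platykurtic).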

VI. Resistant indicators

A. Central Tendency. In certain data sets some observed values lie far away from the clump of the data values. These “outliers” or “extreme” scores may be due to measurement errors or data recording errors, or they may represent valid data points. Extreme scores unduly influence the mean and standard deviation. Suppose, for example, that the mean annual salary in this class is $59,000. Now imagine that for some reason Bill Gates decides to join our class. When we include his, say, $10,000,000 annual salary, we are now all millionaires, for the class mean is now $x,xxx,xxx. The mean is no longer descriptive of the typical salary, because the mean is not resistant to extreme scores. Hence, use the median instead. The median is not influenced by the exact value of the largest score (or value) and thus is a more resistant measure of central tendency.

B. Dispersion. The range, clearly, is not resistant to the influence of extreme scores. Because each value in a distribution is included in the calculation of the variance and standard deviation, neither is resistant to extreme values. The interquartile range, because it is based on percentiles, is resistant to extreme scores. The lower quartile is the value such that 25 percent of all values fall below it. The upper quartile is the value such that 25 percent of all values fall above it. The interquartile range is the difference between the upper and lower quartiles. In a large sample that approximates the Gaussian distribution, the interquartile range tends to be about 1.35 times the sample standard deviation.
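The relationship between the interquartile range and the standard deviation can be checked against the theoretical Gaussian curve. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with a standard deviation of 1, so the result is expressed directly in standard deviation units.

```python
from statistics import NormalDist

# Standard Gaussian distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

lower_quartile = standard_normal.inv_cdf(0.25)  # 25 percent of values fall below
upper_quartile = standard_normal.inv_cdf(0.75)  # 25 percent of values fall above
iqr_in_sd_units = upper_quartile - lower_quartile

print("lower quartile (z):", round(lower_quartile, 3))
print("upper quartile (z):", round(upper_quartile, 3))
print("IQR in SD units:   ", round(iqr_in_sd_units, 3))
```

The quartiles fall at roughly 0.67 standard deviations on either side of the mean, so the interquartile range comes out at roughly 1.35 standard deviations.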

C. Shape of the distribution. Resistant indicators of skewness and kurtosis also exist, such as the Yule-Kendall skewness statistic, defined as:

$$ \gamma_{YK} = \frac{x_{0.25} - 2x_{0.5} + x_{0.75}}{x_{0.75} - x_{0.25}} $$

where $x_{0.25}$, $x_{0.5}$, and $x_{0.75}$ are the lower quartile, the median, and the upper quartile. Other resistant indicators exist based on all the quantiles, such as L-moments, but these are not included in an introductory discussion.
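As a minimal sketch of the Yule-Kendall statistic, the function below takes the three quartiles directly. The function name and the quartile values are hypothetical and chosen only to show how the sign behaves; they are not drawn from the P102 data.

```python
def yule_kendall(lower_quartile, median, upper_quartile):
    """Yule-Kendall skewness: (x_0.25 - 2*x_0.5 + x_0.75) / (x_0.75 - x_0.25)."""
    interquartile_range = upper_quartile - lower_quartile
    return (lower_quartile - 2 * median + upper_quartile) / interquartile_range


# Hypothetical quartiles: the median sits in the middle, low, or high in the box.
symmetric = yule_kendall(2.0, 4.0, 6.0)      # median centered between quartiles
right_skewed = yule_kendall(2.0, 3.0, 6.0)   # median closer to lower quartile
left_skewed = yule_kendall(2.0, 5.0, 6.0)    # median closer to upper quartile

print(symmetric, right_skewed, left_skewed)  # 0.0 0.5 -0.5
```

Because only the quartiles enter the calculation, changing the most extreme score in a data set leaves the statistic unchanged, which is what makes it resistant.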


Calculation of Mean and Standard Deviation

Sample of 10 Scores from P102 Exam

Person   Score (x)    (x-M)     (x-M)^2
A           67        -23.1      533.61
B           95          4.9       24.01
C           98          7.9       62.41
D           92          1.9        3.61
E           99          8.9       79.21
F           96          5.9       34.81
G           94          3.9       15.21
H           90         -0.1        0.01
I           95          4.9       24.01
J           75        -15.1      228.01

sum of scores       =   901
mean                =    90.1
sum of (x-M)^2      = 1,004.90
variance            =   111.66
standard deviation  =    10.57
skewness            =    -1.6557
kurtosis            =     1.7794

Note that the mean is the arithmetic average

$$ M = \frac{\sum x_i}{n} $$

The column labelled (x-M) shows the amount by which each score deviates from the mean. This column will always sum to zero. The total of the column labelled (x-M)^2 is known as the “sum of the squared deviations about the mean,” or just as the “sum of squares.” The variance is the sum of squares divided by n − 1

$$ s^2 = \frac{\sum (x - M)^2}{n - 1} $$

and the standard deviation is the square root of the variance

$$ s = \sqrt{\frac{\sum (x - M)^2}{n - 1}} $$
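The calculation above can be retraced step by step in Python. This is a sketch that follows the formulas in the text (division by n − 1) using the ten P102 scores; the variable names are illustrative.

```python
from math import sqrt

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]  # persons A through J

n = len(scores)
mean = sum(scores) / n

# Deviations from the mean; apart from rounding error they sum to zero.
deviations = [x - mean for x in scores]

# Sum of squares: the squared deviations added together.
sum_of_squares = sum(d ** 2 for d in deviations)

variance = sum_of_squares / (n - 1)
standard_deviation = sqrt(variance)

print("mean:              ", round(mean, 2))
print("sum of deviations: ", round(sum(deviations), 6))
print("sum of squares:    ", round(sum_of_squares, 2))
print("variance:          ", round(variance, 2))
print("standard deviation:", round(standard_deviation, 2))
```

Dividing the sum of squares by n instead of n − 1 would give 100.49 rather than 111.66, so it is worth checking which divisor a given software package uses.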

To illustrate the impact of an extreme score, suppose the instructor realizes that the score of 67 for student A was entered by mistake. In actuality, student A earned a score of 57. Note the changes in the descriptive statistics when this single change is made.


Effect of an Extreme Score

Sample of 10 Scores from P102 Exam

Person   Score (x)    (x-M)     (x-M)^2
A           57        -32.1    1,030.41
B           95          5.9       34.81
C           98          8.9       79.21
D           92          2.9        8.41
E           99          9.9       98.01
F           96          6.9       47.61
G           94          4.9       24.01
H           90          0.9        0.81
I           95          5.9       34.81
J           75        -14.1      198.81

sum of scores       =   891
mean                =    89.1
sum of (x-M)^2      = 1,556.90
variance            =   172.99
standard deviation  =    13.15
skewness            =    -2.0341
kurtosis            =     3.8474

Note the changes in the descriptive statistics presented below. The mean changes slightly (about one percent), as you would expect due to an extreme score, but the median remains unchanged. This illustrates the meaning of a “resistant” indicator. The standard deviation shows a 24 percent increase, and skewness and kurtosis also show large changes, suggesting that the shape of the distribution departs even further from the Gaussian.

Statistic             Original Data   One Extreme Score
Mean                      90.10            89.10
Standard Error             3.34             4.16
Median                    94.50            94.50
Mode                      95.00            95.00
Standard Deviation        10.57            13.15
Sample Variance          111.66           172.99
Kurtosis                   1.78             3.85
Skewness                  -1.66            -2.03
Range                     32.00            42.00
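The comparison can be reproduced with the Python standard library. The sketch below recomputes the mean, median, and standard deviation for both versions of the P102 data; the variable names are illustrative.

```python
from statistics import mean, median, stdev

original = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
with_extreme_score = [57] + original[1:]  # student A's score entered as 57

for label, data in (("original", original), ("one extreme score", with_extreme_score)):
    print(f"{label:18s} mean={mean(data):6.2f} "
          f"median={median(data):6.2f} sd={stdev(data):6.2f}")

# Relative change in each statistic caused by the single changed score.
mean_change = (mean(with_extreme_score) - mean(original)) / mean(original)
sd_change = (stdev(with_extreme_score) - stdev(original)) / stdev(original)
print(f"mean change: {mean_change:.1%}, sd change: {sd_change:.1%}")
```

The median is identical in both runs because it depends only on the middle two scores, while the mean and especially the standard deviation respond to the single changed value.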


Here is how skewness is calculated by hand for a different set of data:

Skewness

Column 1. List the raw scores in a column.
Column 2. Subtract the mean from each raw score. Aka: deviations from the mean.
Column 3. Raise each of these deviations from the mean to the third power and sum. Aka: sum of third moment deviations.
Result. Calculate skewness, which is the sum of the cubed deviations from the mean, divided by the number of cases minus 1 times the standard deviation raised to the third power.

                     y      (y - M)    (y - M)^3
                    8.04      0.54        0.16
                    6.95     -0.55       -0.17
                    7.58      0.08        0.00
                    8.81      1.31        2.24
                    8.33      0.83        0.57
                    9.96      2.46       14.87
                    7.24     -0.26       -0.02
                    4.26     -3.24      -34.04
                   10.84      3.34       37.23
                    4.82     -2.68      -19.27
                    5.68     -1.82       -6.04
sum = Σy           82.51      0.00       -4.46   = sum of cubed deviations
mean = (Σy)/n = M   7.50
st dev = √var       2.03
(n-1) × stdev^3                          83.65
skewness                               -0.0533

Calculating Skewness:
1. First, calculate the mean and standard deviation.
2. Subtract the mean from each raw score and cube the result (i.e., raise it to the third power).
3. Sum the cubed deviations.
4. Multiply the number of scores minus 1 times the cubed standard deviation (i.e., raised to the third power).
5. Skewness = step 3 divided by step 4.
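The five steps above can be sketched directly in Python using the eleven scores from the worked example. The variable names are illustrative, and `statistics.stdev` is assumed to match the hand calculation because it also divides by n − 1.

```python
from statistics import mean, stdev

y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Step 1: mean and standard deviation.
m = mean(y)
s = stdev(y)

# Step 2: subtract the mean from each raw score and cube the result.
cubed_deviations = [(value - m) ** 3 for value in y]

# Step 3: sum the cubed deviations.
sum_cubed = sum(cubed_deviations)

# Step 4: (number of scores minus 1) times the cubed standard deviation.
denominator = (len(y) - 1) * s ** 3

# Step 5: skewness is step 3 divided by step 4.
skewness = sum_cubed / denominator

print("mean:", round(m, 2), " st dev:", round(s, 2))
print("sum of cubed deviations:", round(sum_cubed, 2))
print("skewness:", round(skewness, 4))
```

The result agrees with the hand calculation to about three decimal places; small differences arise because the table rounds the standard deviation to 2.03 before cubing it.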


Keep in mind that if a distribution is positively skewed, the bulk of the values clump around the lower end of the scale with a few trailing off at the high end. Conversely, in a negatively skewed distribution, the bulk of the values clump around or near the high end of the scale with a few values trailing off at the low end. The following table summarizes the descriptive statistics for the P102 sample.

Table 1: Summary Statistics for P102 Exam Data

Statistic             Symbol              Value    Comment
sample size           n                   10       number of cases/individuals
mean                  x̄                   90.1     non-resistant measure of location
standard deviation    s_x                 10.57    non-resistant measure of dispersion
range                 x_max – x_min       32       non-resistant measure of scale
skewness              b1                  -1.66    non-resistant measure of skewness
kurtosis              b2                  1.78     non-resistant measure of kurtosis
median                x_0.5               94.5     resistant measure of location
interquartile range   x_0.75 – x_0.25     10.25    resistant measure of dispersion
Yule-Kendall          γ_YK                -0.19    resistant measure of skewness
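Several entries in Table 1 can be recomputed from the ten original P102 scores. The sketch below rests on two assumptions that are not stated in the handout: that the reported skewness and kurtosis use the bias-adjusted sample formulas common in spreadsheet software, and that the quartiles use the "exclusive" interpolation method of `statistics.quantiles`. The Yule-Kendall entry is not recomputed here.

```python
from statistics import mean, median, quantiles, stdev

scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
n = len(scores)
m, s = mean(scores), stdev(scores)
z = [(x - m) / s for x in scores]  # standardized scores

# ASSUMPTION: bias-adjusted sample skewness and excess kurtosis.
skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# ASSUMPTION: quartiles by the "exclusive" interpolation method.
lower_quartile, _, upper_quartile = quantiles(scores, n=4, method="exclusive")

value_range = max(scores) - min(scores)
interquartile_range = upper_quartile - lower_quartile

print("range:   ", value_range)
print("median:  ", median(scores))
print("IQR:     ", interquartile_range)
print("skewness:", round(skewness, 4))
print("kurtosis:", round(kurtosis, 4))
```

Under these assumptions the range, median, interquartile range, skewness, and kurtosis agree with the table; other quartile or skewness conventions would give somewhat different numbers for a sample this small.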


[Figure: Gaussian curve, with the x axis marked at the mean and at 1, 2, 3, and 4 standard deviations (sd) on either side of it]

The equation for the Gaussian curve is

$$ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

where:

y = The height of the curve at a given value of x.
σ = The standard deviation of the distribution.
π = A constant (pi) of approximately 3.1416.
x = A specific score within the distribution.
e = The base of the Napierian (natural) logarithms, approximately 2.71828.
µ = The mean of the distribution.
σ² = The variance of the distribution.
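The equation for the Gaussian curve translates almost symbol for symbol into code. The sketch below evaluates the curve for a mean of 0 and a standard deviation of 1 (illustrative values) and compares the result with `statistics.NormalDist.pdf` from the Python standard library; the function name `gaussian_height` is hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def gaussian_height(x, mu, sigma):
    """Height y of the Gaussian curve at x, for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


mu, sigma = 0.0, 1.0  # illustrative mean and standard deviation

for sd_units in (-2, -1, 0, 1, 2):
    x = mu + sd_units * sigma
    by_formula = gaussian_height(x, mu, sigma)
    by_library = NormalDist(mu, sigma).pdf(x)
    print(f"{sd_units:+d} sd  formula={by_formula:.4f}  library={by_library:.4f}")
```

The curve is highest at the mean (about 0.3989 when σ is 1) and has the same height at equal distances above and below the mean.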

Box Plots

Box plots are useful in visualizing distributions. Consider the following scattergram of per capita income for each of the 50 states (y axis) with charitable deductions (x axis) listed on 1998 itemized tax returns.

[Figure: Scattergram of Per Capita Income (y axis, 15,000 to 30,000) versus Charitable Giving (x axis, 0 to 6,000)]

An explanation of the box plot follows. The line or asterisk within the box is the median of the distribution. Fifty percent of the cases fall within the upper and lower hinges (the box boundaries). The upper hinge occurs at the 75th percentile, which is the third quartile and, in a Gaussian distribution, corresponds to a z-score of about 0.67. As discussed earlier, the median occurs at the 50th percentile, which is the second quartile and corresponds to a z-score of zero. The lower hinge occurs at the 25th percentile, which is the first quartile and corresponds to a z-score of about −0.67. The “whiskers” terminate at the largest and smallest values that are not considered to be outliers. The definitions for “outlier” and “extreme” scores may vary depending on the software program. A common definition for an outlier is any value more than 1.5 box-lengths above the upper hinge or below the lower hinge, and for an extreme score, any value more than 3 box-lengths above the upper hinge or below the lower hinge.
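The outlier and extreme-score rules can be sketched as follows. The example uses the P102 scores with the extreme score of 57 and treats the quartiles from `statistics.quantiles` as the hinges, which is an assumption; software packages differ slightly in how they compute hinges. The function name `classify_scores` is hypothetical.

```python
from statistics import quantiles


def classify_scores(values):
    """Label each value as 'inside', 'outlier' (> 1.5 box-lengths), or 'extreme' (> 3)."""
    # ASSUMPTION: quartiles stand in for the lower and upper hinges.
    lower_hinge, _, upper_hinge = quantiles(values, n=4)
    box_length = upper_hinge - lower_hinge
    labelled = []
    for value in values:
        # Distance beyond the nearer hinge; zero for values inside the box.
        distance = max(lower_hinge - value, value - upper_hinge, 0)
        if distance > 3 * box_length:
            labelled.append((value, "extreme"))
        elif distance > 1.5 * box_length:
            labelled.append((value, "outlier"))
        else:
            labelled.append((value, "inside"))
    return labelled


scores = [57, 95, 98, 92, 99, 96, 94, 90, 95, 75]
labels = classify_scores(scores)

# Whiskers end at the smallest and largest values that are not outliers.
inside_values = [value for value, label in labels if label == "inside"]
whisker_low, whisker_high = min(inside_values), max(inside_values)

print(labels)
print("whiskers:", whisker_low, "to", whisker_high)
```

With these scores the hinges are 86.25 and 96.5, so the score of 57 lies more than 1.5 but less than 3 box-lengths below the lower hinge and is flagged as an outlier, while the lower whisker stops at 75.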


In the charitable giving example, one of the states (that shall remain nameless) has a high per capita income (around $27,000) but gives only about $1,000 to charity. Notice that the circle for this data point lies beyond the whisker of the “charitable giving” box.


Stem and Leaf

Another useful data display is known as the stem and leaf. This is a simple way of displaying the distribution of data without having to use computer graphics. The characteristic that makes the stem and leaf unique is that every value in the data set is displayed. The stem and leaf “plot” groups the values in a data set according to their all but least significant digits. These are written in ascending or descending order to the left side of a vertical bar and are known as the “stem.” The “leaves” are formed by writing the least significant digits (here, the last three digits) to the right of the vertical bar, on the same line as the more significant digits with which they belong.

The stem and leaf plot below shows the charitable giving for 100 individuals. We can see that the least amount one person gave was $1,082 while the most one person gave was $5,779. Further, we can see that in the $4,000 range, the following exact values were given: $4,018, $4,057, $4,073, $4,095 . . . $4,814. The stem and leaf will vary slightly in appearance depending on the specific software used. Some programs enable you to examine the leaves in detail, by reporting the number of cases, the spread, the value of the lower and upper hinges, etc.

1*** | 082
1*** | 303
1*** |
1*** | 785
1*** | 870,976,985
2*** | 012,040,116
2*** | 212,242,256,296,308
2*** | 448,482,511,511,530,560
2*** | 609,632,686,718,740,785
2*** | 806,829,833,871,885,899,951,963
3*** | 001,010,015,028,030,088,164,170,171,178
3*** | 225,229,237,277,310,358,385,392
3*** | 413,414,439,450,450,502,519,594
3*** | 615,633,638,654,682,738,761
3*** | 813,813,820,834,860,872,897,914,918,955,994
4*** | 018,057,073,095,154,192
4*** | 238,271,342,377,379,387
4*** | 425,426,494,545
4*** |
4*** | 814
5*** | 009
5*** | 273,379
5*** | 501
5*** | 779
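A stem and leaf display is simple enough to build by hand or in a few lines of code. The sketch below uses the ten P102 exam scores rather than the charitable giving data, with the tens digit as the stem and the ones digit as the leaf; the function name and the `leaf_unit` parameter are hypothetical, and the layout is simpler than the five-lines-per-stem display above.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=10):
    """Return display lines: stem = value // leaf_unit, leaf = value % leaf_unit."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        leaves_by_stem[value // leaf_unit].append(value % leaf_unit)

    lines = []
    # Include empty stems so that gaps in the distribution remain visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
display = stem_and_leaf(scores)
print("\n".join(display))
```

Every score can be read back from the display (for example, the last line holds 90, 92, 94, 95, 95, 96, 98, and 99), and the empty stem for the 80s shows the gap between the two low scores and the rest.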


Histogram

The range of values is divided into a finite set of class intervals known as “bins.” The number of values in each bin is then counted and divided by the sample size to obtain the relative frequency of occurrence. The frequency is plotted as vertical bars of varying height. Some programs allow the user to set the number of bins that appear. The frequencies can be divided by the bin width to obtain frequency densities that can be compared to probability densities from a theoretical distribution, such as the Gaussian distribution. For example, in the figure below the Gaussian probability density function is superimposed on the frequency histogram of the charitable giving of 100 individuals.

[Figure: Histogram of Charitable Giving (x axis, 0 to 6,000) against Frequency (y axis, 0 to 24), with a Gaussian curve superimposed]
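The binning procedure can be sketched as follows, using the ten P102 exam scores and four bins of width 10 starting at 60. The bin boundaries, the function name, and its parameters are illustrative choices, not part of the handout.

```python
def histogram(values, start, width, bins):
    """Return counts, relative frequencies, and frequency densities per bin."""
    counts = [0] * bins
    for value in values:
        index = int((value - start) // width)  # which bin the value falls in
        if 0 <= index < bins:
            counts[index] += 1
    n = len(values)
    relative = [count / n for count in counts]        # count / sample size
    density = [freq / width for freq in relative]     # relative frequency / bin width
    return counts, relative, density


scores = [67, 95, 98, 92, 99, 96, 94, 90, 95, 75]
counts, relative, density = histogram(scores, start=60, width=10, bins=4)

for i, count in enumerate(counts):
    low = 60 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count:8s} count={count} "
          f"rel={relative[i]:.1f} density={density[i]:.2f}")
```

Because each density is a relative frequency divided by the bin width, the bar areas add up to 1, which is what allows the bars to be compared with a probability density curve.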