
Descriptive Statistics

Chapter Table of Contents

Introduction
Producing One-Way Frequencies
Computing Summary Statistics
Examining the Distribution
Computing Correlations
References


Chapter 7. Descriptive Statistics

SAS OnlineDoc: Version 8

Chapter 7

Descriptive Statistics

Introduction

Descriptive statistics and plots are often used in the initial phase of a statistical analysis. These tools enable you to identify relationships in the data and to determine directions for further analysis.

Figure 7.1. Descriptive Menu

The Analyst Application provides several types of descriptive statistics and graphical displays.

The Summary Statistics task provides the following information: mean, median, standard error and standard deviation, variance, minimum, maximum, range, sum, skewness and kurtosis, Student’s t and probability value, coefficient of variation, and sums of squares. Graphics in this task include histograms and box-and-whisker plots.

The Distributions task produces statistics such as moments and quantiles as well as measures of location and variability. You can request fitted distributions from the normal, lognormal, Weibull, and exponential distributions. Plots included are the box-and-whisker plot, histogram, probability plot, and quantile-quantile plots. Histograms can be superimposed with fitted curves from the distribution families. Probability and quantile-quantile plots are available for each of the distributions.

The Correlations task gives you the choice of Pearson and Spearman correlations as well as Cronbach’s alpha, Kendall’s tau-b, and Hoeffding’s D. Scatter plots with optional confidence ellipses are available.

The Frequency Counts task provides one-way frequency tables, which include frequencies, percentages, and cumulative frequencies and percentages. Horizontal and vertical bar charts are also available.

The examples in this chapter demonstrate how you can use the Analyst Application to compute one-way frequency tables, obtain summary statistics, examine the distribution of your data, and compute correlations.

Producing One-Way Frequencies

The data set analyzed in the following sections is taken from the 1995 Statistical Abstract of the United States. The data are measures of the birth rate and infant mortality rate for 1992 in the United States. Information is provided for the 50 states and the District of Columbia. The states are grouped by region. Here, these data are considered to be a sample of yearly data.

Suppose you want to determine the frequency of occurrence of the various regions. In the following example, a listing of the frequencies and a bar chart are produced.

In the Frequency Counts task, you can compute one-way frequency tables for the variables in your data set. For each value of your analysis variable, Analyst produces the frequency, percentage, cumulative frequency, and cumulative percentage. You can control the order in which the values appear and specify group and count variables.

Open the Bthdth92 Data Set

The data are provided in the Analyst Sample Library. To open the Bthdth92 data set, follow these steps:

1. Select Tools → Sample Data...
2. Select Bthdth92.
3. Click OK to create the sample data set in your Sasuser directory.
4. Select File → Open By SAS Name...
5. Select Sasuser from the list of Libraries.
6. Select Bthdth92 from the list of members.
7. Click OK to bring the Bthdth92 data set into the data table.

Request Frequency Counts

To request frequency counts, follow these steps:

1. Select Statistics → Descriptive → Frequency Counts...
2. Select region as the frequencies variable from the candidate list.

The default analysis provides the information desired. Note that you can use the Input dialog to select the specific ordering by which the variable values are listed. Figure 7.2 displays the Frequency Counts dialog with region specified as the frequencies variable.

Figure 7.2. Frequency Counts Dialog

Request a Horizontal Bar Chart

To produce a horizontal bar chart in addition to the frequency counts, follow these steps:

1. Click on the Plots button.
2. Select Horizontal, as displayed in Figure 7.3.
3. Click OK to close the Plots dialog.

Figure 7.3. Frequency Counts: Plots Dialog

Click OK in the Frequency Counts main dialog to perform the analysis.

Review the Results

The results are presented in the project tree under the Frequency Counts folder, as displayed in Figure 7.4. The three nodes represent the frequency counts output, the horizontal bar chart, and the SAS programming statements (labeled Code) that generate the output.

Figure 7.4. Frequency Counts: Project Tree

You can double-click on any node in the project tree to view the contents in a separate window. Note that the first output generated is displayed by default. Figure 7.5 displays the table of frequency counts for the variable region.

Figure 7.5. Frequency Counts: One-Way Frequencies of the Variable region

The table shows that about 33% of the observations in the data set are located in the southern region, and roughly 25% are located in each of the western and midwestern regions. Approximately 18% of the observations are located in the northeastern region.

To display the bar chart of the frequency counts, double-click the node labeled Horizontal Bar Chart of REGION (Figure 7.6).

Figure 7.6. Frequency Counts: Horizontal Bar Chart by Region
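The arithmetic behind a one-way frequency table can be sketched outside Analyst. The following Python sketch is illustrative only and is not the code that Analyst generates. The region counts are an assumption: they follow the standard U.S. Census Bureau grouping of the 50 states and the District of Columbia, which is consistent with the percentages reported above, but they are not read from the Bthdth92 data set. The function name one_way_freq is hypothetical.

```python
from collections import Counter

# Assumed counts (standard Census Bureau regions); not read from Bthdth92.
regions = ["MW"] * 12 + ["NE"] * 9 + ["S"] * 17 + ["W"] * 13


def one_way_freq(values):
    """Return (level, freq, percent, cum_freq, cum_percent) for each level."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cum = 0
    for level in sorted(counts):  # levels listed in sorted order
        cum += counts[level]
        rows.append((level, counts[level], 100 * counts[level] / total,
                     cum, 100 * cum / total))
    return rows


table = one_way_freq(regions)
print(f"{'region':<8}{'Freq':>6}{'Pct':>8}{'CumFreq':>9}{'CumPct':>8}")
for level, freq, pct, cum_freq, cum_pct in table:
    print(f"{level:<8}{freq:>6}{pct:>8.2f}{cum_freq:>9}{cum_pct:>8.2f}")
```

The cumulative columns are running totals over the listed order, so changing the order of the levels changes the cumulative values but not the frequencies or percentages.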

Computing Summary Statistics

In this task, summary statistics (such as the mean, standard deviation, and minimum and maximum values) are desired for the birth and infant mortality rates for each region. In addition, box-and-whisker plots are requested.

Request Summary Statistics

To request the Summary Statistics task, follow these steps:

1. Select Statistics → Descriptive → Summary Statistics...
2. Select the analysis variables birth and death from the candidate list. You can specify a classification variable to define groups within your data. When you specify a classification variable, the Analyst Application produces summary statistics for the analysis variables at each level of the classification variable.
3. Select region as the classification variable.

Figure 7.7 displays the Summary Statistics main dialog with birth and death specified as the analysis variables and region specified as the classification variable.

Figure 7.7. Summary Statistics Dialog

Request Box-and-Whisker Plots

To request box-and-whisker plots, follow these steps:

1. Click on the Plots button.
2. Select Box-&-whisker plot.
3. Click OK.

Figure 7.8 displays the Plots dialog with Box-&-whisker plot selected.

Figure 7.8. Summary Statistics: Plots Dialog

To perform the analysis, click OK in the main dialog.

Review the Results

The results are presented in the project tree under the Summary Statistics folder, as displayed in Figure 7.9. The four icons represent the summary statistics output, the box-and-whisker plots for each analysis variable, and the SAS programming statements (labeled Code) that generate the output.

Figure 7.9. Summary Statistics: Project Tree

Double-click on any of the icons to display the corresponding information in a separate window. Figure 7.10 displays, for each value of the classification variable region, the number of observations, the mean, the standard deviation, and the minimum and maximum values of each analysis variable.

The western region has the highest birth rate (16.89) and the southern region has the highest death rate (10.15).

Figure 7.10. Summary Statistics: Statistics for birth and death
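The statistics in Figure 7.10 are computed separately at each level of the classification variable. The following Python sketch shows that grouping logic under stated assumptions: the rows are hypothetical values invented for illustration (they are not taken from Bthdth92), the function name summarize_by_class is hypothetical, and the standard deviation uses the n - 1 divisor.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (region, birth, death) rows for illustration only.
rows = [
    ("MW", 14.0, 8.0), ("MW", 15.0, 9.0), ("MW", 16.0, 10.0),
    ("W", 16.0, 6.0), ("W", 17.0, 7.0), ("W", 18.0, 8.0),
]


def summarize_by_class(rows):
    """N, mean, std dev, min, and max of each variable within each region."""
    groups = defaultdict(lambda: {"birth": [], "death": []})
    for region, birth, death in rows:
        groups[region]["birth"].append(birth)
        groups[region]["death"].append(death)
    summary = {}
    for region, variables in groups.items():
        for name, values in variables.items():
            summary[(region, name)] = {
                "n": len(values),
                "mean": mean(values),
                "std": stdev(values),  # sample standard deviation (n - 1)
                "min": min(values),
                "max": max(values),
            }
    return summary


summary = summarize_by_class(rows)
for (region, name), stats in sorted(summary.items()):
    print(region, name, stats)
```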

Figure 7.11 displays the box-and-whisker plot for the variable birth for each level of the region variable.

Figure 7.11. Summary Statistics: Box-and-Whisker Plot for Birth Rate by Region

This plot reveals a possible outlier in the birth rate for the midwestern region (region=‘MW’). The western region (region=‘W’) is noticeable as the region with the highest birth rate.
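How a box-and-whisker plot can flag a possible outlier is sketched below. This is a sketch under assumptions, not a description of how Analyst draws its plots: it uses the common convention that points beyond 1.5 times the interquartile range from the quartiles are drawn individually, it uses the "inclusive" quartile method of the Python standard library (quartile definitions differ between packages), and the sample values are hypothetical.

```python
from statistics import quantiles


def box_summary(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (min(inside), max(inside)),  # most extreme non-outliers
        "outliers": outliers,
    }


# Hypothetical birth rates for one region; the last value is unusually high.
sample = [13.0, 13.5, 14.0, 14.5, 15.0, 15.5, 16.0, 21.0]
box = box_summary(sample)
print(box)
```

A value listed under "outliers" corresponds to a point plotted beyond the whisker, which is the kind of feature noted for the midwestern region in Figure 7.11.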

Examining the Distribution

You can examine the distributional properties of your data with the Distributions task. This task enables you to produce descriptive statistics for the variables, test the fit of several distributions to your data, and examine displays such as histograms and probability plots.

In this task, interest lies in examining the birth and infant mortality rates for each region.

Request a Distributions Analysis

To request the Distributions task, follow these steps:

1. Select Statistics → Descriptive → Distributions...
2. Select birth and death as the analysis variables.
3. Select region as the classification variable.

Figure 7.12 displays the Distributions main dialog with the preceding variable specifications.

Figure 7.12. Distributions Dialog

The default analysis provides moments, quartiles, and measures of variability.
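The moments mentioned here can be illustrated with a short Python sketch. The skewness and kurtosis formulas below are the usual bias-adjusted sample versions; that Analyst reports exactly these definitions is an assumption. The function name sample_moments is hypothetical, and the formulas require at least four observations.

```python
from statistics import mean, stdev, variance


def sample_moments(x):
    """Mean, variance, std dev, and bias-adjusted skewness and kurtosis."""
    n = len(x)
    if n < 4:
        raise ValueError("at least four observations are required")
    m = mean(x)
    s = stdev(x)  # sample standard deviation (n - 1 divisor)
    z = [(v - m) / s for v in x]
    skewness = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
                * sum(t ** 4 for t in z)
                - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": m, "variance": variance(x), "std": s,
            "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical, evenly spaced values: symmetric, so skewness is zero.
moments = sample_moments([1.0, 2.0, 3.0, 4.0, 5.0])
print(moments)
```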

Request Plots

To request box-and-whisker plots and histograms, follow these steps:

1. Click on the Plots button.
2. Select Box-&-whisker plot.
3. Select Histogram.
4. Click OK.

Figure 7.13 displays the Plots dialog.

Figure 7.13. Distributions: Plots Dialog

Request Fitted Distribution

To fit a normal distribution to these data, follow these steps:

1. Click on the Fit button in the main dialog.
2. Select Normal. By default, parameter values are calculated from the data when you fit the normal distribution. If you want to enter specific parameter values, click on the down arrow (displayed in Figure 7.14) and select Enter values. For the lognormal, exponential, and Weibull distributions, you can specify that parameters be calculated by maximum likelihood estimation (MLE), or you can enter specific parameter values.
3. Click OK.

Figure 7.14. Distributions: Fit Dialog

When you have completed your selections, click OK in the main dialog to perform the analysis. The results are presented in the project tree displayed in Figure 7.15.

Review the Results

Double-click on any of the resulting eight icons to display the corresponding output in a separate window.

Figure 7.15. Distributions: Project Tree

The Moments and Quantiles output provides summary information for each variable. Figure 7.16 displays the output labeled Fitted Distributions of Bthdth92, which summarizes how closely the normal distribution fits each variable, by region.

Figure 7.16. Distributions: Fitted Distributions Results

Based on the test results displayed in Figure 7.16, the null hypothesis that the variable birth is normally distributed cannot be rejected at the α = 0.05 level of significance (p-values for all tests are greater than 0.15). The same is true for the variable death except for the southern region (region=‘S’). The hypothesis is rejected at the α = 0.05 level of significance for the death rate in the southern region.

Two sets of box plots and four sets of histograms are also produced. A single box-and-whisker plot is created for each of the two variables. The box-and-whisker plot for the variable birth is displayed when you double-click Box Plot of BIRTH in the project tree.

Two histograms are created for each variable. Each graphic contains a histogram for two levels of the classification variable region. The first histogram contains the information for the midwestern and northeastern regions (region=‘MW’ and region=‘NE’), as displayed in Figure 7.17. The second histogram (not shown) contains the information for the southern and western regions (region=‘S’ and region=‘W’).

Figure 7.17. Distributions: Histogram for birth

The normal curve overlaid on the histogram displayed in Figure 7.17 is the result of requesting a normal distribution fit in the Fit dialog (Figure 7.14). The statistical details of the fit are located in the output labeled Fitted Distributions of Bthdth92, which also includes the details of the fit for the variable death.
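The idea of fitting a normal distribution with parameters calculated from the data, and then measuring how closely it fits, can be sketched in Python as follows. Several points are assumptions: the parameters are estimated by the sample mean and the sample standard deviation, the Kolmogorov-Smirnov distance is used as one example of a goodness-of-fit statistic (the source does not name the tests that Analyst reports), the conversion of that distance to a p-value is omitted, and the sample values and function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def fit_normal(values):
    """Normal distribution with parameters estimated from the data."""
    return NormalDist(mu=mean(values), sigma=stdev(values))


def ks_distance(values, dist):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def curve_points(dist, lo, hi, steps=50):
    """(x, density) pairs that could be overlaid on a density-scaled histogram."""
    width = (hi - lo) / steps
    return [(lo + k * width, dist.pdf(lo + k * width)) for k in range(steps + 1)]


# Hypothetical birth rates for one region.
sample = [13.0, 14.0, 15.0, 16.0, 17.0]
fitted = fit_normal(sample)
distance = ks_distance(sample, fitted)
curve = curve_points(fitted, min(sample), max(sample))
print(fitted, round(distance, 4))
```

A small distance indicates that the fitted curve tracks the data closely; a formal test compares the statistic with its null distribution to obtain the p-values discussed with Figure 7.16.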

Computing Correlations

You can use the Correlations task to compute pairwise correlation coefficients for the variables in your data set. The correlation is a measure of the strength of the linear relationship between two variables. This task can compute the standard Pearson product-moment correlations, nonparametric measures of association, partial correlations, and Cronbach’s coefficient alpha. The task also can produce scatter plots with confidence ellipses.

The following example computes correlation coefficients for four variables in the Fitness data set. This data set contains measurements made on groups of men taking a physical fitness course at North Carolina State University. The variables are as follows:

age         age, in years
weight      weight, in kilograms
oxygen      oxygen intake rate, in milliliters per kilogram of body weight per minute
runtime     time taken to run 1.5 miles, in minutes
rstpulse    heart rate while resting
runpulse    heart rate while running
maxpulse    maximum heart rate recorded while running
group       group number

This example includes looking at correlations between the variables runtime, runpulse, maxpulse, and oxygen and also producing the corresponding scatter plots with confidence ellipses.
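The two most commonly requested coefficients can be sketched in a few lines of Python. This is an illustration, not the computation that Analyst performs: the data values are hypothetical (they are not the Fitness measurements), the function names are hypothetical, and the Spearman coefficient is computed as the Pearson correlation of average ranks.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical measurements for five subjects.
data = {
    "runtime": [8.0, 9.0, 10.0, 11.0, 12.0],
    "runpulse": [170.0, 168.0, 176.0, 180.0, 174.0],
    "maxpulse": [172.0, 170.0, 180.0, 182.0, 178.0],
    "oxygen": [60.0, 54.0, 50.0, 45.0, 44.0],
}
pairs = list(combinations(data, 2))  # four variables give six pairs
for a, b in pairs:
    print(f"{a:>8} vs {b:<8} pearson={pearson(data[a], data[b]):7.4f} "
          f"spearman={spearman(data[a], data[b]):7.4f}")
```

Four analysis variables yield six distinct pairs, which is why an analysis of four variables produces six pairwise coefficients and six scatter plots.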

Open the Fitness Data Set

To open the Fitness data set, follow these steps:

1. Select Tools → Sample Data...
2. Select Fitness.
3. Click OK to create the sample data set in your Sasuser directory.
4. Select File → Open By SAS Name...
5. Select Sasuser from the list of Libraries.
6. Select Fitness from the list of members.
7. Click OK to bring the Fitness data set into the data table.

Request Correlations

To compute correlations for variables in the Fitness data set, follow these steps:

1. Select Statistics → Descriptive → Correlations...
2. Select the variables runtime, runpulse, maxpulse, and oxygen to correlate.

Figure 7.18 displays the resulting Correlations dialog.

Figure 7.18. Correlations Dialog

If you click OK in the Correlations main dialog, the default output, which includes Pearson correlations, is produced. Alternatively, you can request specific types of correlations by using the Options dialog.

Request a Scatter Plot

To request a scatter plot with a confidence ellipse, follow these steps:

1. Click on the Plots button.
2. Select Scatter plots.
3. Select Add confidence ellipses. The confidence level used in calculating the confidence ellipse is 0.95. To use a different level, type that value in the Probability value: field, as displayed in Figure 7.19.
4. Click OK.

Figure 7.19. Correlations: Plots Dialog

Click OK in the main dialog to perform the analysis.

Review the Results

The results are presented in the project tree, as displayed in Figure 7.20.

Figure 7.20. Correlations: Project Tree

You can double-click on any of the resulting nodes in the project tree to view the information in a separate window. Figure 7.21 displays univariate statistics for each of the analysis variables. The table provides the number of observations, the mean, the standard deviation, the sum, and the minimum and maximum values for each variable.

Figure 7.21. Correlations: Univariate Statistics

Figure 7.22 displays the table of correlations. The p-value, which is the significance probability of the correlation, is displayed under each of the correlation coefficients. For example, the correlation between the variables maxpulse and runtime is 0.22610, with an associated p-value of 0.2213, and the correlation between the variables oxygen and runpulse is -0.39797, with an associated p-value of 0.0266.

Figure 7.22. Correlations: Table of Correlations
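The significance probability of a Pearson correlation is commonly obtained from a t statistic with n - 2 degrees of freedom, t = r * sqrt((n - 2) / (1 - r^2)). The Python sketch below applies that standard test to the two coefficients quoted above so the results can be compared with the reported p-values. Two points are assumptions: that Analyst uses this test, and that the Fitness data set contains n = 31 observations (the sample size is not stated in this chapter). The tail area is approximated with Simpson's rule to keep the sketch free of third-party libraries; the function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def correlation_p_value(r, n, steps=2000):
    """Return (t, two-sided p) for H0: correlation = 0; requires |r| < 1."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r * r))
    upper = abs(t)
    if upper == 0:
        return t, 1.0
    # Simpson's rule (steps must be even) for the area between 0 and |t|.
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return t, 1 - 2 * area


N_ASSUMED = 31  # assumed number of observations; not given in the text
for label, r in (("maxpulse, runtime", 0.22610), ("oxygen, runpulse", -0.39797)):
    t, p = correlation_p_value(r, N_ASSUMED)
    print(f"{label:<18} r={r:8.5f}  t={t:7.3f}  p={p:.4f}")
```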

Six scatter plots, each of which includes a 95% confidence ellipse, are produced in this analysis. Each plot displays the relationship between one pair of the analysis variables. The scatter plot of runtime versus oxygen is displayed in Figure 7.23.

Figure 7.23. Correlations: Scatter Plot with Confidence Ellipse

Confidence ellipses are used as a graphical indicator of correlation. When two variables are uncorrelated, the confidence ellipse is circular in shape. The ellipse becomes more elongated the stronger the correlation is between two variables.
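The link between correlation and the shape of the ellipse can be made concrete. If both variables are standardized to unit variance (an assumption of this sketch, and the setting in which an uncorrelated pair gives a circle), their covariance matrix has eigenvalues 1 + r and 1 - r, and the axes of the ellipse are proportional to the square roots of those eigenvalues. The function name below is hypothetical.

```python
from math import sqrt


def ellipse_axis_ratio(r):
    """Major-to-minor axis ratio for standardized variables with correlation r."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|.
    return sqrt((1 + abs(r)) / (1 - abs(r)))


for r in (0.0, 0.3, 0.6, 0.9):
    print(f"r = {r:.1f}  axis ratio = {ellipse_axis_ratio(r):.2f}")
```

A ratio of 1 corresponds to a circle, and the ratio grows without bound as the correlation approaches 1 or -1. The confidence level changes the overall size of the ellipse but not this ratio.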

References

SAS Institute Inc. (1999), SAS Procedures Guide, Version 7-1, Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999), SAS/STAT User’s Guide, Version 7-1, Cary, NC: SAS Institute Inc.

Schlotzhauer, Sandra D. and Littell, Ramon C. (1991), SAS System for Elementary Statistical Analysis, Second Edition, Cary, NC: SAS Institute Inc.

U.S. Bureau of the Census (1995), Statistical Abstract of the United States, Washington, D.C.


The correct bibliographic citation for this manual is as follows: SAS Institute Inc., The Analyst Application, First Edition, Cary, NC: SAS Institute Inc., 1999. 476 pp.

The Analyst Application, First Edition
Copyright © 1999 SAS Institute Inc., Cary, NC, USA.
ISBN 1–58025–446–2

All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, by any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute, Inc.

U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of the software by the government is subject to restrictions as set forth in FAR 52.227–19 Commercial Computer Software-Restricted Rights (June 1987).

SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, October 1999

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

IBM®, ACF/VTAM®, AIX®, APPN®, MVS/ESA®, OS/2®, OS/390®, VM/ESA®, and VTAM® are registered trademarks or trademarks of International Business Machines Corporation. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

The Institute is a private company devoted to the support and further development of its software and related services.