Descriptive Statistics and Exploratory Data Analysis
Dean’s Faculty and Resident Development Series
UT College of Medicine Chattanooga
Probasco Auditorium at Erlanger
January 14, 2008
Marc Loizeaux, PhD
Department of Mathematics, University of Tennessee at Chattanooga
What is descriptive statistics? Descriptive statistics describes your data. Visual and Numerical
Inferential statistics draws inferences about a larger population. Estimation and hypothesis testing
The Big Picture Statistics
Why descriptive statistics?
– To summarize our data
– To help us get to know our data
– To help us describe our data to an audience
– To help us explore our data
What is Exploratory Data Analysis? “Exploratory data analysis is detective work – numerical detective work – or counting detective work – or graphical detective work” - John Wilder Tukey, Exploratory Data Analysis, page 1
Exploring our data
– Gives us an overall view
– Helps us consider basic assumptions
– Helps us spot oddball values
– Helps us avoid embarrassing oversights
– May help us decide on the next step
Visual Descriptions (Tools for exploring your data visually)
Charts and Graphs – Histogram – Dotplot – Stem and leaf plot – Boxplot – Scatterplot – And many more
A simple example Grades on the first exam
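The exam grades from the original slide are not reproduced here, so the sketch below uses hypothetical scores (made up for illustration) to show how one of the listed displays, the stem and leaf plot, can be built by hand. The function name `stem_and_leaf` and the choice of tens as stems are assumptions, not part of the talk.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers.

    The stem is the value with its last digit removed (the tens),
    and the leaf is the last digit (the ones).
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)

    lines = []
    # Walk every stem in range so that empty stems still show up as gaps.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines


# Hypothetical exam scores, made up for illustration (not the data from the talk).
scores = [62, 71, 75, 78, 80, 84, 88, 91, 55, 78]

for line in stem_and_leaf(scores):
    print(line)
```

Each row reads as a sideways histogram bar that still shows the individual scores, which is why this display works well for small data sets.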
Numerical Descriptions (Univariate, interval data) We want to describe…
– The central tendency of the data: What is a center point for the data? What is a typical score?
– The variation of the data: How much spread is there in the data? How far apart are the data values from each other?
Measures of Central Tendency The mean is the arithmetic average. – Easy to calculate, easy to understand – The balance point of the data
The median is the score in the middle once the scores are sorted (with an even number of scores, the average of the two middle scores). – Resistant to extreme scores
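A minimal sketch of the difference between the two measures, using Python's standard `statistics` module. The scores are hypothetical, made up for illustration; one extreme score is added to show that the mean moves noticeably while the median barely changes.

```python
from statistics import mean, median

# Hypothetical exam scores, made up for illustration (not the data from the talk).
scores = [62, 71, 75, 78, 80, 84, 88]

# The same scores with one extreme low score added.
scores_with_outlier = scores + [5]

mean_before, median_before = mean(scores), median(scores)
mean_after, median_after = mean(scores_with_outlier), median(scores_with_outlier)

print(f"without outlier: mean = {mean_before:.2f}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
```

In this made-up example the single extreme score pulls the mean down by about nine points, while the median shifts by only a point and a half.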
Measures of Dispersion The range. – Easy to calculate and quick Range = high score – low score – Limited – only considers two scores
The standard deviation. – More complicated, but… – Indicates a “typical” deviation from the mean
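Both measures can be sketched in a few lines of Python. The scores are hypothetical, made up for illustration, and the sketch assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes; the talk does not say which version it uses.

```python
from statistics import mean, stdev

# Hypothetical exam scores, made up for illustration (not the data from the talk).
scores = [62, 71, 75, 78, 80, 84, 88]

# Range: only the highest and lowest scores are used.
score_range = max(scores) - min(scores)

# Standard deviation, step by step: deviations from the mean,
# squared, summed, divided by n - 1, then square-rooted.
center = mean(scores)
deviations = [x - center for x in scores]
sample_variance = sum(d * d for d in deviations) / (len(scores) - 1)
sd_by_hand = sample_variance ** 0.5

# The library call gives the same sample standard deviation.
sd_library = stdev(scores)

print(f"range = {score_range}")
print(f"standard deviation = {sd_by_hand:.2f} (by hand), {sd_library:.2f} (library)")
```

Unlike the range, every score contributes to the standard deviation, so it reflects the spread of the whole data set rather than just its two extremes.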
Childhood Respiratory Disease (playing with the data)
Data available from OzDASL, StatSci.org
FEV (forced expiratory volume) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort.
The data: determinations of FEV on 654 children ages 6-22 who were seen in the Childhood Respiratory Disease Study in 1980 in East Boston, Massachusetts. The data are part of a larger study to follow the change in pulmonary function over time in children.
Source:
– Tager, I. B., Weiss, S. T., Rosner, B., and Speizer, F. E. (1979). Effect of parental cigarette smoking on pulmonary function in children. American Journal of Epidemiology, 110, 15-26.
– Rosner, B. (1990). Fundamentals of Biostatistics, 3rd Edition. PWS-Kent, Boston, Massachusetts.
Some of the Data [table not reproduced; first column: ID]
Descriptive Statistics [table not reproduced; variable: Age]
Pictures may say more
The ages look like this
One variable, then two… A univariate exploration – Explore each data column individually
A multivariate exploration – Explore the relationships between two data columns
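A scatterplot is the natural picture for two columns; one numerical companion to it is the Pearson correlation coefficient, which the talk does not mention and which is offered here only as an assumed example. The sketch below computes it from scratch on two hypothetical columns (made-up values, not the study data); the name `pearson_r` is illustrative.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric columns."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Two hypothetical data columns, made up for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

A single number like this only summarizes straight-line association, so it is best read alongside the scatterplot rather than in place of it.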
Consider natural subgroups
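One way to act on this is to repeat the same numerical summary within each subgroup. The sketch below is a minimal illustration: the records, the field names `group` and `value`, and the helper `summarize_by_group` are all hypothetical and are not taken from the study data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical records with made-up values (not the study data).
# "group" and "value" are placeholder field names.
records = [
    {"group": "A", "value": 2.0},
    {"group": "A", "value": 3.0},
    {"group": "A", "value": 4.0},
    {"group": "B", "value": 1.0},
    {"group": "B", "value": 5.0},
]


def summarize_by_group(rows, key, field):
    """Collect `field` values per `key` and summarize each subgroup."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[field])
    return {
        name: {
            "n": len(vals),
            "mean": mean(vals),
            "median": median(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for name, vals in groups.items()
    }


summary = summarize_by_group(records, "group", "value")
for name, stats in sorted(summary.items()):
    print(name, stats)
```

In this made-up example both subgroups have the same mean but different spreads, a difference that a single overall summary would hide.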
Raising more questions?
It starts to make sense
Something else to study?
Preparing for an Audience
Some Do’s
– Pick and choose your graphs
– Include appropriate numbers for your type of data
– Include narrative
  Does the histogram indicate asymmetry?
  Are there unexpected values in the data set?
  Are there special problems you had to deal with to describe the data?
Preparing for an Audience (2)
Some Don’ts
– Don’t include everything – that just confuses us.
– Don’t be redundant – some graphs say the same thing.
– Don’t include descriptors you don’t understand (kurtosis?) – ask the chauffeur
Points to Remember (in no particular order)
– Don’t skip the simple stuff!
– Spend time playing with your data.
– Pictures say a lot.
– Describe the spread as well as the center.
– Consider the natural subgroups in your data.
Next Time
– Confidence Intervals, Hypothesis Tests, and Statistical Significance
– 2 x 2 tables
Monday, February 11