
Types of Data, Descriptive Statistics, and Statistical Tests for Nominal Data

Patrick F. Smith, Pharm.D.
University at Buffalo, Buffalo, New York

NONPARAMETRIC STATISTICS

I. DEFINITIONS

A. Parametric statistics
   1. Variable of interest is a measured quantity.
   2. Assumes that the data follow some distribution that can be described by specific parameters
      a. Typically a normal distribution
   3. Example: There are an infinite number of normal distributions, all of which can be uniquely defined by a mean and standard deviation (SD).
B. Nonparametric statistics
   1. Variable of interest is not a measured quantity. Mean and SD have little meaning.
   2. Does not make any assumptions about the distribution of the data
   3. "Distribution-free" statistics
C. Dependent variable
   1. The variable of interest, the outcome of which depends on something else
D. Independent variable
   1. The variable being tested for an effect on the dependent variable
E. Example
   1. Does high-dose ciprofloxacin lead to seizures?
      a. Seizures = dependent variable
      b. Dose = independent variable

II. NONPARAMETRIC STATISTICS

A. Developed primarily to deal with categorical data (non-continuous data)
   1. Example: disease vs. no disease; dead vs. alive
B. Nonparametric statistical tests may also be used on continuous data sets.
   1. Removes the requirement to assume a normal distribution
   2. However, it also throws out some information, as continuous data contain information in the way that variables are related.

Some Commonly Used Statistical Tests

Normal theory-based test                 Corresponding nonparametric test                Purpose of test
t test for independent samples           Mann-Whitney U test; Wilcoxon rank sum test     Compares two independent samples
Paired t test                            Wilcoxon matched-pairs signed-rank test         Examines a set of differences
Pearson correlation coefficient          Spearman rank correlation coefficient           Assesses the linear association between two variables
One-way analysis of variance (F test)    Kruskal-Wallis analysis of variance by ranks    Compares three or more groups
Two-way analysis of variance             Friedman two-way analysis of variance           Compares groups classified by two different factors


III. NONPARAMETRIC PROS AND CONS

A. Nonparametric pros
   1. Nonparametric tests make less stringent demands of the data.
      a. For a parametric test to be valid, certain underlying assumptions must be met.
         i. Example: For a paired t test, assume that the data are drawn from a normal distribution, every observation is independent of the others, the SDs of the two populations are equal, and the data are continuous.
      b. Nonparametric tests do not require these assumptions.
         i. Can be used to evaluate data that are not continuous
         ii. No assumptions about distributions, independence, etc.
B. Nonparametric cons
   1. If used on a continuous data set, nonparametric tests throw away information inherent in continuous data.
   2. Reduces power to detect a statistical difference
      a. A more conservative approach
   3. Example: For data from a normally distributed population, if the Wilcoxon signed-rank test requires 1000 observations to demonstrate statistical significance, a t test will require only 955.

IV. CONTINGENCY TABLES

A. Contingency tables are used to examine the relationship between subjects' scores on two qualitative or categorical variables.
B. One variable determines the row categories; the other variable defines the column categories.
C. Example: In studying the association between smoking and disease, the row categories in the table below denote smoking status, while the columns denote the presence or absence of disease.

              A. Counts              B. Row percentages
              Disease                Disease
   Smoke      Yes      No            Yes      No       Total
     Yes       13      37            26%      74%      100%
     No         6     144             4%      96%      100%

V. CHI-SQUARED TEST

A. Commonly used procedure; uses contingency tables
B. Used to evaluate unpaired samples (unrelated groups)
C. Often used to evaluate proportions
D. Example: Is there a difference in the proportion of viral infections in patients administered a vaccine? (12/100 vs. 2/100)
E. Assumes nominal data (no ordering between variable groups)
F. Limited when the number of subjects in any "cell" is low (rule of thumb, <5)
G. General logic
   1. Given two groups (vaccine vs. control), the EXPECTED infection rate, if the vaccine has no effect, would be equal in the two groups. This is the null hypothesis. The chi-squared test compares the EXPECTED frequency of a particular event to the OBSERVED frequency in the population of interest.
H. Formulas

   \chi^2 = \sum \frac{(O - E)^2}{E}, \quad df = (r - 1)(c - 1)

   Expected frequency (E) for each cell:

   E_{ij} = \frac{T_i \times T_j}{N}

   where T_i is the row total, T_j is the column total, and N is the grand total.
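As an illustration of these formulas, here is a minimal Python sketch that builds the expected-frequency table and chi-squared statistic from an observed 2 x 2 table. It uses the 12/100 vs. 2/100 infection counts from item D above; the handout does not say which group is which, so the row assignment here is an assumption:

```python
# Chi-squared by hand: E_ij = (row total x column total) / N,
# then sum (O - E)^2 / E over all cells; df = (r - 1)(c - 1).
import numpy as np
from scipy.stats import chi2

observed = np.array([[12, 88],   # group 1: infected, not infected (assumed)
                     [2, 98]])   # group 2: infected, not infected (assumed)

row_totals = observed.sum(axis=1, keepdims=True)   # T_i
col_totals = observed.sum(axis=0, keepdims=True)   # T_j
N = observed.sum()

expected = row_totals * col_totals / N             # E_ij
stat = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, df)                        # upper-tail probability
print(stat, df, p_value)
```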

I. Distribution

   [Figure: chi-squared distribution (probability density curves plotted over the range 0 to 24)]

Chi-squared, by strict definition, is not a true nonparametric test: it assumes a distribution that can be described by a single parameter, the degrees of freedom.

J. Chi-squared example problems (refer to Example Problems handout)

VI. FISHER'S EXACT TEST

A. Alternative to chi-squared for 2 x 2 contingency tables
   1. Improves accuracy when expected frequencies are small (<5) or the sample size is small (n = 20)
   2. Calculates exact probabilities

   For a 2 x 2 table with cells a, b, c, d:

          a        b       (a + b)
          c        d       (c + d)
       (a + c)  (b + d)       N

   p(\text{outcome}) = \frac{(a+b)! \, (c+d)! \, (a+c)! \, (b+d)!}{N! \, a! \, b! \, c! \, d!}
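A small Python sketch of this computation, with hypothetical cell counts a, b, c, d. The factorial expression gives the probability of one particular table under fixed margins; scipy.stats.fisher_exact sums such probabilities over all tables at least as extreme:

```python
# Fisher's exact test for a 2 x 2 table (hypothetical counts).
from math import factorial
from scipy.stats import fisher_exact

a, b, c, d = 3, 7, 8, 2
N = a + b + c + d

# Probability of this exact table (formula above):
p_table = (factorial(a + b) * factorial(c + d) *
           factorial(a + c) * factorial(b + d)) / (
           factorial(N) * factorial(a) * factorial(b) *
           factorial(c) * factorial(d))
print(p_table)

# The test p-value sums such probabilities over all tables
# at least as extreme as the observed one:
odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
print(p_value)
```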

VII. MCNEMAR'S TEST OF SYMMETRY

A. The chi-squared test requires samples to be independent of each other.
B. McNemar's test is used when samples are related (similar to the paired t test).
C. There are often times when measures may be repeated.
D. Example: Does drug X cause insomnia?
   1. Patients may be questioned about insomnia before and after starting the drug.
   2. The researcher asks, "Do more patients have insomnia since starting the drug?"
E. Refer to Example Problems handout; a minimal sketch of the 2 x 2 case follows below.
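A sketch of the 2 x 2 case for the insomnia example, with hypothetical before/after counts (the real numbers are deferred to the Example Problems handout). Only the discordant pairs, patients whose answer changed, contribute to the statistic:

```python
# McNemar's test for paired yes/no data (insomnia before vs. after drug X).
# The counts below are hypothetical.
from scipy.stats import chi2

#                    after: insomnia   after: no insomnia
# before: insomnia          20                  5          (b = 5 improved)
# before: none              15                 60          (c = 15 worsened)
b, c = 5, 15                    # the two discordant cells

stat = (b - c) ** 2 / (b + c)   # chi-squared with 1 df, no continuity correction
p = chi2.sf(stat, df=1)
print(stat, p)
```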

VIII. KRUSKAL-WALLIS TEST

A. Compares two or more independent samples
B. Values of a variable are transformed to ranks.
   1. Tests that there is no shift in the center of the groups (that is, that the centers do not differ)
C. If there are only two groups, the procedure reduces to the Mann-Whitney test, the analogue of the unpaired t test (see the sketch below).
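A minimal sketch with hypothetical data (the group values are invented for illustration); scipy pools and ranks the observations internally:

```python
# Kruskal-Wallis test across three independent groups (hypothetical data).
from scipy.stats import kruskal, mannwhitneyu

group_a = [2.1, 3.5, 2.8, 4.0, 3.2]
group_b = [3.9, 4.2, 5.1, 4.8, 4.4]
group_c = [1.9, 2.2, 3.0, 2.5, 2.7]

stat, p = kruskal(group_a, group_b, group_c)
print(stat, p)

# With only two groups, the rank-based comparison is the Mann-Whitney test:
stat2, p2 = mannwhitneyu(group_a, group_b)
print(stat2, p2)
```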

IX. WILCOXON SIGNED-RANK TEST

A. Nonparametric analogue of the paired t test
B. Compares the rank values of variables pair by pair
   1. The sums of the ranks associated with positive and negative differences are computed.
   2. The test statistic is the lesser of the two sums of ranks.
C. Refer to Example Problems handout


X. SPEARMAN RANK CORRELATION COEFFICIENT

A. Nonparametric analogue of linear regression and the correlation coefficient (r)

   r_s = 1 - \frac{6 \sum d^2}{n^3 - n}

   where d = difference of ranks at each point

B. Example:

   Height   Rank   Weight   Rank    d
   31        1      7.7      2     -1
   32        2      8.3      3     -1
   33        3      7.6      1      2
   34        4      9.1      4      0
   35        5.5    9.6      5      0.5
   35        5.5    9.9      6     -0.5

   r_s = 1 - 6[(-1)^2 + (-1)^2 + 2^2 + 0^2 + 0.5^2 + (-0.5)^2] / (6^3 - 6) = 1 - 6(6.5)/210 = 0.81

For statistical significance, can look up critical values from table or obtain from software package.
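The same example can be checked in software; here is a sketch with scipy.stats.spearmanr on the height/weight data above. Note that with the tied heights of 35, scipy uses average ranks and computes a Pearson correlation on the ranks, so its value can differ slightly from the simplified d-squared formula:

```python
# Spearman rank correlation for the height/weight example above.
from scipy.stats import spearmanr

height = [31, 32, 33, 34, 35, 35]
weight = [7.7, 8.3, 7.6, 9.1, 9.6, 9.9]

r_s, p = spearmanr(height, weight)
print(r_s, p)   # r_s is approximately 0.81, matching the hand calculation
```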

Example Problem 1: Association between tryptophan dietary supplements and eosinophilia-myalgia syndrome (EMS). A number of subjects from a particular area are evaluated; 80 patients with EMS were identified, along with 200 matched controls. Is there a statistically significant association between tryptophan use and EMS?

Unrelated groups, categorical (yes/no) data: chi-squared is appropriate.

Observed results:

                        EMS    No EMS   Total
   Tryptophan: Yes       42       34      76
   Tryptophan: No        38      166     204
   Total                 80      200     280

(42 of 76 patients taking tryptophan had EMS, compared with 38 of 204 not taking tryptophan.)

Expected values if no association exists (null hypothesis):

                        EMS    No EMS   Total
   Tryptophan: Yes      21.7     54.3     76
   Tryptophan: No       58.3    145.7    204
   Total                 80      200     280

The rate of EMS in the overall population, assuming no effect, would be 80/280 (28.6%): 0.286 x 76 = 21.7 and 0.286 x 204 = 58.3. The No EMS cells can then be calculated by subtracting from the totals (e.g., 76 - 21.7 = 54.3).

   E11 = (76 x 80)/280 = 21.7      E12 = (76 x 200)/280 = 54.3
   E21 = (204 x 80)/280 = 58.3     E22 = (204 x 200)/280 = 145.7

To evaluate significance, one needs a mean and a measure of dispersion (e.g., standard deviation, standard error, variance). The chi-squared test is based on a Poisson distribution (where mean = variance); therefore, the chi-squared test assumes that the variance is equal to the expected mean value.

   \chi^2 = \sum \frac{(O - E)^2}{E}

Therefore, in this example:

   \chi^2 = (42 - 21.7)^2/21.7 + (34 - 54.3)^2/54.3 + (38 - 58.3)^2/58.3 + (166 - 145.7)^2/145.7 = 36.4

Look up the result in a chi-squared table (a 2 x 2 contingency table has 1 degree of freedom). To be significant at the 0.05 level, chi-squared must be > 3.84. Since 36.4 >> 3.84, the result is highly significant.
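As a cross-check, a short scipy sketch on the same table. Setting correction=False disables the Yates continuity correction so the statistic matches the hand computation:

```python
# Verifying the tryptophan/EMS example (chi-squared approx 36.4 with 1 df).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[42, 34],     # tryptophan: EMS, no EMS
                     [38, 166]])   # no tryptophan: EMS, no EMS

stat, p, df, expected = chi2_contingency(observed, correction=False)
print(stat, df, p)   # approx 36.4, 1, p << 0.05
print(expected)      # [[21.7, 54.3], [58.3, 145.7]], as in the table above
```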


Critical Values for the Chi-Squared Distribution

                          Significance Level
 df     0.10      0.05      0.025     0.01      0.005
  1    2.7055    3.8415    5.0239    6.6349    7.8794
  2    4.6052    5.9915    7.3778    9.2104   10.5965
  3    6.2514    7.8147    9.3484   11.3449   12.8381
  4    7.7794    9.4877   11.1433   13.2767   14.8602
  5    9.2363   11.0705   12.8325   15.0863   16.7496
  6   10.6446   12.5916   14.4494   16.8119   18.5475
  7   12.0170   14.0671   16.0128   18.4753   20.2777
  8   13.3616   15.5073   17.5345   20.0902   21.9549
  9   14.6837   16.9190   19.0228   21.6660   23.5893
 10   15.9872   18.3070   20.4832   23.2093   25.1881
 11   17.2750   19.6752   21.9200   24.7250   26.7569
 12   18.5493   21.0261   23.3367   26.2170   28.2997
 13   19.8119   22.3620   24.7356   27.6882   29.8193
 14   21.0641   23.6848   26.1189   29.1412   31.3194
 15   22.3071   24.9958   27.4884   30.5780   32.8015
 16   23.5418   26.2962   28.8453   31.9999   34.2671
 17   24.7690   27.5871   30.1910   33.4087   35.7184
 18   25.9894   28.8693   31.5264   34.8052   37.1564
 19   27.2036   30.1435   32.8523   36.1908   38.5821
 20   28.4120   31.4104   34.1696   37.5663   39.9969
 21   29.6151   32.6706   35.4789   38.9322   41.4009
 22   30.8133   33.9245   36.7807   40.2894   42.7957
 23   32.0069   35.1725   38.0756   41.6383   44.1814
 24   33.1962   36.4150   39.3641   42.9798   45.5584
 25   34.3816   37.6525   40.6465   44.3140   46.9280
 26   35.5632   38.8851   41.9231   45.6416   48.2898
 27   36.7412   40.1133   43.1945   46.9628   49.6450
 28   37.9159   41.3372   44.4608   48.2782   50.9936
 29   39.0875   42.5569   45.7223   49.5878   52.3355
 30   40.2560   43.7730   46.9792   50.8922   53.6719
 31   41.4217   44.9853   48.2319   52.1914   55.0025
 32   42.5847   46.1942   49.4804   53.4857   56.3280
 33   43.7452   47.3999   50.7251   54.7754   57.6483
 34   44.9032   48.6024   51.9660   56.0609   58.9637
 35   46.0588   49.8018   53.2033   57.3420   60.2746
 36   47.2122   50.9985   54.4373   58.6192   61.5811
 37   48.3634   52.1923   55.6680   59.8926   62.8832
 38   49.5126   53.3835   56.8955   61.1620   64.1812
 39   50.6598   54.5722   58.1201   62.4281   65.4753
 40   51.8050   55.7585   59.3417   63.6908   66.7660
 41   52.9485   56.9424   60.5606   64.9500   68.0526
 42   54.0902   58.1240   61.7767   66.2063   69.3360
 43   55.2302   59.3035   62.9903   67.4593   70.6157
 44   56.3685   60.4809   64.2014   68.7096   71.8923
 45   57.5053   61.6562   65.4101   69.9569   73.1660
 46   58.6405   62.8296   66.6165   71.2015   74.4367
 47   59.7743   64.0011   67.8206   72.4432   75.7039
 48   60.9066   65.1708   69.0226   73.6826   76.9689
 49   62.0375   66.3387   70.2224   74.9194   78.2306
 50   63.1671   67.5048   71.4202   76.1538   79.4898


Example Problem 2: A sociological study evaluated the characteristics of marriage by religion; 256 people were surveyed for religion and marital status. The results were as follows:

                    Protestant   Catholic   Jewish   None   Other   Total
   Never married        29          16         8      20      0       73
   Married              75          21        11      19      1      127
   Divorced             21           6         3      13      0       43
   Separated             8           3         1       0      1       13
   Total               133          46        23      52      2      256

Is there a relationship between marital status and religion?

SYSTAT chi-squared output:

   WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
            Significance tests computed on this table are suspect.

   Test statistic            Value        df      Prob
   Pearson chi-squared      22.718    12.000     0.030

What happened??

Omitting sparse cells (leave out 'Other' and 'Separated'):

                    Protestant   Catholic   Jewish   None   Total
   Never married        29          16         8      20      73
   Married              75          21        11      19     126
   Divorced             21           6         3      13      43
   Total               125          43        22      52     242

   Test statistic            Value        df      Prob
   Pearson chi-squared      10.368     6.000     0.110

There is no statistically significant difference between the groups (p = 0.11).
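To reproduce this output, here is a scipy sketch on the reduced table; chi2_contingency computes the expected frequencies and the Pearson statistic (no continuity correction is applied for tables larger than 2 x 2). Per the output above, expect chi-squared of about 10.37 with df = 6 and p of about 0.11:

```python
# Religion-by-marital-status analysis after dropping the sparse
# 'Separated' row and 'Other' column, as in the handout.
import numpy as np
from scipy.stats import chi2_contingency

#                   Protestant  Catholic  Jewish  None
reduced = np.array([[29, 16,  8, 20],    # never married
                    [75, 21, 11, 19],    # married
                    [21,  6,  3, 13]])   # divorced

stat, p, df, expected = chi2_contingency(reduced)
print(stat, df, p)
print(expected)   # check which cells are still sparse (< 5)
```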

Example Problem 3: McNemar Test of Symmetry

In November of 1993, the U.S. Congress approved the North American Free Trade Agreement (NAFTA). Let's say that two months before the approval and before the televised debate between Vice President Al Gore and businessman Ross Perot, political pollsters queried a sample of 350 people, asking "Are you for, unsure, or against NAFTA?" Immediately after the debate, the pollsters contacted the same people and asked the question a second time. Here are the results:

BEFORE$ (rows) by AFTER$ (columns): counts

                       AFTER$
   BEFORE$        for   unsure   against   Total
   for             51       22        28     101
   unsure          46       18        27      91
   against         52       49        57     158
   Total          149       89       112     350

BEFORE$ (rows) by AFTER$ (columns): percents of total count

                       AFTER$
   BEFORE$          for   unsure   against     Total      N
   for           14.571    6.286     8.000    28.857    101
   unsure        13.143    5.143     7.714    26.000     91
   against       14.857   14.000    16.286    45.143    158
   Total         42.571   25.429    32.000   100.000    350

   Test statistic                    Value        df      Prob
   Pearson chi-squared              11.473     4.000     0.022
   McNemar symmetry chi-squared     22.039     3.000     0.000

The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those along the diagonal are not used in the computations). We are investigating the direction of change in opinion. First, how many respondents became more negative about NAFTA? Among those who initially responded For, 22 (6.29%) are now Unsure and 28 (8%) are now Against. Among those who were Unsure before the debate, 27 (7.71%) answered Against afterwards. The three cells in the upper right contain counts for those who became more unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in the lower left contain counts for people who became more positive about NAFTA (46, 52, and 49), or 42% of the sample.

The null hypothesis for the McNemar test is that the changes in opinion are equal. The chi-squared statistic for this test is 22.039 with 3 df and p < 0.0005. You reject the null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the anti-NAFTA shift.
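A sketch of this computation: the symmetry (McNemar/Bowker) statistic sums (n_ij - n_ji)^2 / (n_ij + n_ji) over the off-diagonal pairs, which reproduces the 22.039 reported above:

```python
# McNemar/Bowker test of symmetry on the 3 x 3 NAFTA table.
import numpy as np
from scipy.stats import chi2

#                  after: for  unsure  against
nafta = np.array([[51, 22, 28],    # before: for
                  [46, 18, 27],    # before: unsure
                  [52, 49, 57]])   # before: against

k = nafta.shape[0]
stat = sum((nafta[i, j] - nafta[j, i]) ** 2 / (nafta[i, j] + nafta[j, i])
           for i in range(k) for j in range(i + 1, k))
df = k * (k - 1) // 2              # 3 off-diagonal pairs -> 3 df
p = chi2.sf(stat, df)
print(stat, df, p)                 # approx 22.04, 3, p < 0.0005
```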


Example Problem 4: Wilcoxon Signed-Rank Test

Evaluate the effect of a diuretic in healthy volunteers:

   Subject   Daily UOP,   Daily UOP   Difference   Rank of        Signed rank
             no drug      + drug                   |difference|   of difference
      1        1600         1490         -110           5              -5
      2        1850         1300         -550           6              -6
      3        1300         1400         +100           4              +4
      4        1500         1410          -90           3              -3
      5        1400         1350          -50           2              -2
      6        1010         1000          -10           1              -1

W = sum of signed ranks = -13

If the drug has no effect, the ranks associated with a positive change should be similar to the ranks associated with a negative change; hence, the sum (W) should be approximately 0. How large must W be to call this a statistically significant difference? Refer to the critical values table:

    N    Critical value     P
    5         15          .062
    6         21          .032
    6         19          .062
    7         28          .016
    7         24          .046
    8         32          .024
    8         28          .054
    9         39          .020
    9         33          .054
   10         45          .02
   10         39          .048
   11         52          .018
   11         44          .054
   12         58          .02
   12         50          .052
   13         65          .022
   13         57          .048
   14         73          .02
   14         63          .05
   15         80          .022
   15         70          .048

Here N = 6 and |W| = 13, which is smaller than the critical value of 19 (p = .062), so the difference is not statistically significant.

*Due to the nature of the discrete possible values of W, p values at traditional breakpoints (e.g., p = 0.05) are usually not possible.
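For completeness, a sketch of the same data run through scipy. Note that scipy.stats.wilcoxon reports the smaller of the two rank sums (here 4, from subject 3's +100) rather than the sum of signed ranks W = -13; both conventions lead to the same exact p-value for this small sample:

```python
# Wilcoxon signed-rank test on the diuretic data above.
from scipy.stats import wilcoxon

no_drug = [1600, 1850, 1300, 1500, 1400, 1010]
on_drug = [1490, 1300, 1400, 1410, 1350, 1000]

stat, p = wilcoxon(no_drug, on_drug)  # exact distribution is used for small n
print(stat, p)  # statistic 4.0; p approx 0.22, not significant at 0.05
```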
