Types of Data, Descriptive Statistics, and Statistical Tests for Nominal Data
Patrick F. Smith, Pharm.D. University at Buffalo Buffalo, New York
NONPARAMETRIC STATISTICS

I. DEFINITIONS
   A. Parametric statistics
      1. Variable of interest is a measured quantity.
      2. Assumes that the data follow some distribution that can be described by specific parameters
         a. Typically a normal distribution
      3. Example: There are an infinite number of normal distributions, each of which can be uniquely defined by a mean and standard deviation (SD).
   B. Nonparametric statistics
      1. Variable of interest is not a measured quantity. Mean and SD have little meaning.
      2. Does not make any assumptions about the distribution of the data
      3. "Distribution-free" statistics
   C. Dependent variable
      1. The variable of interest, the outcome of which is dependent on something else
   D. Independent variable
      1. The variable that is being tested for an effect on the dependent variable
   E. Example
      1. Does high-dose ciprofloxacin lead to seizures?
         a. Seizures = dependent variable
         b. Dose = independent variable
II. NONPARAMETRIC STATISTICS
   A. Developed primarily to deal with categorical data (noncontinuous data)
      1. Example: disease vs. no disease; dead vs. alive
   B. Nonparametric statistical tests may be used on continuous data sets.
      1. Removes the requirement to assume a normal distribution
      2. However, it also throws out some information, because ranks keep only the order of the values and discard the size of the differences between them.
Some Commonly Used Statistical Tests

Normal theory-based test                Corresponding nonparametric test                Purpose of test
t test for independent samples          Mann-Whitney U test; Wilcoxon rank sum test     Compares two independent samples
Paired t test                           Wilcoxon matched-pairs signed-rank test         Examines a set of differences
Pearson correlation coefficient         Spearman rank correlation coefficient           Assesses the linear association between two variables
One-way analysis of variance (F test)   Kruskal-Wallis analysis of variance by ranks    Compares three or more groups
Two-way analysis of variance            Friedman two-way analysis of variance           Compares groups classified by two different factors
III. NONPARAMETRIC PROS AND CONS
   A. Nonparametric pros
      1. Nonparametric tests make less stringent demands of the data.
         a. For a parametric test to be valid, certain underlying assumptions must be met.
            i. Example: For a t test for independent samples, assume that the data are drawn from a normal distribution, every observation is independent of the others, the SDs of the two populations are equal, and the data are continuous.
         b. Nonparametric tests do not require all of these assumptions.
            i. Can be used to evaluate data that are not continuous
            ii. No assumption of a normal distribution or of equal SDs
   B. Nonparametric cons
      1. If used for a continuous data set, nonparametric tests throw away information inherent in continuous data.
      2. Reduces power to detect a statistical difference
         a. A more conservative approach
      3. Example: For data from a normally distributed population, if the Wilcoxon signed-rank test requires 1000 observations to demonstrate statistical significance, a t test will require only 955.

IV. CONTINGENCY TABLES
   A. Contingency tables are used to examine the relationship between subjects' scores on two qualitative or categorical variables.
   B. One variable determines the row categories; the other variable defines the column categories.
   C. Example: In studying the association between smoking and disease, the row categories in the tables below denote smoking status, while the columns denote the presence or absence of disease.

      Table A (counts)
                    Disease
      Smoke       Yes      No
      Yes          13      37
      No            6     144

      Table B (row percentages)
                    Disease
      Smoke       Yes      No     Total
      Yes         26%     74%     100%
      No           4%     96%     100%

V. CHI-SQUARED TEST
   A. Commonly used procedure; uses contingency tables
   B. Used to evaluate unpaired samples (unrelated groups)
   C. Often used to evaluate proportions
   D. Example: Is there a difference in the proportion of viral infections in patients administered a vaccine? (12/100 vs. 2/100)
   E. Assumes nominal data (no ordering between variable groups)
   F. Limited when the number of subjects in any "cell" is low (rule of thumb, <5)
   G. General logic
      1. Given two groups (vaccine vs. control), the EXPECTED infection rate if the vaccine has no effect would be equal in the two groups. This is the null hypothesis. The chi-squared test compares the EXPECTED frequency of a particular event to the OBSERVED frequency in the population of interest.
   H. Formulas

      $\chi^2 = \sum \frac{(O - E)^2}{E}$, with $df = (r - 1)(c - 1)$

      Expected frequencies (E) for each cell:

      $E_{ij} = \frac{T_i \times T_j}{N}$

      where $T_i$ is the total for row $i$, $T_j$ is the total for column $j$, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.
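As a rough illustration of these formulas, the Python sketch below applies them to the vaccine question in item D (12/100 vs. 2/100 infections). The outline does not say which group had 12 infections, so the row order, the table layout, and the function name `chi_squared` are assumptions made only for this example. The 3.84 cutoff is the 0.05 critical value for 1 degree of freedom.

```python
def chi_squared(observed):
    """Return (chi-squared statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = (total of row i * total of column j) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected


# Assumed layout: one row per group; columns are infected, not infected.
vaccine_table = [[12, 88], [2, 98]]
stat, df, expected = chi_squared(vaccine_table)

print(f"expected counts: {expected}")
print(f"chi-squared = {stat:.2f}, df = {df}")
print("significant at the 0.05 level:", stat > 3.84)
```

Under the null hypothesis both groups are expected to have 7 infections, which is why a 12 vs. 2 split produces a large statistic.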
   I. Distribution

      [Figure: Chi-square distribution]
      Chi-squared, by strict definition, is not a true nonparametric test. It assumes a distribution that can be described by a single parameter, degrees of freedom.
   J. Chi-squared example problems (refer to Example Problems handout)

VI. FISHER'S EXACT TEST
   A. Alternative to chi-squared for 2 x 2 contingency tables
      1. Improves accuracy when expected frequencies are small (<5) or sample size is small (n=20)
      2. Calculates exact probabilities

            a         b       (a + b)
            c         d       (c + d)
         (a + c)   (b + d)       N

      $p(\text{outcome}) = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{N!\,a!\,b!\,c!\,d!}$
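The factorial formula for p(outcome) gives the probability of one particular 2 x 2 table. A minimal Python sketch of that calculation follows. The counts are hypothetical, chosen only to keep the arithmetic small, and the function names are illustrative. The two-sided p value is assumed here to be the sum over all tables with the same margins that are no more probable than the observed one, which is a common convention but is not spelled out in this outline.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Exact probability of one 2 x 2 table with fixed row and column totals."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def fisher_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed one."""
    p_observed = table_probability(a, b, c, d)
    row1, col1, n = a + b, a + c, a + b + c + d
    total = 0.0
    # x is the top-left cell; the fixed margins determine the other three cells.
    for x in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        if p <= p_observed * (1 + 1e-9):  # small tolerance for floating-point ties
            total += p
    return total


# Hypothetical counts, used only to illustrate the arithmetic.
p_table = table_probability(3, 1, 1, 3)
p_value = fisher_two_sided(3, 1, 1, 3)
print(f"p(table) = {p_table:.4f}, two-sided p = {p_value:.4f}")
```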
VII. MCNEMAR'S TEST OF SYMMETRY
   A. Chi-squared test requires samples to be independent of each other.
   B. McNemar's test is used when samples are related (similar to paired t test).
   C. Measures are often repeated on the same subjects.
   D. Example: Does drug X cause insomnia?
      1. Patients may be questioned about insomnia before and after starting the drug.
      2. The researcher asks the question, "Do more patients have insomnia since starting the drug?"
   E. Refer to Example Problems handout

VIII. KRUSKAL-WALLIS TEST
   A. Compares three or more independent samples
   B. Values of a variable are transformed to ranks.
      1. Tests that there is no shift in the center of the groups (that is, the centers do not differ)
   C. If there are only two groups, the procedure reduces to the Mann-Whitney test, the analogue of the unpaired t test.

IX. WILCOXON SIGNED-RANK TEST
   A. Nonparametric analogue of the paired t test
   B. Compares the rank values of variables pair by pair
      1. The sum of the ranks associated with positive differences and the sum associated with negative differences are computed.
      2. The test statistic is the lesser of the two sums of ranks.
   C. Refer to Example Problems handout
X. SPEARMAN RANK CORRELATION COEFFICIENT
   A. Nonparametric analogue of linear regression and the correlation coefficient (r)

      $r_s = 1 - \frac{6 \sum d^2}{n^3 - n}$

      d = difference of ranks at each point

   B. Example

      Height   Rank   Weight   Rank   |d|
      31       1      7.7      2      1
      32       2      8.3      3      1
      33       3      7.6      1      2
      34       4      9.1      4      0
      35       5.5    9.6      5      0.5
      35       5.5    9.9      6      0.5

      $r_s = 1 - \frac{6(1^2 + 1^2 + 2^2 + 0 + 0.5^2 + 0.5^2)}{6^3 - 6} = 0.81$
For statistical significance, look up critical values in a table or obtain them from a software package.
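The Spearman calculation for the height and weight data can be reproduced with a short Python sketch. The helper names `average_ranks` and `spearman_rs` are illustrative, not part of the handout. Tied values receive the mean of the ranks they span, which is how the two heights of 35 come to share rank 5.5; with ties present, the rank-difference formula is an approximation.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman coefficient from the rank-difference formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d_squared / (n ** 3 - n)


height = [31, 32, 33, 34, 35, 35]
weight = [7.7, 8.3, 7.6, 9.1, 9.6, 9.9]
rs = spearman_rs(height, weight)
print(f"r_s = {rs:.2f}")
```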
Example Problem 1: Association between tryptophan dietary supplements and eosinophilia-myalgia syndrome (EMS). A number of subjects from a particular area are evaluated; 80 patients with EMS were identified, along with 200 matched controls. Is there a statistically significant association between tryptophan use and EMS?

Unrelated groups, categorical (yes/no) data: chi-squared is appropriate.

Observed results:

              Tryptophan use
              Yes      No     Total
EMS            42      38       80
No EMS         34     166      200
Total          76     204      280
(42 of 76 patients taking tryptophan had EMS, compared to 38 of 204 not taking tryptophan)

Expected values if no association exists (null hypothesis):

              Tryptophan use
              Yes      No     Total
EMS           21.7    58.3      80
No EMS        54.3   145.7     200
Total         76     204       280
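The rate of EMS in the overall population, assuming no effect, would be 80/280 (28.6%), so the expected EMS counts are 0.286 x 76 = 21.7 and 0.286 x 204 = 58.3. The No EMS cells can then be calculated by subtracting from the column totals (ex: 76 - 21.7 = 54.3). Equivalently:

$E_{11} = \frac{76 \times 80}{280} = 21.7$

$E_{21} = \frac{204 \times 80}{280} = 58.3$

$E_{12} = \frac{76 \times 200}{280} = 54.3$

$E_{22} = \frac{204 \times 200}{280} = 145.7$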
To evaluate significance, one needs a mean and a measure of dispersion (ex.: standard deviation, standard error, variance, etc.). The chi-squared test is based on a Poisson distribution (where mean = variance); therefore, the chi-squared test assumes that the variance is equal to the expected mean value.

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Therefore, in this example:

$\chi^2 = \frac{(42 - 21.7)^2}{21.7} + \frac{(34 - 54.3)^2}{54.3} + \frac{(38 - 58.3)^2}{58.3} + \frac{(166 - 145.7)^2}{145.7} = 36.4$

Look up the result in a chi-squared table (a 2 x 2 contingency table has 1 degree of freedom). To be significant at the 0.05 level, $\chi^2$ must be > 3.84. Since 36.4 >> 3.84, the result is highly significant.
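The arithmetic in Example Problem 1 can be checked with a few lines of Python. This is only a sketch of the hand calculation; the variable names are illustrative, and the expected counts are kept unrounded, which is why the statistic comes out at 36.4 rather than the slightly larger value obtained from the rounded cells.

```python
# Observed counts: rows are EMS / no EMS, columns are tryptophan use yes / no.
observed = [[42, 38], [34, 166]]

row_totals = [sum(row) for row in observed]        # EMS, no EMS
col_totals = [sum(col) for col in zip(*observed)]  # tryptophan yes, no
n = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_0_05_DF1 = 3.8415  # 0.05 critical value for 1 degree of freedom

for label, row in zip(("EMS", "No EMS"), expected):
    print(label, [round(e, 1) for e in row])
print(f"chi-squared = {chi2:.1f}, df = {df}, "
      f"significant at 0.05: {chi2 > CRITICAL_0_05_DF1}")
```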
Critical Values for the Chi-Squared Distribution

                        Significance Level
df    0.10      0.05      0.025     0.01      0.005
1     2.7055    3.8415    5.0239    6.6349    7.8794
2     4.6052    5.9915    7.3778    9.2104    10.5965
3     6.2514    7.8147    9.3484    11.3449   12.8381
4     7.7794    9.4877    11.1433   13.2767   14.8602
5     9.2363    11.0705   12.8325   15.0863   16.7496
6     10.6446   12.5916   14.4494   16.8119   18.5475
7     12.017    14.0671   16.0128   18.4753   20.2777
8     13.3616   15.5073   17.5345   20.0902   21.9549
9     14.6837   16.919    19.0228   21.666    23.5893
10    15.9872   18.307    20.4832   23.2093   25.1881
11    17.275    19.6752   21.92     24.725    26.7569
12    18.5493   21.0261   23.3367   26.217    28.2997
13    19.8119   22.362    24.7356   27.6882   29.8193
14    21.0641   23.6848   26.1189   29.1412   31.3194
15    22.3071   24.9958   27.4884   30.578    32.8015
16    23.5418   26.2962   28.8453   31.9999   34.2671
17    24.769    27.5871   30.191    33.4087   35.7184
18    25.9894   28.8693   31.5264   34.8052   37.1564
19    27.2036   30.1435   32.8523   36.1908   38.5821
20    28.412    31.4104   34.1696   37.5663   39.9969
21    29.6151   32.6706   35.4789   38.9322   41.4009
22    30.8133   33.9245   36.7807   40.2894   42.7957
23    32.0069   35.1725   38.0756   41.6383   44.1814
24    33.1962   36.415    39.3641   42.9798   45.5584
25    34.3816   37.6525   40.6465   44.314    46.928
26    35.5632   38.8851   41.9231   45.6416   48.2898
27    36.7412   40.1133   43.1945   46.9628   49.645
28    37.9159   41.3372   44.4608   48.2782   50.9936
29    39.0875   42.5569   45.7223   49.5878   52.3355
30    40.256    43.773    46.9792   50.8922   53.6719
31    41.4217   44.9853   48.2319   52.1914   55.0025
32    42.5847   46.1942   49.4804   53.4857   56.328
33    43.7452   47.3999   50.7251   54.7754   57.6483
34    44.9032   48.6024   51.966    56.0609   58.9637
35    46.0588   49.8018   53.2033   57.342    60.2746
36    47.2122   50.9985   54.4373   58.6192   61.5811
37    48.3634   52.1923   55.668    59.8926   62.8832
38    49.5126   53.3835   56.8955   61.162    64.1812
39    50.6598   54.5722   58.1201   62.4281   65.4753
40    51.805    55.7585   59.3417   63.6908   66.766
41    52.9485   56.9424   60.5606   64.95     68.0526
42    54.0902   58.124    61.7767   66.2063   69.336
43    55.2302   59.3035   62.9903   67.4593   70.6157
44    56.3685   60.4809   64.2014   68.7096   71.8923
45    57.5053   61.6562   65.4101   69.9569   73.166
46    58.6405   62.8296   66.6165   71.2015   74.4367
47    59.7743   64.0011   67.8206   72.4432   75.7039
48    60.9066   65.1708   69.0226   73.6826   76.9689
49    62.0375   66.3387   70.2224   74.9194   78.2306
50    63.1671   67.5048   71.4202   76.1538   79.4898
Example Problem 2: A sociological study evaluated the characteristics of marriage by religion; 256 people were surveyed for religion and marital status. The results were as follows:

               Never   Married   Divorced   Separated   Total
Protestant       29        75         21           8     133
Catholic         16        21          6           3      46
Jewish            8        11          3           1      23
None             20        19         13           0      52
Other             0         1          0           1       2
Total            73       127         43          13     256
Is there a relationship between marital status and religion?

SYSTAT chi-squared output:

WARNING: More than one-fifth of fitted cells are sparse (frequency < 5).
Significance tests computed on this table are suspect.

Test statistic            Value        df      Prob
Pearson chi-squared      22.718    12.000     0.030

What happened??
Omitting sparse cells: Leave out 'other' and 'separated':

               Never   Married   Divorced   Total
Protestant       29        75         21     125
Catholic         16        21          6      43
Jewish            8        11          3      22
None             20        19         13      52
Total            73       126         43     242

Test statistic            Value        df      Prob
Pearson chi-squared      10.368     6.000     0.110

There is no statistically significant difference between the groups (p=0.11)
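The sparse-cell warning and both test results in Example Problem 2 can be reproduced with the Python sketch below. It is not the SYSTAT procedure itself, and the function names are illustrative. The tail-probability shortcut used here is valid only for even degrees of freedom, which happens to cover both tables (df = 12 and df = 6). Cells are counted as sparse when the expected frequency falls below 5.

```python
from math import exp, factorial


def chi_squared(observed):
    """Return (statistic, degrees of freedom, expected counts) for a table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, expected


def p_value_even_df(stat, df):
    """Upper-tail chi-squared probability; this closed form needs an even df."""
    half = stat / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


def sparse_fraction(expected, cutoff=5):
    """Fraction of cells whose expected frequency is below the cutoff."""
    cells = [e for row in expected for e in row]
    return sum(e < cutoff for e in cells) / len(cells)


# Rows: Protestant, Catholic, Jewish, None, Other.
# Columns: never married, married, divorced, separated.
full = [[29, 75, 21, 8],
        [16, 21, 6, 3],
        [8, 11, 3, 1],
        [20, 19, 13, 0],
        [0, 1, 0, 1]]
reduced = [row[:3] for row in full[:4]]  # drop 'Other' and 'Separated'

full_stat, full_df, full_expected = chi_squared(full)
full_p = p_value_even_df(full_stat, full_df)
red_stat, red_df, red_expected = chi_squared(reduced)
red_p = p_value_even_df(red_stat, red_df)

print(f"full:    chi2 = {full_stat:.3f}, df = {full_df}, p = {full_p:.3f}, "
      f"sparse cells = {sparse_fraction(full_expected):.0%}")
print(f"reduced: chi2 = {red_stat:.3f}, df = {red_df}, p = {red_p:.3f}, "
      f"sparse cells = {sparse_fraction(red_expected):.0%}")
```

In the full table most of the statistic comes from a handful of cells with very small expected counts, in particular the single "Other, Separated" respondent, so the apparent significance depends on almost no data.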
Example Problem 3: McNemar Test of Symmetry

In November of 1993, the U.S. Congress approved the North American Free Trade Agreement (NAFTA). Let's say that two months before the approval and before the televised debate between Vice President Al Gore and businessman Ross Perot, political pollsters queried a sample of 350 people, asking "Are you for, unsure, or against NAFTA?" Immediately after the debate, the pollsters contacted the same people and asked the question a second time. Here are the results:

BEFORE$ (rows) by AFTER$ (columns)

Counts
              for   unsure   against   Total
for            51       22        28     101
unsure         46       18        27      91
against        52       49        57     158
Total         149       89       112     350

Percents of total count
              for      unsure    against     Total       N
for         14.571      6.286      8.000    28.857     101
unsure      13.143      5.143      7.714    26.000      91
against     14.857     14.000     16.286    45.143     158
Total       42.571     25.429     32.000   100.000
N              149         89        112                350

Test statistic                   Value       df      Prob
Pearson chi-squared             11.473    4.000     0.022
McNemar Symmetry chi-squared    22.039    3.000     0.000
The McNemar test of symmetry focuses on the counts in the off-diagonal cells (those along the diagonal are not used in the computations). We are investigating the direction of change in opinion. First, how many respondents became more negative about NAFTA? Among those who initially responded For, 22 (6.29%) are now Unsure and 28 (8%) are now Against. Among those who were Unsure before the debate, 27 (7.71%) answered Against afterwards. The three cells in the upper right contain counts for those who became more unfavorable and comprise 22% (6.29 + 8.00 + 7.71) of the sample. The three cells in the lower left contain counts for people who became more positive about NAFTA (46, 52, and 49), or 42% of the sample. The null hypothesis for the McNemar test is that the changes in opinion are equal in the two directions. The chi-squared statistic for this test is 22.039 with 3 df and p<0.0005. You reject the null hypothesis. The pro-NAFTA shift in opinion is significantly greater than the anti-NAFTA shift.
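The symmetry statistic can be rebuilt from the off-diagonal counts alone, as the Python sketch below shows. The handout does not give the formula; the sketch assumes the usual symmetry statistic, in which each pair of mirror-image cells contributes (n_ij - n_ji)^2 / (n_ij + n_ji). That assumption is made here because it reproduces the reported value of 22.039 with 3 df. The comparison value 12.8381 is the df = 3 entry for the 0.005 significance level.

```python
# Counts: rows are the answer BEFORE the debate, columns the answer AFTER,
# both ordered for, unsure, against.
counts = [[51, 22, 28],
          [46, 18, 27],
          [52, 49, 57]]

k = len(counts)
statistic = 0.0
for i in range(k):
    for j in range(i + 1, k):
        upper, lower = counts[i][j], counts[j][i]
        # Each mirror-image pair of off-diagonal cells contributes one term.
        # (A pair with upper + lower == 0 would need to be skipped; none here.)
        statistic += (upper - lower) ** 2 / (upper + lower)
df = k * (k - 1) // 2  # one degree of freedom per off-diagonal pair

n = sum(map(sum, counts))
more_negative = sum(counts[i][j] for i in range(k) for j in range(i + 1, k))
more_positive = sum(counts[j][i] for i in range(k) for j in range(i + 1, k))

CRITICAL_0_005_DF3 = 12.8381  # 0.005 critical value for 3 degrees of freedom

print(f"became more negative: {more_negative} ({more_negative / n:.0%})")
print(f"became more positive: {more_positive} ({more_positive / n:.0%})")
print(f"symmetry chi-squared = {statistic:.3f}, df = {df}, "
      f"exceeds 0.005 critical value: {statistic > CRITICAL_0_005_DF3}")
```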
Example Problem 4: Wilcoxon Signed-Rank Test

Evaluate the effect of a diuretic in healthy volunteers:

               Daily UOP
Subject   No drug   + Drug   Difference   Rank of difference   Signed rank of difference
1           1600     1490       -110              5                      -5
2           1850     1300       -550              6                      -6
3           1300     1400       +100              4                      +4
4           1500     1410        -90              3                      -3
5           1400     1350        -50              2                      -2
6           1010     1000        -10              1                      -1
W = sum of signed ranks = -13

If the drug has no effect, the ranks associated with a positive change should be similar to the ranks associated with a negative change; hence, the sum (W) should be close to 0. How large must the magnitude of W be to call this a statistically significant difference? Refer to the Critical Values table:

N     Critical value     P
5          15          .062
6          21          .032
           19          .062
7          28          .016
           24          .046
8          32          .024
           28          .054
9          39          .020
           33          .054
10         45          .020
           39          .048
11         52          .018
           44          .054
12         58          .020
           50          .052
13         65          .022
           57          .048
14         73          .020
           63          .050
15         80          .022
           70          .048

*Due to the discrete nature of the possible values of W, p values at traditional breakpoints (ex.: p=0.05) are usually not attainable.
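The signed-rank calculation for the diuretic data can be sketched in Python as follows. The variable names are illustrative, and the simple ranking step assumes there are no tied or zero differences, which holds for these six subjects. Reading the table as "significant when the magnitude of W reaches the critical value" is also an assumption about how the table is meant to be used.

```python
no_drug = [1600, 1850, 1300, 1500, 1400, 1010]
with_drug = [1490, 1300, 1400, 1410, 1350, 1000]

differences = [after - before for before, after in zip(no_drug, with_drug)]

# Rank the absolute differences from smallest (rank 1) to largest.
by_size = sorted(range(len(differences)), key=lambda i: abs(differences[i]))
ranks = [0] * len(differences)
for rank, index in enumerate(by_size, start=1):
    ranks[index] = rank

# Attach the sign of each difference to its rank, then sum.
signed_ranks = [r if d > 0 else -r for r, d in zip(ranks, differences)]
w = sum(signed_ranks)

# Critical values for N = 6 from the table: 21 (P = .032) and 19 (P = .062).
significant = abs(w) >= 21

print("differences: ", differences)
print("signed ranks:", signed_ranks)
print(f"W = {w}, significant: {significant}")
```

With N = 6, a magnitude of 13 is below both tabulated critical values, so under that reading these six subjects do not show a statistically significant effect of the drug.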