diff --git a/hypothesis-testing/HypothesisTesting.pptx b/hypothesis-testing/HypothesisTesting.pptx new file mode 100644 index 0000000..5a87403 Binary files /dev/null and b/hypothesis-testing/HypothesisTesting.pptx differ diff --git a/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.Rmd b/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.Rmd new file mode 100644 index 0000000..c6ab378 --- /dev/null +++ b/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.Rmd @@ -0,0 +1,354 @@ +--- +title: "Hypothesis Testing" +author: "Reuben Thomas" +date: "9/25/2019" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` +##Load the data +An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chicken +```{r chickwts} +##read help page +?chickwts +##load the data.frame +chickwts <- data.frame(chickwts) +##Numerical summary of the chickwts data +summary(chickwts) + + +``` + +Let us look at the data a bit more closely +```{r} +##look at the first few rows +head(chickwts) + +##let us look at the kind of variables in the chickwts data.frame - a numerical variable, weight and a factor variable, feed with 6 levels +str(chickwts) + +``` + +After looking numerically at the data, let us look at it visually. We will plot the relationship between the weight of the chickens and the diet/feed they were on + +##Visualize the data +```{r} +## load the library to be used for plotting +suppressMessages(library(ggplot2)) +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() +``` + +##One-sided, one sample t-test +Let us now say that we are interested in the linseed feed. And we are interested in a feed that keeps the mean chick weight below 200 units. Does the linseed feed do this? + +We would like to make any resulting claims generalizable to the entire chick population. This despite, we have only chosen (randomly and independently) 12 chicks to be fed with linseed. *All statistical tests* have underlying them something called a _Test statistic_, a number that typically capture what we are interested in testing. In our case, we are interested in the mean chick weight fed with linseed. We will end up the mean weight scale by the observed standard deviation of weights. _All Test statistics_ have something called a _sampling distribution_ - the reflects the distribution of the observed statistic over repeated experiments like we just performed. We have just performed one experiment now, sampled 12 chicks and fed them with linseed. The exact mathematical distribution is defined under certain assumptions - this _assumptions_ are important. For our question, a _t-statistic_ has been shown to be a good choice. The sampling distribution of the _t-statistic_ is called a _t distribution_. + +Our _null (uninteresting, skeptical) hypothesis_ is that the mean chick weight after being fed with linseed is less than 200 units. The _alternative_ (when the mean weight is greater than 200) is interesting for us. + +Therfore, we will use a one-sample, one-sided t-test to answer this question. +```{r} +##load the library to filter the data +suppressMessages(library(dplyr)) +##First we need to get the weights of the chicks fed linseed +LinSeedWeights <- filter(chickwts, feed =="linseed")$weight + +##let us again visualize this +boxplot(LinSeedWeights) +``` + +Now, we will run the one-sample, one-sided t-test. Note we are using the typically used two-sided test. +```{r} +t.test(LinSeedWeights, mu=200, alternative = "greater") +``` +Based on the p-value and a typical statistical threshold of 0.05, we are unable to reject the hypothesis linseed keeps the mean chick weights below 200 units + + +##Two-sided, two sample t-test +Now, let us move on and say we are interested in testing the differences between linseed and soybean on chick weight. As before, we will create a subset of the original data with the feeds of interest. + +```{r} +SubChickWts <- filter(chickwts, feed=="linseed" | feed=="soybean") + +##let us check the variables in SubChickWts +str(SubChickWts) + +##notice feed still appears to have 6 levels +##drop unused factor levels of feed +SubChickWts$feed <- droplevels(SubChickWts$feed) + +##now check again +str(SubChickWts) + +##let us plot this again +ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot() +``` + +The distribution of chick weights fed soybean appears to have slightly higher than those fed with linseed. We will compute the mean weights of the checks fed with each kind of feed. +```{r} +##mean of wight with linseed feed +print(mean(SubChickWts$weight[SubChickWts$feed == "linseed"])) +##mean of wight with soybean feed +print(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) +``` + +Can we more formally test this? + +Specifically, can we test how likely it is to observe the given difference in the mean weights of chicks fed the two diets when in fact they do not differ. Given the observed the variance of weights across the chicks on a given feed, may be it is quite likely that we would observe this big of a difference if we were to repeat this experiment over and over again. The two-sample t-test gives us a formal way to test for this difference. The _Test statistic_ is again the _t-statistic_ and the sampling distribution is the _t distribution_ under certain assumptions (see below). + + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE) +``` +The above estimated p-value suggests that we cannot the reject the hypothesis that the mean weights of the two feeds are the same. + +All statistical hypothesis tests come with a bunch lot of assumptions. The version of the t-test we used above assumed that the variances of the chick weights is the same for the two feeds. The t-test has a different version if the variances of the weights are unlikely to be equal called the Welch t-test. + +We will test equality of variances using the Brown-Forsythe test for equality of variances. +```{r} +##First we need to load the required library with this test +suppressMessages(library(onewaytests)) +bf.test(weight ~ feed, SubChickWts) +``` + +You will see that the p-value from this test is not significant, so we can assume variances are equal. Otherwise we would need to run, + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=F) +``` +The t-test also requires the assumption of normality. This is not essential. It has been shown to be quite robust to deviations from normality. In any case, we will test for normality using the Shapiro-Wilk test. +```{r} +shapiro.test(SubChickWts$weight[SubChickWts$feed == "linseed"]) +shapiro.test(SubChickWts$weight[SubChickWts$feed == "soybean"]) +``` +In both the cases, we are unable to reject the hypothesis that the data is normal. + +If however, we ended up rejecting the normality assumption, it may be better to use a non-paramteric test that does not require the normality assumption + +```{r} +##Wilcox test is a non-parametric test that can be used as an alternative to the t-test. Wilcox test does not need the assumption of normality +wilcox.test(weight ~ feed, data=SubChickWts) +``` + + +##Statistical power estimates or Sample Size calculations +We were unable to reject the hypothesis that linseed and soybean feed kept the mean weights of the chicks the same. Let us visualize these data again, +```{r} +ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot() +``` +It however appears that the soybean feed does increase the mean weight over the linseed feed. +If the increase is true then we need more samples to conclude that the soybean feed does increase the mean chick weight in a statistically significant manner. + + +How many more samples would we need? + +To do that we can perform something called a statistical power analyses. We will use a library in R called _pwr_ that will help us with these analyses. Before doing any power analyses, you need to know a bunch of things. + +1. The sampling distribution of the test statistic (t-statistic) + +2. The effect size that you want to have the statistical power to estimate + +3. At what Type I error will you be making claims of statistical significance. This is a number between 0 and 1 (typically 0.05) and represents the fraction of times (when you repeat the same experiment over and over again) when you will claim significance when in fact your null hypothesis is true (there is no differernce in the mean weights). + +4. What is the desired statistical power? This is a number between 0 and 1 and represents the fraction of times (when you repeat the same experiment over and over) you want to claim significance at the chosen Type I error. + +```{r} +suppressMessages(library(pwr)) +##Let us first compute the increase in mean chick weight from linseed to soybean feed +(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])) +##We will now define the effect size,d as the above difference scaled by the standard deviation of the chick weights +d <- ((mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])))/sd(SubChickWts$weight) +##load the library +##Our test statistic is the t-statistic and the relevant function is pwr.t.test +pwr.t.test(d=d, power = 0.8, sig.level = 0.05) +``` +The results says we need to have at least 60 chicks in each feed group to have a 80% chance to making a claim of differences in mean chick weights from linseed to soybean with a significance of 0.05 + + +##One-way ANOVA +We will now go back to looking at the distribution of the weights of chicks fed all the diet and not just the above two one. Our null hypothesis is that the mean chick weights is same for all the 6 feeds. Let us visualize the data again, + +```{r} +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() +``` + +The appropriate test statistic to use here is called the F-statistic, its sampling distribution is called the F-distribution. While the t-distribution captures the sampling distribution of the scaled sample mean or scaled difference of sample means, the F-distribution captures the proportion of variance between all observations within a feed group due to variance in the mean chick weights between feed groups, i.e., + +$$F = \frac{between\ feed\ group\ weight\ variance}{within\ feed\ group\ weight\ variance}$$ +So, intuitively when the mean chick weights are not different between the different feed groups, the variance between these mean weights should be similar to variances of weights within a feed group. That is, under the null hypothesis F will hover around 1. Note, when we say, "within a feed group", we don't specify which particular feed group. This should suggest to you the requirement of the assumption that within feed groups variances are same across all groups. + +We will now run the ANOVA analyses as follows: +```{r} +AmodelFit <- aov(weight ~ feed, data=chickwts) +summary(AmodelFit) +``` + +The significance above suggests that there are feeds resulting in differing mean chick weights. +We don't get information on which pairs are really different from each other. To get this information, we will perform multiple pairwise tests using Tukey's posthoc tests. + +###Multiple testing +```{r} +TukeyHSD(AmodelFit,ordered = TRUE) +##we can visualize the confidence intervals of the differences in mean chick weights +plot(TukeyHSD(AmodelFit,ordered = TRUE)) +``` + + Not the adjusted p-value for the soybean-linseed comparison is different (0.793 vs 0.199) from what we obtained using the two-sample, two-sided t-test. The resulting confidence interval of this difference is also wider. + + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE) +``` +Why did we have to correct for multiple testing, the fact we are interested in 15 comparisons? One way to try and justify it for yourself is to run through the following thought experiment. + +1. Suppose we set the significance threshold, Type I error at 0.05 as before. + +2. We will assume that of the 15 comparisons there are 8 comparisons that indicate actual difference in mean weights. + +3. We will also assume that the statistical power to detect true differences is 0.8. + +4. Then of the 8 true differences in feeds, we would expect to detect $8\times0.8\approx6$ of these findings. + +5. Of the 15 comparisons we would expect to see $15\times0.05\approx 1$ false-positive. + +6. Therefore the expected false discovery rate is $\frac{1}{1+6}=\frac{1}{7}=14\%$, higher than what we would have typically expected. + +7. This situation is further exacerbated if the number of true differences in the 15 comparisons is smaller. For example, if the number of true differences is 4. Then the expected number of these detected would be $4\times0.8\approx3$. The false discovery rate would then be $\frac{1}{1+3}=\frac{1}{4}=25\%$. Again, something which is generally considered unacceptable. + +8. Intuitively in this situation, one way to reduce the false discovery rate is to reduce Type I error, make it smaller than 0.05. This is what some of mulitple testing methods are implicitly doing. + +The ANOVA tests have two important assumptions that we will need to check. +1. Normality of the responses +2. Equality of variances of chick weights in each of the feed groups. + +We will now check these assumptions + +```{r} +##NORMALITY +shapiro.test(chickwts$weight[chickwts$feed == "casein"]) +shapiro.test(chickwts$weight[chickwts$feed == "horsebean"]) +shapiro.test(chickwts$weight[chickwts$feed == "linseed"]) +shapiro.test(chickwts$weight[chickwts$feed == "meatmeal"]) +shapiro.test(chickwts$weight[chickwts$feed == "soybean"]) +shapiro.test(chickwts$weight[chickwts$feed == "sunflower"]) +``` + +The normality assumptions does not seem to be violated. Now, we will test for equality of variances, +```{r} +##EQUALITY OF VARIANCES +bf.test(weight ~ feed, chickwts) +``` +The variances are significantly different from zero. So one of the assumptions of one-way ANOVA is violated. So we need to do something different. Some of you may think we would need to take recourse to a non-parametric test. The Kruskall-Wallis test is the non-parametric version of one-way ANOVA. + +```{r} +kruskal.test(weight ~ feed, chickwts) +``` + + +Unfortunately, we cannot use this test because it requires the assumption that the variances of chick weights to be same for the different feeds - this assumption is violated. So the next alternative would be to perform a bunch of pair-wise Welch t-tests + +```{r} +##We will also store the resulting p-values from the resulting pair-wise tests. Just for this workshop we compare the mean chick weights of all other feeds to the casein feed estimate +pValueFeedComparisons <- vector(mode = "numeric") +pValueFeedComparisons[1] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="horsebean"])$p.value +pValueFeedComparisons[2] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="linseed"])$p.value +pValueFeedComparisons[3] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="meatmeal"])$p.value +pValueFeedComparisons[4] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="soybean"])$p.value +pValueFeedComparisons[5] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="sunflower"])$p.value +names(pValueFeedComparisons) <- c("horsebean", "linseed", "meatmeal", "soybean", "sunflower") + +print(pValueFeedComparisons) +``` + + +We have performed several tests and would need correct for multiple tests. We have seen that the Tukey post hoc tests provide one way for correcting for multiple testing. There are multiple ways for correcting for multiple testing. Each has a set of assumptions and also a particular criteria it is controlling for. Here we will use two ways of correcting for multiple testing: the _Holm-Sidak_ method and the _Benjamini-Hochberg_ method. The _Holm-Sidak_ method controls for something called a family-wise error rate (similar to what the _Tukey_ method was controlling for) , or the probability on repeating these experiments over and over that you will observe at least one false positive. The _BH_ or _Benjamini-Hochberg_ method controls for False-Discovery Rate, or the expected (over repeated experiments) fraction of false positives among the rejected hypothesis. The _Holm-Sidak_ method is more conservative while the _BH_ method has more tolerance for false-positives + +```{r} +##load the require library +suppressMessages(library(multtest)) +MCor <- mt.rawp2adjp(pValueFeedComparisons, proc = c("SidakSD", "BH")) +MCorResults <- cbind(pValueFeedComparisons[MCor$index], MCor$adjp) +head(MCorResults) +``` +You will the results in the SidakSD and BH columns as adjusted p-values + +##Linear modeling +Linear modeling provides a really flexible approach to statistical hypothesis testing. Simple Linear Regression is a special case of linear models. In fact, the above one-way ANOVA tests could as well be performed using linear models. Let us start with an example for Simple Linear Regression. + +```{r} +?cars +summary(cars) +ggplot(cars, aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm", se=F) +``` +Now, we will estimate the slope and intercept of the best fit line to these data. + +```{r} +lmFit <- lm(dist ~ speed, cars) +summary(lmFit) +``` + +There are always assumptions to check for. We will visually attempt to test the assumptions of the linear regression fit. The assumptions tested visually (in order) are: + +1. Suitability of a linear (as opposed to some non-linear) model for these data. + +2. Normality of residuals (differences between observations and their predictions using the linear model). + +3. Homogenity of variances across the fitted/predicted values of distance + +4. Influence of outliers on slope estimates + +```{r} +plot(lmFit) +``` + + +###One-way ANOVA +We will now perform a linear model version of the one-way ANOVA test we ran above, + +```{r} +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() +lmFit <- lm(weight ~ feed, chickwts) +summary(lmFit) +``` + + +###Two-way ANOVA +We will use linear models to test hypothesis in a situation where we are assessing the association of a continuous response variable with two categorical/factor variables. + +```{r} +?ToothGrowth +##numerically summarize look at the Toothgrowth data set +summary(ToothGrowth) +str(ToothGrowth) +##we change dose to a factor variable +ToothGrowth$dose <- as.factor(ToothGrowth$dose) +str(ToothGrowth) +``` + + +Let us visualize the data now, + +```{r} + +ggplot(ToothGrowth, aes(x=dose, y=len, color=supp)) + geom_boxplot() +``` + +We will formulate a linear model to estimate the effects of _dose_ and _supp_. + +```{r} +lmFit <- lm(len ~ dose + supp + dose:supp, ToothGrowth) +summary(lmFit) + +``` +We will go over the interpretation of the estimates from this linear model fit. Note, we have estimated both the main effects and also interaction effects. + + +##Clustered data +###Paired t-test + +##2x2 tables + + + + diff --git a/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.html b/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.html new file mode 100644 index 0000000..639cbb8 --- /dev/null +++ b/hypothesis-testing/HypothesisTesting_GladstoneBIonformaticsCore_September2019.html @@ -0,0 +1,734 @@ + + + + + + + + + + + + + + +Hypothesis Testing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +
+

Load the data

+

An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chicken

+
##read help page
+?chickwts
+##load the data.frame
+chickwts <- data.frame(chickwts)
+##Numerical summary of the chickwts data
+summary(chickwts)
+
##      weight             feed   
+##  Min.   :108.0   casein   :12  
+##  1st Qu.:204.5   horsebean:10  
+##  Median :258.0   linseed  :12  
+##  Mean   :261.3   meatmeal :11  
+##  3rd Qu.:323.5   soybean  :14  
+##  Max.   :423.0   sunflower:12
+

Let us look at the data a bit more closely

+
##look at the first few rows
+head(chickwts)
+
##   weight      feed
+## 1    179 horsebean
+## 2    160 horsebean
+## 3    136 horsebean
+## 4    227 horsebean
+## 5    217 horsebean
+## 6    168 horsebean
+
##let us look at the kind of variables in the chickwts data.frame - a numerical variable, weight and a factor variable, feed with 6 levels
+str(chickwts)
+
## 'data.frame':    71 obs. of  2 variables:
+##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
+##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
+

After looking numerically at the data, let us look at it visually. We will plot the relationship between the weight of the chickens and the diet/feed they were on

+
+
+

Visualize the data

+
## load the library to be used for plotting
+suppressMessages(library(ggplot2))
+
## Warning: package 'ggplot2' was built under R version 3.5.2
+
ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot()
+

+
+
+

One-sided, one sample t-test

+

Let us now say that we are interested in the linseed feed. And we are interested in a feed that keeps the mean chick weight below 200 units. Does the linseed feed do this?

+

We would like to make any resulting claims generalizable to the entire chick population. This despite, we have only chosen (randomly and independently) 12 chicks to be fed with linseed. All statistical tests have underlying them something called a Test statistic, a number that typically capture what we are interested in testing. In our case, we are interested in the mean chick weight fed with linseed. We will end up the mean weight scale by the observed standard deviation of weights. All Test statistics have something called a sampling distribution - the reflects the distribution of the observed statistic over repeated experiments like we just performed. We have just performed one experiment now, sampled 12 chicks and fed them with linseed. The exact mathematical distribution is defined under certain assumptions - this assumptions are important. For our question, a t-statistic has been shown to be a good choice. The sampling distribution of the t-statistic is called a t distribution.

+

Our null (uninteresting, skeptical) hypothesis is that the mean chick weight after being fed with linseed is less than 200 units. The alternative (when the mean weight is greater than 200) is interesting for us.

+

Therfore, we will use a one-sample, one-sided t-test to answer this question.

+
##load the library to filter the data
+suppressMessages(library(dplyr))
+
## Warning: package 'dplyr' was built under R version 3.5.2
+
##First we need to get the weights of the chicks fed linseed
+LinSeedWeights <- filter(chickwts, feed =="linseed")$weight
+
+##let us again visualize this
+boxplot(LinSeedWeights)
+

+

Now, we will run the one-sample, one-sided t-test. Note we are using the typically used two-sided test.

+
t.test(LinSeedWeights, mu=200, alternative = "greater")
+
## 
+##  One Sample t-test
+## 
+## data:  LinSeedWeights
+## t = 1.2434, df = 11, p-value = 0.1198
+## alternative hypothesis: true mean is greater than 200
+## 95 percent confidence interval:
+##  191.6696      Inf
+## sample estimates:
+## mean of x 
+##    218.75
+

Based on the p-value and a typical statistical threshold of 0.05, we are unable to reject the hypothesis linseed keeps the mean chick weights below 200 units

+
+
+

Two-sided, two sample t-test

+

Now, let us move on and say we are interested in testing the differences between linseed and soybean on chick weight. As before, we will create a subset of the original data with the feeds of interest.

+
SubChickWts <- filter(chickwts, feed=="linseed" | feed=="soybean")
+
+##let us check the variables in SubChickWts
+str(SubChickWts)
+
## 'data.frame':    26 obs. of  2 variables:
+##  $ weight: num  309 229 181 141 260 203 148 169 213 257 ...
+##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 3 3 3 3 3 3 3 3 3 3 ...
+
##notice feed still appears to have 6 levels
+##drop unused factor levels of feed
+SubChickWts$feed <- droplevels(SubChickWts$feed)
+
+##now check again
+str(SubChickWts)
+
## 'data.frame':    26 obs. of  2 variables:
+##  $ weight: num  309 229 181 141 260 203 148 169 213 257 ...
+##  $ feed  : Factor w/ 2 levels "linseed","soybean": 1 1 1 1 1 1 1 1 1 1 ...
+
##let us plot this again
+ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot()
+

+

The distribution of chick weights fed soybean appears to have slightly higher than those fed with linseed. We will compute the mean weights of the checks fed with each kind of feed.

+
##mean of wight with linseed feed
+print(mean(SubChickWts$weight[SubChickWts$feed == "linseed"]))
+
## [1] 218.75
+
##mean of wight with soybean feed
+print(mean(SubChickWts$weight[SubChickWts$feed == "soybean"]))
+
## [1] 246.4286
+

Can we more formally test this?

+

Specifically, can we test how likely it is to observe the given difference in the mean weights of chicks fed the two diets when in fact they do not differ. Given the observed the variance of weights across the chicks on a given feed, may be it is quite likely that we would observe this big of a difference if we were to repeat this experiment over and over again. The two-sample t-test gives us a formal way to test for this difference. The Test statistic is again the t-statistic and the sampling distribution is the t distribution under certain assumptions (see below).

+
t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE)
+
## 
+##  Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3208, df = 24, p-value = 0.199
+## alternative hypothesis: true difference in means is not equal to 0
+## 95 percent confidence interval:
+##  -70.92996  15.57282
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

The above estimated p-value suggests that we cannot the reject the hypothesis that the mean weights of the two feeds are the same.

+

All statistical hypothesis tests come with a bunch lot of assumptions. The version of the t-test we used above assumed that the variances of the chick weights is the same for the two feeds. The t-test has a different version if the variances of the weights are unlikely to be equal called the Welch t-test.

+

We will test equality of variances using the Brown-Forsythe test for equality of variances.

+
##First we need to load the required library with this test
+suppressMessages(library(onewaytests))
+
## Warning: package 'onewaytests' was built under R version 3.5.2
+
bf.test(weight ~ feed, SubChickWts)
+
## 
+##   Brown-Forsythe Test 
+## --------------------------------------------------------- 
+##   data : weight and feed 
+## 
+##   statistic  : 1.754449 
+##   num df     : 1 
+##   denom df   : 23.62952 
+##   p.value    : 0.1979871 
+## 
+##   Result     : Difference is not statistically significant. 
+## ---------------------------------------------------------
+

You will see that the p-value from this test is not significant, so we can assume variances are equal. Otherwise we would need to run,

+
t.test(weight ~ feed, data=SubChickWts, var.equal=F)
+
## 
+##  Welch Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3246, df = 23.63, p-value = 0.198
+## alternative hypothesis: true difference in means is not equal to 0
+## 95 percent confidence interval:
+##  -70.84262  15.48547
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

The t-test also requires the assumption of normality. This is not essential. It has been shown to be quite robust to deviations from normality. In any case, we will test for normality using the Shapiro-Wilk test.

+
shapiro.test(SubChickWts$weight[SubChickWts$feed == "linseed"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  SubChickWts$weight[SubChickWts$feed == "linseed"]
+## W = 0.96931, p-value = 0.9035
+
shapiro.test(SubChickWts$weight[SubChickWts$feed == "soybean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  SubChickWts$weight[SubChickWts$feed == "soybean"]
+## W = 0.9464, p-value = 0.5064
+

In both the cases, we are unable to reject the hypothesis that the data is normal.

+

If however, we ended up rejecting the normality assumption, it may be better to use a non-paramteric test that does not require the normality assumption

+
##Wilcox test is a non-parametric test that can be used as an alternative to the t-test. Wilcox test does not need the assumption of normality
+wilcox.test(weight ~ feed, data=SubChickWts)
+
## Warning in wilcox.test.default(x = c(309, 229, 181, 141, 260, 203, 148, :
+## cannot compute exact p-value with ties
+
## 
+##  Wilcoxon rank sum test with continuity correction
+## 
+## data:  weight by feed
+## W = 60.5, p-value = 0.2367
+## alternative hypothesis: true location shift is not equal to 0
+
+
+

Statistical power estimates or Sample Size calculations

+

We were unable to reject the hypothesis that linseed and soybean feed kept the mean weights of the chicks the same. Let us visualize these data again,

+
ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot()
+

It however appears that the soybean feed does increase the mean weight over the linseed feed. If the increase is true then we need more samples to conclude that the soybean feed does increase the mean chick weight in a statistically significant manner.

+

How many more samples would we need?

+

To do that we can perform something called a statistical power analyses. We will use a library in R called pwr that will help us with these analyses. Before doing any power analyses, you need to know a bunch of things.

+
    +
  1. The sampling distribution of the test statistic (t-statistic)

  2. +
  3. The effect size that you want to have the statistical power to estimate

  4. +
  5. At what Type I error will you be making claims of statistical significance. This is a number between 0 and 1 (typically 0.05) and represents the fraction of times (when you repeat the same experiment over and over again) when you will claim significance when in fact your null hypothesis is true (there is no differernce in the mean weights).

  6. +
  7. What is the desired statistical power? This is a number between 0 and 1 and represents the fraction of times (when you repeat the same experiment over and over) you want to claim significance at the chosen Type I error.

  8. +
+
suppressMessages(library(pwr))
+##Let us first compute the increase in mean chick weight from linseed to soybean feed
+(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"]))
+
## [1] 27.67857
+
##We will now define the effect size,d as the above difference scaled by the standard deviation of the chick weights
+d <- ((mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])))/sd(SubChickWts$weight)
+##load the library
+##Our test statistic is the t-statistic and the relevant function is pwr.t.test
+pwr.t.test(d=d, power = 0.8, sig.level = 0.05)
+
## 
+##      Two-sample t test power calculation 
+## 
+##               n = 60.85139
+##               d = 0.512026
+##       sig.level = 0.05
+##           power = 0.8
+##     alternative = two.sided
+## 
+## NOTE: n is number in *each* group
+

The results says we need to have at least 60 chicks in each feed group to have a 80% chance to making a claim of differences in mean chick weights from linseed to soybean with a significance of 0.05

+
+
+

One-way ANOVA

+

We will now go back to looking at the distribution of the weights of chicks fed all the diet and not just the above two one. Our null hypothesis is that the mean chick weights is same for all the 6 feeds. Let us visualize the data again,

+
ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot()
+

+

The appropriate test statistic to use here is called the F-statistic, its sampling distribution is called the F-distribution. While the t-distribution captures the sampling distribution of the scaled sample mean or scaled difference of sample means, the F-distribution captures the proportion of variance between all observations within a feed group due to variance in the mean chick weights between feed groups, i.e.,

+

\[F = \frac{between\ feed\ group\ weight\ variance}{within\ feed\ group\ weight\ variance}\] So, intuitively when the mean chick weights are not different between the different feed groups, the variance between these mean weights should be similar to variances of weights within a feed group. That is, under the null hypothesis F will hover around 1. Note, when we say, “within a feed group”, we don’t specify which particular feed group. This should suggest to you the requirement of the assumption that within feed groups variances are same across all groups.

+

We will now run the ANOVA analyses as follows:

+
AmodelFit <- aov(weight ~ feed, data=chickwts) 
+summary(AmodelFit)
+
##             Df Sum Sq Mean Sq F value   Pr(>F)    
+## feed         5 231129   46226   15.37 5.94e-10 ***
+## Residuals   65 195556    3009                     
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+

The significance above suggests that there are feeds resulting in differing mean chick weights. We don’t get information on which pairs are really different from each other. To get this information, we will perform multiple pairwise tests using Tukey’s posthoc tests.

+
+

Multiple testing

+
TukeyHSD(AmodelFit,ordered = TRUE)
+
##   Tukey multiple comparisons of means
+##     95% family-wise confidence level
+##     factor levels have been ordered
+## 
+## Fit: aov(formula = weight ~ feed, data = chickwts)
+## 
+## $feed
+##                           diff        lwr       upr     p adj
+## linseed-horsebean    58.550000 -10.413543 127.51354 0.1413329
+## soybean-horsebean    86.228571  19.541684 152.91546 0.0042167
+## meatmeal-horsebean  116.709091  46.335105 187.08308 0.0001062
+## casein-horsebean    163.383333  94.419790 232.34688 0.0000000
+## sunflower-horsebean 168.716667  99.753124 237.68021 0.0000000
+## soybean-linseed      27.678571 -35.683721  91.04086 0.7932853
+## meatmeal-linseed     58.159091  -9.072873 125.39106 0.1276965
+## casein-linseed      104.833333  39.079175 170.58749 0.0002100
+## sunflower-linseed   110.166667  44.412509 175.92082 0.0000884
+## meatmeal-soybean     30.480519 -34.414070  95.37511 0.7391356
+## casein-soybean       77.154762  13.792470 140.51705 0.0083653
+## sunflower-soybean    82.488095  19.125803 145.85039 0.0038845
+## casein-meatmeal      46.674242 -20.557722 113.90621 0.3324584
+## sunflower-meatmeal   52.007576 -15.224388 119.23954 0.2206962
+## sunflower-casein      5.333333 -60.420825  71.08749 0.9998902
+
##we can visualize the confidence intervals of the differences in mean chick weights
+plot(TukeyHSD(AmodelFit,ordered = TRUE))
+

+

Not the adjusted p-value for the soybean-linseed comparison is different (0.793 vs 0.199) from what we obtained using the two-sample, two-sided t-test. The resulting confidence interval of this difference is also wider.

+
t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE)
+
## 
+##  Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3208, df = 24, p-value = 0.199
+## alternative hypothesis: true difference in means is not equal to 0
+## 95 percent confidence interval:
+##  -70.92996  15.57282
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

Why did we have to correct for multiple testing, the fact we are interested in 15 comparisons? One way to try and justify it for yourself is to run through the following thought experiment.

+
    +
  1. Suppose we set the significance threshold, Type I error at 0.05 as before.

  2. +
  3. We will assume that of the 15 comparisons there are 8 comparisons that indicate actual difference in mean weights.

  4. +
  5. We will also assume that the statistical power to detect true differences is 0.8.

  6. +
  7. Then of the 8 true differences in feeds, we would expect to detect \(8\times0.8\approx6\) of these findings.

  8. +
  9. Of the 15 comparisons we would expect to see \(15\times0.05\approx 1\) false-positive.

  10. +
  11. Therefore the expected false discovery rate is \(\frac{1}{1+6}=\frac{1}{7}=14\%\), higher than what we would have typically expected.

  12. +
  13. This situation is further exacerbated if the number of true differences in the 15 comparisons is smaller. For example, if the number of true differences is 4. Then the expected number of these detected would be \(4\times0.8\approx3\). The false discovery rate would then be \(\frac{1}{1+3}=\frac{1}{4}=25\%\). Again, something which is generally considered unacceptable.

  14. +
  15. Intuitively in this situation, one way to reduce the false discovery rate is to reduce Type I error, make it smaller than 0.05. This is what some of mulitple testing methods are implicitly doing.

  16. +
+

The ANOVA tests have two important assumptions that we will need to check. 1. Normality of the responses 2. Equality of variances of chick weights in each of the feed groups.

+

We will now check these assumptions

+
##NORMALITY
+shapiro.test(chickwts$weight[chickwts$feed == "casein"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "casein"]
+## W = 0.91663, p-value = 0.2592
+
shapiro.test(chickwts$weight[chickwts$feed == "horsebean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "horsebean"]
+## W = 0.93758, p-value = 0.5264
+
shapiro.test(chickwts$weight[chickwts$feed == "linseed"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "linseed"]
+## W = 0.96931, p-value = 0.9035
+
shapiro.test(chickwts$weight[chickwts$feed == "meatmeal"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "meatmeal"]
+## W = 0.97914, p-value = 0.9612
+
shapiro.test(chickwts$weight[chickwts$feed == "soybean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "soybean"]
+## W = 0.9464, p-value = 0.5064
+
shapiro.test(chickwts$weight[chickwts$feed == "sunflower"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "sunflower"]
+## W = 0.92809, p-value = 0.3603
+

The normality assumptions does not seem to be violated. Now, we will test for equality of variances,

+
##EQUALITY OF VARIANCES
+bf.test(weight ~ feed, chickwts)
+
## 
+##   Brown-Forsythe Test 
+## --------------------------------------------------------- 
+##   data : weight and feed 
+## 
+##   statistic  : 15.51945 
+##   num df     : 5 
+##   denom df   : 58.65021 
+##   p.value    : 1.044886e-09 
+## 
+##   Result     : Difference is statistically significant. 
+## ---------------------------------------------------------
+

The variances are significantly different from zero. So one of the assumptions of one-way ANOVA is violated. So we need to do something different. Some of you may think we would need to take recourse to a non-parametric test. The Kruskall-Wallis test is the non-parametric version of one-way ANOVA.

+
kruskal.test(weight ~ feed, chickwts)
+
## 
+##  Kruskal-Wallis rank sum test
+## 
+## data:  weight by feed
+## Kruskal-Wallis chi-squared = 37.343, df = 5, p-value = 5.113e-07
+

Unfortunately, we cannot use this test because it requires the assumption that the variances of chick weights to be same for the different feeds - this assumption is violated. So the next alternative would be to perform a bunch of pair-wise Welch t-tests

+
##We will also store the resulting p-values from the resulting pair-wise tests. Just for this workshop we compare the mean chick weights of all other feeds to the casein feed estimate 
+pValueFeedComparisons <- vector(mode = "numeric")
+pValueFeedComparisons[1] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="horsebean"])$p.value
+pValueFeedComparisons[2] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="linseed"])$p.value
+pValueFeedComparisons[3] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="meatmeal"])$p.value
+pValueFeedComparisons[4] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="soybean"])$p.value
+pValueFeedComparisons[5] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="sunflower"])$p.value
+names(pValueFeedComparisons) <- c("horsebean", "linseed", "meatmeal", "soybean", "sunflower")
+
+print(pValueFeedComparisons)
+
##    horsebean      linseed     meatmeal      soybean    sunflower 
+## 7.210250e-07 2.606217e-04 9.866222e-02 3.521263e-03 8.215118e-01
+

We have performed several tests and would need correct for multiple tests. We have seen that the Tukey post hoc tests provide one way for correcting for multiple testing. There are multiple ways for correcting for multiple testing. Each has a set of assumptions and also a particular criteria it is controlling for. Here we will use two ways of correcting for multiple testing: the Holm-Sidak method and the Benjamini-Hochberg method. The Holm-Sidak method controls for something called a family-wise error rate (similar to what the Tukey method was controlling for) , or the probability on repeating these experiments over and over that you will observe at least one false positive. The BH or Benjamini-Hochberg method controls for False-Discovery Rate, or the expected (over repeated experiments) fraction of false positives among the rejected hypothesis. The Holm-Sidak method is more conservative while the BH method has more tolerance for false-positives

+
##load the require library
+suppressMessages(library(multtest))
+
## Warning: package 'multtest' was built under R version 3.5.1
+
## Warning: package 'BiocGenerics' was built under R version 3.5.1
+
## Warning: package 'Biobase' was built under R version 3.5.1
+
MCor <- mt.rawp2adjp(pValueFeedComparisons, proc = c("SidakSD", "BH"))
+MCorResults <- cbind(pValueFeedComparisons[MCor$index], MCor$adjp)
+head(MCorResults)
+
##                                rawp      SidakSD           BH
+## horsebean 7.210250e-07 7.210250e-07 3.605120e-06 3.605125e-06
+## linseed   2.606217e-04 2.606217e-04 1.042079e-03 6.515541e-04
+## soybean   3.521263e-03 3.521263e-03 1.052664e-02 5.868772e-03
+## meatmeal  9.866222e-02 9.866222e-02 1.875902e-01 1.233278e-01
+## sunflower 8.215118e-01 8.215118e-01 8.215118e-01 8.215118e-01
+

You will the results in the SidakSD and BH columns as adjusted p-values

+
+
+
+

Linear modeling

+

Linear modeling provides a really flexible approach to statistical hypothesis testing. Simple Linear Regression is a special case of linear models. In fact, the above one-way ANOVA tests could as well be performed using linear models. Let us start with an example for Simple Linear Regression.

+
?cars
+summary(cars)
+
##      speed           dist       
+##  Min.   : 4.0   Min.   :  2.00  
+##  1st Qu.:12.0   1st Qu.: 26.00  
+##  Median :15.0   Median : 36.00  
+##  Mean   :15.4   Mean   : 42.98  
+##  3rd Qu.:19.0   3rd Qu.: 56.00  
+##  Max.   :25.0   Max.   :120.00
+
ggplot(cars, aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm", se=F)
+

Now, we will estimate the slope and intercept of the best fit line to these data.

+
lmFit <- lm(dist ~ speed, cars)
+summary(lmFit)
+
## 
+## Call:
+## lm(formula = dist ~ speed, data = cars)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -29.069  -9.525  -2.272   9.215  43.201 
+## 
+## Coefficients:
+##             Estimate Std. Error t value Pr(>|t|)    
+## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
+## speed         3.9324     0.4155   9.464 1.49e-12 ***
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 15.38 on 48 degrees of freedom
+## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
+## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
+

There are always assumptions to check for. We will visually attempt to test the assumptions of the linear regression fit. The assumptions tested visually (in order) are:

+
    +
  1. Suitability of a linear (as opposed to some non-linear) model for these data.

  2. +
  3. Normality of residuals (differences between observations and their predictions using the linear model).

  4. +
  5. Homogenity of variances across the fitted/predicted values of distance

  6. +
  7. Influence of outliers on slope estimates

  8. +
+
plot(lmFit)
+

+
+

One-way ANOVA

+

We will now perform a linear model version of the one-way ANOVA test we ran above,

+
ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot()
+

+
lmFit <- lm(weight ~ feed, chickwts)
+summary(lmFit)
+
## 
+## Call:
+## lm(formula = weight ~ feed, data = chickwts)
+## 
+## Residuals:
+##      Min       1Q   Median       3Q      Max 
+## -123.909  -34.413    1.571   38.170  103.091 
+## 
+## Coefficients:
+##               Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)    323.583     15.834  20.436  < 2e-16 ***
+## feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
+## feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
+## feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
+## feedsoybean    -77.155     21.578  -3.576 0.000665 ***
+## feedsunflower    5.333     22.393   0.238 0.812495    
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 54.85 on 65 degrees of freedom
+## Multiple R-squared:  0.5417, Adjusted R-squared:  0.5064 
+## F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10
+
+
+

Two-way ANOVA

+

We will use linear models to test hypothesis in a situation where we are assessing the association of a continuous response variable with two categorical/factor variables.

+
?ToothGrowth
+##numerically summarize look at the Toothgrowth data set
+summary(ToothGrowth)
+
##       len        supp         dose      
+##  Min.   : 4.20   OJ:30   Min.   :0.500  
+##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
+##  Median :19.25           Median :1.000  
+##  Mean   :18.81           Mean   :1.167  
+##  3rd Qu.:25.27           3rd Qu.:2.000  
+##  Max.   :33.90           Max.   :2.000
+
str(ToothGrowth)
+
## 'data.frame':    60 obs. of  3 variables:
+##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
+##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
+##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
+
##we change dose to a factor variable
+ToothGrowth$dose <- as.factor(ToothGrowth$dose)
+str(ToothGrowth)
+
## 'data.frame':    60 obs. of  3 variables:
+##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
+##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
+##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
+

Let us visualize the data now,

+
ggplot(ToothGrowth, aes(x=dose, y=len, color=supp)) + geom_boxplot()
+

+

We will formulate a linear model to estimate the effects of dose and supp.

+
lmFit <- lm(len ~ dose + supp + dose:supp, ToothGrowth)
+summary(lmFit)
+
## 
+## Call:
+## lm(formula = len ~ dose + supp + dose:supp, data = ToothGrowth)
+## 
+## Residuals:
+##    Min     1Q Median     3Q    Max 
+##  -8.20  -2.72  -0.27   2.65   8.27 
+## 
+## Coefficients:
+##              Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)    13.230      1.148  11.521 3.60e-16 ***
+## dose1           9.470      1.624   5.831 3.18e-07 ***
+## dose2          12.830      1.624   7.900 1.43e-10 ***
+## suppVC         -5.250      1.624  -3.233  0.00209 ** 
+## dose1:suppVC   -0.680      2.297  -0.296  0.76831    
+## dose2:suppVC    5.330      2.297   2.321  0.02411 *  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 3.631 on 54 degrees of freedom
+## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7746 
+## F-statistic: 41.56 on 5 and 54 DF,  p-value: < 2.2e-16
+

We will go over the interpretation of the estimates from this linear model fit. Note, we have estimated both the main effects and also interaction effects.

+
+
+
+

Clustered data

+
+

Paired t-test

+
+
+
+

2x2 tables

+
+ + + + +
+ + + + + + + + diff --git a/hypothesis-testing/MuckeLabStatisticsPrimer_2019.docx b/hypothesis-testing/MuckeLabStatisticsPrimer_2019.docx new file mode 100644 index 0000000..6474c94 Binary files /dev/null and b/hypothesis-testing/MuckeLabStatisticsPrimer_2019.docx differ