diff --git a/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.Rmd b/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.Rmd new file mode 100644 index 0000000..0b900a3 --- /dev/null +++ b/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.Rmd @@ -0,0 +1,352 @@ +--- +title: "Hypothesis Testing" +author: "Reuben Thomas" +date: "8/23/2023" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` +##Load the data +An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chicken +```{r chickwts} +##read help page +?chickwts +##load the data.frame +chickwts <- data.frame(chickwts) +##Numerical summary of the chickwts data +summary(chickwts) + + +``` + +Let us look at the data a bit more closely +```{r} +##look at the first few rows +head(chickwts) + +##let us look at the kind of variables in the chickwts data.frame - a numerical variable, weight and a factor variable, feed with 6 levels +str(chickwts) + +``` + +After looking numerically at the data, let us look at it visually. We will plot the relationship between the weight of the chickens and the diet/feed they were on + +## Visualize the data +```{r} +## load the library to be used for plotting +suppressMessages(library(ggplot2)) +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width=0.1) +``` + +## One-sided, one sample t-test + + +Let us now say that we are interested in the linseed feed. And we are interested in a feed that keeps the mean chick weight above 200 units (we want healthy chickens, :)). Does the linseed feed do this? + +We would like to make any resulting claims generalizable to the entire chick population. This despite, we have only chosen (randomly and independently) 12 chicks to be fed with linseed. *All statistical tests* have underlying them something called a _Test statistic_, a number that typically capture what we are interested in testing. In our case, we are interested in the mean chick weight fed with linseed. We will end up using the mean weight scaled by the observed standard deviation of weights. _All Test statistics_ have something called a _sampling distribution_ - the reflects the distribution of the observed statistic over repeated experiments like we just performed. We have just performed one experiment now, sampled 12 chicks and fed them with linseed. The exact mathematical distribution is defined under certain assumptions - these _assumptions_ are important. For our question, a _t-statistic_ has been shown to be a good choice. The sampling distribution of the _t-statistic_ is called a _t distribution_. + +Our _null (uninteresting, skeptical) hypothesis_ is that the mean chick weight after being fed with linseed is less than 200 units. The _alternative_ (when the mean weight is greater than 200) is interesting for us. + +We will use a one-sample, one-sided t-test to answer this question. +```{r} +##load the library to filter the data +suppressMessages(library(dplyr)) +##First we need to get the weights of the chicks fed linseed +LinSeedWeights <- filter(chickwts, feed =="linseed")$weight +print(LinSeedWeights) +##let us again visualize this +boxplot(LinSeedWeights) +abline(h=200, lty=2, col="red") +``` + +Now, we will run the one-sample, one-sided t-test. +```{r} +t.test(LinSeedWeights, mu=200, alternative = "greater") +``` +Based on the p-value and a typical statistical threshold of 0.05, we are unable to reject the hypothesis linseed keeps the mean chick weights below 200 units + + +## Two-sided, two sample t-test +Now, let us move on and say we are interested in testing the differences between linseed and soybean on chick weight. As before, we will create a subset of the original data with the feeds of interest. + +```{r} +SubChickWts <- filter(chickwts, feed=="linseed" | feed=="soybean") + +##let us check the variables in SubChickWts +str(SubChickWts) + +##notice feed still appears to have 6 levels +##drop unused factor levels of feed +SubChickWts$feed <- droplevels(SubChickWts$feed) + +##now check again +str(SubChickWts) + +##let us plot this again +ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width=0.1) +``` + +The distribution of chick weights fed soybean appears to have slightly higher than those fed with linseed. We will compute the mean weights of the checks fed with each kind of feed. +```{r} +##mean of wight with linseed feed +print(mean(SubChickWts$weight[SubChickWts$feed == "linseed"])) +##mean of wight with soybean feed +print(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) +``` + +Can we more formally test this? + +Specifically, can we test how likely it is to observe the given difference in the mean weights of chicks fed the two diets when in fact they do not differ. Given the observed the variance of weights across the chicks on a given feed, may be it is quite likely that we would observe this big of a difference if we were to repeat this experiment over and over again. The two-sample t-test gives us a formal way to test for this difference. The _Test statistic_ is again the _t-statistic_ and the sampling distribution is the _t distribution_ under certain assumptions (see below). + + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE) +``` +The above estimated p-value suggests that we cannot the reject the hypothesis that the mean weights of the two feeds are the same. + +All statistical hypothesis tests come with a bunch lot of assumptions. The version of the t-test we used above assumed that the variances of the chick weights is the same for the two feeds. The t-test has a different version if the variances of the weights are unlikely to be equal called the Welch t-test. + +We will test equality of variances using the Brown-Forsythe test for equality of variances. +```{r} +##First we need to load the required library with this test +suppressMessages(library(onewaytests)) +bf.test(weight ~ feed, SubChickWts) +``` + +You will see that the p-value from this test is not significant, so we can assume variances are equal. Otherwise we would need to run, + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=FALSE) +``` +The t-test also requires the assumption of normality. This is not essential. It has been shown to be quite robust to deviations from normality. In any case, we will test for normality using the Shapiro-Wilk test. +```{r} +shapiro.test(SubChickWts$weight[SubChickWts$feed == "linseed"]) +shapiro.test(SubChickWts$weight[SubChickWts$feed == "soybean"]) +``` +In both the cases, we are unable to reject the hypothesis that the data is normal. + +If however, we ended up rejecting the normality assumption, it may be better to use a non-paramteric test that does not require the normality assumption + +```{r} +##Wilcox test is a non-parametric test that can be used as an alternative to the t-test. Wilcox test does not need the assumption of normality +wilcox.test(weight ~ feed, data=SubChickWts) +``` + + +## Statistical power estimates or Sample Size calculations +We were unable to reject the hypothesis that linseed and soybean feed kept the mean weights of the chicks the same. Let us visualize these data again, +```{r} +ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot()+ geom_jitter(width=0.1) +``` +It however appears that the soybean feed does increase the mean weight over the linseed feed. +If the increase is true then we need more samples to conclude that the soybean feed does increase the mean chick weight in a statistically significant manner. + + +How many more samples would we need? + +To do that we can perform something called a statistical power analyses. We will use a library in R called _pwr_ that will help us with these analyses. Before doing any power analyses, you need to know a bunch of things. + +1. The sampling distribution of the test statistic (t-statistic) + +2. The effect size that you want to have the statistical power to estimate + +3. At what Type I error will you be making claims of statistical significance. This is a number between 0 and 1 (typically 0.05) and represents the fraction of times (when you repeat the same experiment over and over again) when you will claim significance when in fact your null hypothesis is true (there is no difference in the mean weights). + +4. What is the desired statistical power? This is a number between 0 and 1 and represents the fraction of times (when you repeat the same experiment over and over) you want to claim significance at the chosen Type I error, when there is really a difference as captured by the effect size. + +```{r} +suppressMessages(library(pwr)) +##Let us first compute the increase in mean chick weight from linseed to soybean feed +(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])) +##We will now define the effect size,d as the above difference scaled by the standard deviation of the chick weights +d <- ((mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])))/sd(SubChickWts$weight) +##load the library +##Our test statistic is the t-statistic and the relevant function is pwr.t.test +pwr.t.test(d=d, power = 0.8, sig.level = 0.05) +``` +The results says we need to have at least 60 chicks in each feed group to have a 80% chance to making a claim of differences in mean chick weights from linseed to soybean with a significance of 0.05, if there really is this underlying difference. + + +## One-way ANOVA +We will now go back to looking at the distribution of the weights of chicks fed all the diets and not just the above two one. Our null hypothesis is that the mean chick weights is same for all the 6 feeds. Let us visualize the data again, + +```{r} +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot()+ geom_jitter(width=0.1) +``` + +The appropriate test statistic to use here is called the F-statistic, its sampling distribution is called the F-distribution. While the t-distribution captures the sampling distribution of the scaled sample mean or scaled difference of sample means, the F-distribution captures the ratio of variance in the mean chick weights between feed groups versus variance between all observations within a feed group , i.e., + +$$F = \frac{between\ feed\ group\ weight\ variance}{within\ feed\ group\ weight\ variance}$$ +So, intuitively when the mean chick weights are not different between the different feed groups, the variance between these mean weights should be similar to variances of weights within a feed group. That is, under the null hypothesis F will hover around 1. Note, when we say, "within a feed group", we don't specify which particular feed group. This should suggest to you the requirement of the assumption that within feed groups variances are same across all groups. + +We will now run the ANOVA analyses as follows: +```{r} +AmodelFit <- aov(weight ~ feed, data=chickwts) +summary(AmodelFit) +``` + +The significance above suggests that there are feeds resulting in differing mean chick weights. +We don't get information on which pairs are really different from each other. To get this information, we will perform multiple pairwise tests using Tukey's post-hoc tests. + +### Multiple testing +```{r} +TukeyHSD(AmodelFit,ordered = TRUE) +##we can visualize the confidence intervals of the differences in mean chick weights +plot(TukeyHSD(AmodelFit,ordered = TRUE)) +``` + + Note the adjusted p-value for the soybean-linseed comparison is different (0.793 vs 0.199) from what we obtained using the two-sample, two-sided t-test. The resulting confidence interval of this difference is also wider. + + +```{r} +t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE) +``` +Why did we have to correct for multiple testing, the fact we are interested in 15 comparisons? One way to try and justify it for yourself is to run through the following thought experiment. + +1. Suppose we set the significance threshold, Type I error at 0.05 as before. + +2. We will assume that of the 15 comparisons there are 8 comparisons that indicate actual difference in mean weights. + +3. We will also assume that the statistical power to detect true differences is 0.8. + +4. Then of the 8 true differences in feeds, we would expect to detect $8\times0.8\approx6$ of these findings. + +5. Of the 15 comparisons we would expect to see $15\times0.05\approx 1$ false-positive. + +6. Therefore the expected false discovery rate is $\frac{1}{1+6}=\frac{1}{7}=14\%$, higher than what we would have typically expected. + +7. This situation is further exacerbated if the number of true differences in the 15 comparisons is smaller. For example, if the number of true differences is 4. Then the expected number of these detected would be $4\times0.8\approx3$. The false discovery rate would then be $\frac{1}{1+3}=\frac{1}{4}=25\%$. Again, something which is generally considered unacceptable. + +8. Intuitively in this situation, one way to reduce the false discovery rate is to reduce Type I error, make it smaller than 0.05. This is what some of mulitple testing methods are implicitly doing. + +The ANOVA tests have two important assumptions that we will need to check. +1. Normality of the responses +2. Equality of variances of chick weights in each of the feed groups. + +We will now check these assumptions + +```{r} +##NORMALITY +shapiro.test(chickwts$weight[chickwts$feed == "casein"]) +shapiro.test(chickwts$weight[chickwts$feed == "horsebean"]) +shapiro.test(chickwts$weight[chickwts$feed == "linseed"]) +shapiro.test(chickwts$weight[chickwts$feed == "meatmeal"]) +shapiro.test(chickwts$weight[chickwts$feed == "soybean"]) +shapiro.test(chickwts$weight[chickwts$feed == "sunflower"]) +``` + +The normality assumptions does not seem to be violated. Now, we will test for equality of variances, +```{r} +## EQUALITY OF VARIANCES +bf.test(weight ~ feed, chickwts) +``` +The variances are significantly different from zero. So one of the assumptions of one-way ANOVA is violated. So we need to do something different. Some of you may think we would need to take recourse to a non-parametric test. The Kruskall-Wallis test is the non-parametric version of one-way ANOVA. + +```{r} +kruskal.test(weight ~ feed, chickwts) +``` + + +Unfortunately, we cannot use this test because it requires the assumption that the variances of chick weights to be same for the different feeds - this assumption is violated. So the next alternative would be to perform a bunch of pair-wise Welch t-tests + +```{r} +##We will also store the resulting p-values from the resulting pair-wise tests. Just for this workshop we compare the mean chick weights of all other feeds to the casein feed estimate +pValueFeedComparisons <- vector(mode = "numeric") +pValueFeedComparisons[1] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="horsebean"])$p.value +pValueFeedComparisons[2] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="linseed"])$p.value +pValueFeedComparisons[3] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="meatmeal"])$p.value +pValueFeedComparisons[4] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="soybean"])$p.value +pValueFeedComparisons[5] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="sunflower"])$p.value +names(pValueFeedComparisons) <- c("horsebean", "linseed", "meatmeal", "soybean", "sunflower") + +print(pValueFeedComparisons) +``` + + +We have performed several tests and would need correct for multiple tests. We have seen that the Tukey post hoc tests provide one way for correcting for multiple testing. There are multiple ways for correcting for multiple testing. Each has a set of assumptions and also a particular criteria it is controlling for. Here we will use two ways of correcting for multiple testing: the _Holm-Sidak_ method and the _Benjamini-Hochberg_ method. The _Holm-Sidak_ method controls for something called a family-wise error rate (similar to what the _Tukey_ method was controlling for) , or the probability on repeating these experiments over and over that you will observe at least one false positive. The _BH_ or _Benjamini-Hochberg_ method controls for False-Discovery Rate, or the expected (over repeated experiments) fraction of false positives among the rejected hypothesis. The _Holm-Sidak_ method is more conservative while the _BH_ method has more tolerance for false-positives + +```{r} +MCor <- p.adjust(pValueFeedComparisons, method = c("BH")) +MCorResults <- data.frame(pValueFeedComparisons, adj.p=p.adjust(pValueFeedComparisons, method = "holm")) +print(MCorResults) +``` +You will the results in the SidakSD and BH columns as adjusted p-values + +## Linear modeling +Linear modeling provides a really flexible approach to statistical hypothesis testing. Simple Linear Regression is a special case of linear models. In fact, the above one-way ANOVA tests could as well be performed using linear models. Let us start with an example for Simple Linear Regression. + +```{r} +?cars +summary(cars) +ggplot(cars, aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm", se=F) +``` +Now, we will estimate the slope and intercept of the best fit line to these data. + +```{r} +lmFit <- lm(dist ~ speed, cars) +summary(lmFit) +``` + +There are always assumptions to check for. We will visually attempt to test the assumptions of the linear regression fit. The assumptions tested visually (in order) are: + +1. Suitability of a linear (as opposed to some non-linear) model for these data. + +2. Normality of residuals (differences between observations and their predictions using the linear model). + +3. Homogeneity of variances across the fitted/predicted values of distance + +4. Influence of outliers on slope estimates + +See https://data.library.virginia.edu/diagnostic-plots/ for an elaboration of what each of these plots mean + +```{r} +plot(lmFit) +``` + + +### One-way ANOVA +We will now perform a linear model version of the one-way ANOVA test we ran above, + +```{r} +ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width = 0.1) +lmFit <- lm(weight ~ feed, chickwts) +print(levels(chickwts$feed)) +summary(lmFit) +``` + + +### Two-way ANOVA +We will use linear models to test hypothesis in a situation where we are assessing the association of a continuous response variable with two categorical/factor variables. + +```{r} +?ToothGrowth +##numerically summarize look at the Toothgrowth data set +summary(ToothGrowth) +str(ToothGrowth) +##we change dose to a factor variable +ToothGrowth$dose <- as.numeric(ToothGrowth$dose) +str(ToothGrowth) +``` + + +Let us visualize the data now, + +```{r} + +ggplot(ToothGrowth, aes(x=dose, y=len, color=supp)) + geom_boxplot() +``` + +We will formulate a linear model to estimate the effects of _dose_ and _supp_. + +```{r} +lmFit <- lm(len ~ dose + supp + dose:supp, ToothGrowth) +summary(lmFit) + +``` +We will go over the interpretation of the estimates from this linear model fit. Note, we have estimated both the main effects and also interaction effects. + +Note, per our discussion in class yesterday the variable _supp_ can be considered to moderate the influence of tooth growth on the _dose_ of vitamin C. + + diff --git a/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.html b/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.html new file mode 100644 index 0000000..ced621c --- /dev/null +++ b/intro-experimental-design-hypothesis-testing/August_2023/Code/IntroStatsExpDesignHT_082323_demo_day3.html @@ -0,0 +1,1057 @@ + + + + + + + + + + + + + + +Hypothesis Testing + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

##Load the data An experiment was conducted to measure and compare +the effectiveness of various feed supplements on the growth rate of +chicken

+
##read help page
+?chickwts
+##load the data.frame
+chickwts <- data.frame(chickwts)
+##Numerical summary of the chickwts data
+summary(chickwts)
+
##      weight             feed   
+##  Min.   :108.0   casein   :12  
+##  1st Qu.:204.5   horsebean:10  
+##  Median :258.0   linseed  :12  
+##  Mean   :261.3   meatmeal :11  
+##  3rd Qu.:323.5   soybean  :14  
+##  Max.   :423.0   sunflower:12
+

Let us look at the data a bit more closely

+
##look at the first few rows
+head(chickwts)
+
##   weight      feed
+## 1    179 horsebean
+## 2    160 horsebean
+## 3    136 horsebean
+## 4    227 horsebean
+## 5    217 horsebean
+## 6    168 horsebean
+
##let us look at the kind of variables in the chickwts data.frame - a numerical variable, weight and a factor variable, feed with 6 levels
+str(chickwts)
+
## 'data.frame':    71 obs. of  2 variables:
+##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
+##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
+

After looking numerically at the data, let us look at it visually. We +will plot the relationship between the weight of the chickens and the +diet/feed they were on

+
+

Visualize the data

+
## load the library to be used for plotting
+suppressMessages(library(ggplot2))
+ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width=0.1)
+

+
+
+

One-sided, one sample t-test

+

Let us now say that we are interested in the linseed feed. And we are +interested in a feed that keeps the mean chick weight above 200 units +(we want healthy chickens, :)). Does the linseed feed do this?

+

We would like to make any resulting claims generalizable to the +entire chick population. This despite, we have only chosen (randomly and +independently) 12 chicks to be fed with linseed. All statistical +tests have underlying them something called a Test +statistic, a number that typically capture what we are interested +in testing. In our case, we are interested in the mean chick weight fed +with linseed. We will end up using the mean weight scaled by the +observed standard deviation of weights. All Test statistics +have something called a sampling distribution - the reflects +the distribution of the observed statistic over repeated experiments +like we just performed. We have just performed one experiment now, +sampled 12 chicks and fed them with linseed. The exact mathematical +distribution is defined under certain assumptions - these +assumptions are important. For our question, a +t-statistic has been shown to be a good choice. The sampling +distribution of the t-statistic is called a t +distribution.

+

Our null (uninteresting, skeptical) hypothesis is that the +mean chick weight after being fed with linseed is less than 200 units. +The alternative (when the mean weight is greater than 200) is +interesting for us.

+

We will use a one-sample, one-sided t-test to answer this +question.

+
##load the library to filter the data
+suppressMessages(library(dplyr))
+##First we need to get the weights of the chicks fed linseed
+LinSeedWeights <- filter(chickwts, feed =="linseed")$weight
+print(LinSeedWeights)
+
##  [1] 309 229 181 141 260 203 148 169 213 257 244 271
+
##let us again visualize this
+boxplot(LinSeedWeights)
+abline(h=200, lty=2, col="red")
+

+

Now, we will run the one-sample, one-sided t-test.

+
t.test(LinSeedWeights, mu=200, alternative = "greater")
+
## 
+##  One Sample t-test
+## 
+## data:  LinSeedWeights
+## t = 1.2434, df = 11, p-value = 0.1198
+## alternative hypothesis: true mean is greater than 200
+## 95 percent confidence interval:
+##  191.6696      Inf
+## sample estimates:
+## mean of x 
+##    218.75
+

Based on the p-value and a typical statistical threshold of 0.05, we +are unable to reject the hypothesis linseed keeps the mean chick weights +below 200 units

+
+
+

Two-sided, two sample t-test

+

Now, let us move on and say we are interested in testing the +differences between linseed and soybean on chick weight. As before, we +will create a subset of the original data with the feeds of +interest.

+
SubChickWts <- filter(chickwts, feed=="linseed" | feed=="soybean")
+
+##let us check the variables in SubChickWts
+str(SubChickWts)
+
## 'data.frame':    26 obs. of  2 variables:
+##  $ weight: num  309 229 181 141 260 203 148 169 213 257 ...
+##  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 3 3 3 3 3 3 3 3 3 3 ...
+
##notice feed still appears to have 6 levels
+##drop unused factor levels of feed
+SubChickWts$feed <- droplevels(SubChickWts$feed)
+
+##now check again
+str(SubChickWts)
+
## 'data.frame':    26 obs. of  2 variables:
+##  $ weight: num  309 229 181 141 260 203 148 169 213 257 ...
+##  $ feed  : Factor w/ 2 levels "linseed","soybean": 1 1 1 1 1 1 1 1 1 1 ...
+
##let us plot this again
+ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width=0.1)
+

+

The distribution of chick weights fed soybean appears to have +slightly higher than those fed with linseed. We will compute the mean +weights of the checks fed with each kind of feed.

+
##mean of wight with linseed feed
+print(mean(SubChickWts$weight[SubChickWts$feed == "linseed"]))
+
## [1] 218.75
+
##mean of wight with soybean feed
+print(mean(SubChickWts$weight[SubChickWts$feed == "soybean"]))
+
## [1] 246.4286
+

Can we more formally test this?

+

Specifically, can we test how likely it is to observe the given +difference in the mean weights of chicks fed the two diets when in fact +they do not differ. Given the observed the variance of weights across +the chicks on a given feed, may be it is quite likely that we would +observe this big of a difference if we were to repeat this experiment +over and over again. The two-sample t-test gives us a formal way to test +for this difference. The Test statistic is again the +t-statistic and the sampling distribution is the t +distribution under certain assumptions (see below).

+
t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE)
+
## 
+##  Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3208, df = 24, p-value = 0.199
+## alternative hypothesis: true difference in means between group linseed and group soybean is not equal to 0
+## 95 percent confidence interval:
+##  -70.92996  15.57282
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

The above estimated p-value suggests that we cannot the reject the +hypothesis that the mean weights of the two feeds are the same.

+

All statistical hypothesis tests come with a bunch lot of +assumptions. The version of the t-test we used above assumed that the +variances of the chick weights is the same for the two feeds. The t-test +has a different version if the variances of the weights are unlikely to +be equal called the Welch t-test.

+

We will test equality of variances using the Brown-Forsythe test for +equality of variances.

+
##First we need to load the required library with this test
+suppressMessages(library(onewaytests))
+bf.test(weight ~ feed, SubChickWts)
+
## 
+##   Brown-Forsythe Test (alpha = 0.05) 
+## ------------------------------------------------------------- 
+##   data : weight and feed 
+## 
+##   statistic  : 1.754449 
+##   num df     : 1 
+##   denom df   : 23.62952 
+##   p.value    : 0.1979871 
+## 
+##   Result     : Difference is not statistically significant. 
+## -------------------------------------------------------------
+

You will see that the p-value from this test is not significant, so +we can assume variances are equal. Otherwise we would need to run,

+
t.test(weight ~ feed, data=SubChickWts, var.equal=FALSE)
+
## 
+##  Welch Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3246, df = 23.63, p-value = 0.198
+## alternative hypothesis: true difference in means between group linseed and group soybean is not equal to 0
+## 95 percent confidence interval:
+##  -70.84262  15.48547
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

The t-test also requires the assumption of normality. This is not +essential. It has been shown to be quite robust to deviations from +normality. In any case, we will test for normality using the +Shapiro-Wilk test.

+
shapiro.test(SubChickWts$weight[SubChickWts$feed == "linseed"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  SubChickWts$weight[SubChickWts$feed == "linseed"]
+## W = 0.96931, p-value = 0.9035
+
shapiro.test(SubChickWts$weight[SubChickWts$feed == "soybean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  SubChickWts$weight[SubChickWts$feed == "soybean"]
+## W = 0.9464, p-value = 0.5064
+

In both the cases, we are unable to reject the hypothesis that the +data is normal.

+

If however, we ended up rejecting the normality assumption, it may be +better to use a non-paramteric test that does not require the normality +assumption

+
##Wilcox test is a non-parametric test that can be used as an alternative to the t-test. Wilcox test does not need the assumption of normality
+wilcox.test(weight ~ feed, data=SubChickWts)
+
## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
+## compute exact p-value with ties
+
## 
+##  Wilcoxon rank sum test with continuity correction
+## 
+## data:  weight by feed
+## W = 60.5, p-value = 0.2367
+## alternative hypothesis: true location shift is not equal to 0
+
+
+

Statistical power estimates or Sample Size calculations

+

We were unable to reject the hypothesis that linseed and soybean feed +kept the mean weights of the chicks the same. Let us visualize these +data again,

+
ggplot(SubChickWts, aes(x=feed, y=weight)) + geom_boxplot()+ geom_jitter(width=0.1)
+

+It however appears that the soybean feed does increase the mean weight +over the linseed feed. If the increase is true then we need more samples +to conclude that the soybean feed does increase the mean chick weight in +a statistically significant manner.

+

How many more samples would we need?

+

To do that we can perform something called a statistical power +analyses. We will use a library in R called pwr that will help +us with these analyses. Before doing any power analyses, you need to +know a bunch of things.

+
    +
  1. The sampling distribution of the test statistic +(t-statistic)

  2. +
  3. The effect size that you want to have the statistical power to +estimate

  4. +
  5. At what Type I error will you be making claims of statistical +significance. This is a number between 0 and 1 (typically 0.05) and +represents the fraction of times (when you repeat the same experiment +over and over again) when you will claim significance when in fact your +null hypothesis is true (there is no difference in the mean +weights).

  6. +
  7. What is the desired statistical power? This is a number between 0 +and 1 and represents the fraction of times (when you repeat the same +experiment over and over) you want to claim significance at the chosen +Type I error, when there is really a difference as captured by the +effect size.

  8. +
+
suppressMessages(library(pwr))
+##Let us first compute the increase in mean chick weight from linseed to soybean feed
+(mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"]))
+
## [1] 27.67857
+
##We will now define the effect size,d as the above difference scaled by the standard deviation of the chick weights
+d <- ((mean(SubChickWts$weight[SubChickWts$feed == "soybean"])) - (mean(SubChickWts$weight[SubChickWts$feed == "linseed"])))/sd(SubChickWts$weight)
+##load the library
+##Our test statistic is the t-statistic and the relevant function is pwr.t.test
+pwr.t.test(d=d, power = 0.8, sig.level = 0.05)
+
## 
+##      Two-sample t test power calculation 
+## 
+##               n = 60.85139
+##               d = 0.512026
+##       sig.level = 0.05
+##           power = 0.8
+##     alternative = two.sided
+## 
+## NOTE: n is number in *each* group
+

The results says we need to have at least 60 chicks in each feed +group to have a 80% chance to making a claim of differences in mean +chick weights from linseed to soybean with a significance of 0.05, if +there really is this underlying difference.

+
+
+

One-way ANOVA

+

We will now go back to looking at the distribution of the weights of +chicks fed all the diets and not just the above two one. Our null +hypothesis is that the mean chick weights is same for all the 6 feeds. +Let us visualize the data again,

+
ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot()+ geom_jitter(width=0.1)
+

+

The appropriate test statistic to use here is called the F-statistic, +its sampling distribution is called the F-distribution. While the +t-distribution captures the sampling distribution of the scaled sample +mean or scaled difference of sample means, the F-distribution captures +the ratio of variance in the mean chick weights between feed groups +versus variance between all observations within a feed group , i.e.,

+

\[F = \frac{between\ feed\ group\ weight\ +variance}{within\ feed\ group\ weight\ variance}\] So, +intuitively when the mean chick weights are not different between the +different feed groups, the variance between these mean weights should be +similar to variances of weights within a feed group. That is, under the +null hypothesis F will hover around 1. Note, when we say, “within a feed +group”, we don’t specify which particular feed group. This should +suggest to you the requirement of the assumption that within feed groups +variances are same across all groups.

+

We will now run the ANOVA analyses as follows:

+
AmodelFit <- aov(weight ~ feed, data=chickwts) 
+summary(AmodelFit)
+
##             Df Sum Sq Mean Sq F value   Pr(>F)    
+## feed         5 231129   46226   15.37 5.94e-10 ***
+## Residuals   65 195556    3009                     
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+

The significance above suggests that there are feeds resulting in +differing mean chick weights. We don’t get information on which pairs +are really different from each other. To get this information, we will +perform multiple pairwise tests using Tukey’s post-hoc tests.

+
+

Multiple testing

+
TukeyHSD(AmodelFit,ordered = TRUE)
+
##   Tukey multiple comparisons of means
+##     95% family-wise confidence level
+##     factor levels have been ordered
+## 
+## Fit: aov(formula = weight ~ feed, data = chickwts)
+## 
+## $feed
+##                           diff        lwr       upr     p adj
+## linseed-horsebean    58.550000 -10.413543 127.51354 0.1413329
+## soybean-horsebean    86.228571  19.541684 152.91546 0.0042167
+## meatmeal-horsebean  116.709091  46.335105 187.08308 0.0001062
+## casein-horsebean    163.383333  94.419790 232.34688 0.0000000
+## sunflower-horsebean 168.716667  99.753124 237.68021 0.0000000
+## soybean-linseed      27.678571 -35.683721  91.04086 0.7932853
+## meatmeal-linseed     58.159091  -9.072873 125.39106 0.1276965
+## casein-linseed      104.833333  39.079175 170.58749 0.0002100
+## sunflower-linseed   110.166667  44.412509 175.92082 0.0000884
+## meatmeal-soybean     30.480519 -34.414070  95.37511 0.7391356
+## casein-soybean       77.154762  13.792470 140.51705 0.0083653
+## sunflower-soybean    82.488095  19.125803 145.85039 0.0038845
+## casein-meatmeal      46.674242 -20.557722 113.90621 0.3324584
+## sunflower-meatmeal   52.007576 -15.224388 119.23954 0.2206962
+## sunflower-casein      5.333333 -60.420825  71.08749 0.9998902
+
##we can visualize the confidence intervals of the differences in mean chick weights
+plot(TukeyHSD(AmodelFit,ordered = TRUE))
+

+

Note the adjusted p-value for the soybean-linseed comparison is +different (0.793 vs 0.199) from what we obtained using the two-sample, +two-sided t-test. The resulting confidence interval of this difference +is also wider.

+
t.test(weight ~ feed, data=SubChickWts, var.equal=TRUE)
+
## 
+##  Two Sample t-test
+## 
+## data:  weight by feed
+## t = -1.3208, df = 24, p-value = 0.199
+## alternative hypothesis: true difference in means between group linseed and group soybean is not equal to 0
+## 95 percent confidence interval:
+##  -70.92996  15.57282
+## sample estimates:
+## mean in group linseed mean in group soybean 
+##              218.7500              246.4286
+

Why did we have to correct for multiple testing, the fact we are +interested in 15 comparisons? One way to try and justify it for yourself +is to run through the following thought experiment.

+
    +
  1. Suppose we set the significance threshold, Type I error at 0.05 +as before.

  2. +
  3. We will assume that of the 15 comparisons there are 8 comparisons +that indicate actual difference in mean weights.

  4. +
  5. We will also assume that the statistical power to detect true +differences is 0.8.

  6. +
  7. Then of the 8 true differences in feeds, we would expect to +detect \(8\times0.8\approx6\) of these +findings.

  8. +
  9. Of the 15 comparisons we would expect to see \(15\times0.05\approx 1\) +false-positive.

  10. +
  11. Therefore the expected false discovery rate is \(\frac{1}{1+6}=\frac{1}{7}=14\%\), higher +than what we would have typically expected.

  12. +
  13. This situation is further exacerbated if the number of true +differences in the 15 comparisons is smaller. For example, if the number +of true differences is 4. Then the expected number of these detected +would be \(4\times0.8\approx3\). The +false discovery rate would then be \(\frac{1}{1+3}=\frac{1}{4}=25\%\). Again, +something which is generally considered unacceptable.

  14. +
  15. Intuitively in this situation, one way to reduce the false +discovery rate is to reduce Type I error, make it smaller than 0.05. +This is what some of mulitple testing methods are implicitly +doing.

  16. +
+

The ANOVA tests have two important assumptions that we will need to +check. 1. Normality of the responses 2. Equality of variances of chick +weights in each of the feed groups.

+

We will now check these assumptions

+
##NORMALITY
+shapiro.test(chickwts$weight[chickwts$feed == "casein"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "casein"]
+## W = 0.91663, p-value = 0.2592
+
shapiro.test(chickwts$weight[chickwts$feed == "horsebean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "horsebean"]
+## W = 0.93758, p-value = 0.5264
+
shapiro.test(chickwts$weight[chickwts$feed == "linseed"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "linseed"]
+## W = 0.96931, p-value = 0.9035
+
shapiro.test(chickwts$weight[chickwts$feed == "meatmeal"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "meatmeal"]
+## W = 0.97914, p-value = 0.9612
+
shapiro.test(chickwts$weight[chickwts$feed == "soybean"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "soybean"]
+## W = 0.9464, p-value = 0.5064
+
shapiro.test(chickwts$weight[chickwts$feed == "sunflower"])
+
## 
+##  Shapiro-Wilk normality test
+## 
+## data:  chickwts$weight[chickwts$feed == "sunflower"]
+## W = 0.92809, p-value = 0.3603
+

The normality assumptions does not seem to be violated. Now, we will +test for equality of variances,

+
## EQUALITY OF VARIANCES
+bf.test(weight ~ feed, chickwts)
+
## 
+##   Brown-Forsythe Test (alpha = 0.05) 
+## ------------------------------------------------------------- 
+##   data : weight and feed 
+## 
+##   statistic  : 15.51945 
+##   num df     : 5 
+##   denom df   : 58.65021 
+##   p.value    : 1.044886e-09 
+## 
+##   Result     : Difference is statistically significant. 
+## -------------------------------------------------------------
+

The variances are significantly different from zero. So one of the +assumptions of one-way ANOVA is violated. So we need to do something +different. Some of you may think we would need to take recourse to a +non-parametric test. The Kruskall-Wallis test is the non-parametric +version of one-way ANOVA.

+
kruskal.test(weight ~ feed, chickwts)
+
## 
+##  Kruskal-Wallis rank sum test
+## 
+## data:  weight by feed
+## Kruskal-Wallis chi-squared = 37.343, df = 5, p-value = 5.113e-07
+

Unfortunately, we cannot use this test because it requires the +assumption that the variances of chick weights to be same for the +different feeds - this assumption is violated. So the next alternative +would be to perform a bunch of pair-wise Welch t-tests

+
##We will also store the resulting p-values from the resulting pair-wise tests. Just for this workshop we compare the mean chick weights of all other feeds to the casein feed estimate 
+pValueFeedComparisons <- vector(mode = "numeric")
+pValueFeedComparisons[1] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="horsebean"])$p.value
+pValueFeedComparisons[2] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="linseed"])$p.value
+pValueFeedComparisons[3] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="meatmeal"])$p.value
+pValueFeedComparisons[4] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="soybean"])$p.value
+pValueFeedComparisons[5] <- t.test(chickwts$weight[chickwts$feed=="casein"], chickwts$weight[chickwts$feed=="sunflower"])$p.value
+names(pValueFeedComparisons) <- c("horsebean", "linseed", "meatmeal", "soybean", "sunflower")
+
+print(pValueFeedComparisons)
+
##    horsebean      linseed     meatmeal      soybean    sunflower 
+## 7.210250e-07 2.606217e-04 9.866222e-02 3.521263e-03 8.215118e-01
+

We have performed several tests and would need correct for multiple +tests. We have seen that the Tukey post hoc tests provide one way for +correcting for multiple testing. There are multiple ways for correcting +for multiple testing. Each has a set of assumptions and also a +particular criteria it is controlling for. Here we will use two ways of +correcting for multiple testing: the Holm-Sidak method and the +Benjamini-Hochberg method. The Holm-Sidak method +controls for something called a family-wise error rate (similar to what +the Tukey method was controlling for) , or the probability on +repeating these experiments over and over that you will observe at least +one false positive. The BH or Benjamini-Hochberg +method controls for False-Discovery Rate, or the expected (over repeated +experiments) fraction of false positives among the rejected hypothesis. +The Holm-Sidak method is more conservative while the +BH method has more tolerance for false-positives

+
MCor <- p.adjust(pValueFeedComparisons, method = c("BH"))
+MCorResults <- data.frame(pValueFeedComparisons, adj.p=p.adjust(pValueFeedComparisons, method = "holm"))
+print(MCorResults)
+
##           pValueFeedComparisons        adj.p
+## horsebean          7.210250e-07 3.605125e-06
+## linseed            2.606217e-04 1.042487e-03
+## meatmeal           9.866222e-02 1.973244e-01
+## soybean            3.521263e-03 1.056379e-02
+## sunflower          8.215118e-01 8.215118e-01
+

You will the results in the SidakSD and BH columns as adjusted +p-values

+
+
+
+

Linear modeling

+

Linear modeling provides a really flexible approach to statistical +hypothesis testing. Simple Linear Regression is a special case of linear +models. In fact, the above one-way ANOVA tests could as well be +performed using linear models. Let us start with an example for Simple +Linear Regression.

+
?cars
+summary(cars)
+
##      speed           dist       
+##  Min.   : 4.0   Min.   :  2.00  
+##  1st Qu.:12.0   1st Qu.: 26.00  
+##  Median :15.0   Median : 36.00  
+##  Mean   :15.4   Mean   : 42.98  
+##  3rd Qu.:19.0   3rd Qu.: 56.00  
+##  Max.   :25.0   Max.   :120.00
+
ggplot(cars, aes(x=speed, y=dist)) + geom_point() + geom_smooth(method = "lm", se=F)
+
## `geom_smooth()` using formula = 'y ~ x'
+

+Now, we will estimate the slope and intercept of the best fit line to +these data.

+
lmFit <- lm(dist ~ speed, cars)
+summary(lmFit)
+
## 
+## Call:
+## lm(formula = dist ~ speed, data = cars)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -29.069  -9.525  -2.272   9.215  43.201 
+## 
+## Coefficients:
+##             Estimate Std. Error t value Pr(>|t|)    
+## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
+## speed         3.9324     0.4155   9.464 1.49e-12 ***
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 15.38 on 48 degrees of freedom
+## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
+## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
+

There are always assumptions to check for. We will visually attempt +to test the assumptions of the linear regression fit. The assumptions +tested visually (in order) are:

+
    +
  1. Suitability of a linear (as opposed to some non-linear) model for +these data.

  2. +
  3. Normality of residuals (differences between observations and +their predictions using the linear model).

  4. +
  5. Homogeneity of variances across the fitted/predicted values of +distance

  6. +
  7. Influence of outliers on slope estimates

  8. +
+

See https://data.library.virginia.edu/diagnostic-plots/ for +an elaboration of what each of these plots mean

+
plot(lmFit)
+

+
+

One-way ANOVA

+

We will now perform a linear model version of the one-way ANOVA test +we ran above,

+
ggplot(chickwts, aes(x=feed, y=weight)) + geom_boxplot() + geom_jitter(width = 0.1)
+

+
lmFit <- lm(weight ~ feed, chickwts)
+print(levels(chickwts$feed))
+
## [1] "casein"    "horsebean" "linseed"   "meatmeal"  "soybean"   "sunflower"
+
summary(lmFit)
+
## 
+## Call:
+## lm(formula = weight ~ feed, data = chickwts)
+## 
+## Residuals:
+##      Min       1Q   Median       3Q      Max 
+## -123.909  -34.413    1.571   38.170  103.091 
+## 
+## Coefficients:
+##               Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)    323.583     15.834  20.436  < 2e-16 ***
+## feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
+## feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
+## feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
+## feedsoybean    -77.155     21.578  -3.576 0.000665 ***
+## feedsunflower    5.333     22.393   0.238 0.812495    
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 54.85 on 65 degrees of freedom
+## Multiple R-squared:  0.5417, Adjusted R-squared:  0.5064 
+## F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10
+
+
+

Two-way ANOVA

+

We will use linear models to test hypothesis in a situation where we +are assessing the association of a continuous response variable with two +categorical/factor variables.

+
?ToothGrowth
+##numerically summarize look at the Toothgrowth data set
+summary(ToothGrowth)
+
##       len        supp         dose      
+##  Min.   : 4.20   OJ:30   Min.   :0.500  
+##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
+##  Median :19.25           Median :1.000  
+##  Mean   :18.81           Mean   :1.167  
+##  3rd Qu.:25.27           3rd Qu.:2.000  
+##  Max.   :33.90           Max.   :2.000
+
str(ToothGrowth)
+
## 'data.frame':    60 obs. of  3 variables:
+##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
+##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
+##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
+
##we change dose to a factor variable
+ToothGrowth$dose <- as.numeric(ToothGrowth$dose)
+str(ToothGrowth)
+
## 'data.frame':    60 obs. of  3 variables:
+##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
+##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
+##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
+

Let us visualize the data now,

+
ggplot(ToothGrowth, aes(x=dose, y=len, color=supp)) + geom_boxplot() 
+

+

We will formulate a linear model to estimate the effects of +dose and supp.

+
lmFit <- lm(len ~ dose + supp + dose:supp, ToothGrowth)
+summary(lmFit)
+
## 
+## Call:
+## lm(formula = len ~ dose + supp + dose:supp, data = ToothGrowth)
+## 
+## Residuals:
+##     Min      1Q  Median      3Q     Max 
+## -8.2264 -2.8462  0.0504  2.2893  7.9386 
+## 
+## Coefficients:
+##             Estimate Std. Error t value Pr(>|t|)    
+## (Intercept)   11.550      1.581   7.304 1.09e-09 ***
+## dose           7.811      1.195   6.534 2.03e-08 ***
+## suppVC        -8.255      2.236  -3.691 0.000507 ***
+## dose:suppVC    3.904      1.691   2.309 0.024631 *  
+## ---
+## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+## 
+## Residual standard error: 4.083 on 56 degrees of freedom
+## Multiple R-squared:  0.7296, Adjusted R-squared:  0.7151 
+## F-statistic: 50.36 on 3 and 56 DF,  p-value: 6.521e-16
+

We will go over the interpretation of the estimates from this linear +model fit. Note, we have estimated both the main effects and also +interaction effects.

+

Note, per our discussion in class yesterday the variable +supp can be considered to moderate the influence of tooth +growth on the dose of vitamin C.

+
+
+ + + + +
+ + + + + + + + + + + + + + + diff --git a/intro-experimental-design-hypothesis-testing/August_2023/Presentations/IntroStatsExpDesignHT_082323_day1.pptx b/intro-experimental-design-hypothesis-testing/August_2023/Presentations/IntroStatsExpDesignHT_082323_day1.pptx new file mode 100644 index 0000000..8ab4a4b Binary files /dev/null and b/intro-experimental-design-hypothesis-testing/August_2023/Presentations/IntroStatsExpDesignHT_082323_day1.pptx differ