make some changes to the hands on

2025-11-30 09:45:43 -08:00 · 2024-01-18 11:18:04 -08:00 · 2024-01-18 11:18:04 -08:00 · 719fd6fb23
commit 719fd6fb23
parent 6380a82826
4 changed files with 173 additions and 1520 deletions
--- a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.Rmd
+++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.Rmd
@ -10,9 +10,9 @@ knitr::opts_chunk$set(echo = TRUE)
 # Load packages
 library(dplyr)    # tidyverse data frame manipulation package
 library(tidyr)    # functions to help clean data
-library(magrittr) # this package provides the pipe operator %>%
 library(readxl)   # read excel files
 library(ggplot2)  # highly customizable plots
+library(broom)    # clean up linear model results
 ```

 ## R Markdown
@ -30,11 +30,9 @@ hist(val, breaks = 20)

 ```{r}
 # The ggplot version of the same plot
-
 ggplot(data = tibble(values = val),
-         mapping = aes(x = values))+
-  geom_histogram(bins = 20)
-
+       mapping = aes(x = values)) +
+geom_histogram(bins = 20)
 ```


@ -53,29 +51,21 @@ The data we will be analyzing is from the PanTHERIA database which is "a global
 # na = "NA" tells read_xlsx how missing values appear in the data
 # the default is empty cells. Run "?read_xlsx" for more info

-sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
-sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
-sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
-
-
-colnames(sheet1) == colnames(sheet2) & colnames(sheet2) == colnames(sheet3)

 # rbind (row-bind) combines data frames by row

-pantheria <- rbind(sheet1, sheet2, sheet3)
 ```


 ```{r}
 # How many rows and columns are there?
-nrow(pantheria)
-ncol(pantheria)
+
 ```


 ```{r}
 # What does the data look like?
-head(pantheria)
+
 ```


@ -88,66 +78,36 @@ We will exploring adult body mass from these mammals as it relates to their trop

 Let's start by subsetting the data with `select()`
 ```{r}
-# Pipes (%>%) work by passing the data in front of the pipe to the first argument
+# Pipes (|>) work by passing the data in front of the pipe to the first argument
 # of the function after it, this prevents a lot of nested function calls and makes
 # code easier to read.

-pantheria <- pantheria %>%
-  select(Order,
-         Family,
-         Genus,
-         Species,
-         TrophicLevel,
-         AdultBodyMass_g) %>%
-  drop_na() %>%
-  distinct()
-
-pantheria <- select(pantheria,
-                    Order,
-                     Family,
-                     Genus,
-                     Species,
-                     TrophicLevel,
-                     AdultBodyMass_g)
-pantheria <- drop_na(pantheria)
-pantheria <- distinct(pantheria)
-
-
-

 ```

 Data is almost never clean, for example there should be only 3 trophic levels:
 ```{r}
-# unique elements of a vector
-unique(pantheria$TrophicLevel)
+
 ```

 Let's fix the TrophicLevel column using `mutate()`

 ```{r}
 # mutate allows us to add columns or modify existing ones
-pantheria <- mutate(pantheria,TrophicLevel = tolower(TrophicLevel))

-pantheria %>%
-  mutate(AdultBodyMass_g = gsub("\\.", "-",AdultBodyMass_g))
 ```

+
 ## Exercise 5: Summarizing data

 Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.

 ```{r}
-pantheria %>%
-  group_by(TrophicLevel) %>%
-  summarize(Mean = mean(AdultBodyMass_g),
-            Median = median(AdultBodyMass_g),
-            `Standard Deviation` = sd(AdultBodyMass_g),
-            Min = min(AdultBodyMass_g),
-            Max = max(AdultBodyMass_g))
+
 ```


+
 ## Exercise 6: Plotting Data  

 According to the table above, body masses have a really wide range across trophic levels. Let's visualize the distribution of adult body masses.
@ -156,29 +116,13 @@ According to the table above, body masses have a really wide range across trophi
 # ggplot2 constructs graphics in layers, each layer is separated by "+"
 # x and y values are supplied in aes(), the type of plot is specified using
 # the "geom" functions
-ggplot(data = pantheria, mapping = aes(x = log10(AdultBodyMass_g))) +
-  geom_histogram(bins = 50, fill ="#CE3274") +
-  xlab(label = "log10 Adult Body Mass") +
-  ylab(label = "Frequency") +
-  ggtitle(label = "Histogram of Adult Body Mass") +
-  theme_classic() +
-  theme(plot.title = element_text(hjust = .5, face = "bold"),
-        axis.title.x = element_text(face = "bold"))

 ```

-The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
+The data looks skewed even after log2 transformation. Let's view the distribution by trophic level.

 ```{r}
-pantheria %>%
-  ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) +
-  geom_histogram(bins = 50) +
-  facet_grid(rows = vars(TrophicLevel)) +
-  xlab(label = "log10 Adult Body Mass") +
-  ylab(label = "Frequency") +
-  ggtitle(label = "Histogram of Adult Body Mass") +
-  theme_classic() +
-  theme(plot.title = element_text(hjust = .5))
+
 ```

 It is clear that trophic level does have an impact on the distribution of adult body mass, carnivores tend to be smaller which makes sense because carnivores have higher metabolic demands and so there might be a selection pressure towards smaller carnivores. If we wanted to confirm this by fitting a model, we could use the `lm()` function to fit a linear model.   
@ -189,10 +133,8 @@ An important caveat of the data is that some Orders of mammals are more biodiver

 ```{r}
 # Your code 
-pantheria %>%
-  group_by(Order) %>%
-  summarise(n = (n()/nrow(pantheria)) * 100) %>%
-  arrange(desc(n))
+
+
 ```


@ -232,49 +174,47 @@ pantheria %>%

 *Exercise 7.1 hint: group_by + summarize(n = n()) + arrange* 

-## Exercise 7.2: Hands on coding   
+## Exercise 7.2: Visualize Top Orders

 Now that we see what the over represented Orders are, we can plot their body masses by trophic level to see if they are skewing the overall distributions.


 ```{r}

-# Character vector of the top 2 Orders from above
-top_orders <- c("Rodentia", "Chiroptera")
-# filter uses a conditional to select rows from the data
-
-pantheria %>%
-  filter(Order %in% top_orders) %>%
-  ggplot(mapping = aes(x = log10(AdultBodyMass_g), fill = Order)) +
-  geom_histogram(bins = ) +
-  facet_grid(rows = vars(TrophicLevel),
-             cols = vars(Order))

 ```

 It looks like one of them is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.

 ```{r}
-pantheria %>%
-  filter(Order != "Chiroptera") %>%
-  ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) +
-  geom_histogram(bins = 40 ) +
-  facet_grid(rows = vars(TrophicLevel))

-pantheria %>%
-  ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel))
 ```

-We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can generate a full report that contains all of the text and code, click `Knit` to render the HTML report.   
+We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*.  
+
+
+## Exercise 7.3: Fit a linear model
+
+```{r}
+
+```
+
+
+```{r}
+
+```
+
+
+Now that we have generated these plots and results, we can generate a full report that contains all of the text and code, click `Knit` to render the HTML report.  

 ## End of workshop exercises   

 Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built in datasets you can use for practice:
 ```{r, echo=FALSE}
 available_datasets <- data() 
-available_datasets$results %>%
-  as_tibble() %>%
-  select(-LibPath) %>% knitr::kable()
+available_datasets$results |>
+  as_tibble() |>
+  select(-LibPath) |> knitr::kable()
 ```


--- a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html
+++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html
--- a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.Rmd
+++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.Rmd
@ -10,9 +10,9 @@ knitr::opts_chunk$set(echo = TRUE)
 # Load packages
 library(dplyr)    # tidyverse data frame manipulation package
 library(tidyr)    # functions to help clean data
-library(magrittr) # this package provides the pipe operator %>%
 library(readxl)   # read excel files
 library(ggplot2)  # highly customizable plots
+library(broom)    # clean up linear model results
 ```

 ## R Markdown
@ -30,9 +30,9 @@ hist(val, breaks = 20)

 ```{r}
 # The ggplot version of the same plot
-  ggplot(data = tibble(values = val),
-         mapping = aes(x = values))+
-  geom_histogram(bins = 20)
+ggplot(data = tibble(values = val),
+       mapping = aes(x = values)) +
+geom_histogram(bins = 20)
 ```


@ -81,18 +81,18 @@ We will exploring adult body mass from these mammals as it relates to their trop

 Let's start by subsetting the data with `select()`
 ```{r}
-# Pipes (%>%) work by passing the data in front of the pipe to the first argument
+# Pipes (|>) work by passing the data in front of the pipe to the first argument
 # of the function after it, this prevents a lot of nested function calls and makes
 # code easier to read.

-pantheria <- pantheria %>%     # Passes pantheria as the first argument of select
+pantheria <- pantheria |>     # Passes pantheria as the first argument of select
  select(Order,   
         Family,               # select returns the specified columns
         Genus,                
         Species,
         TrophicLevel,
-         AdultBodyMass_g) %>%
-  drop_na() %>%                # Remove any rows that have NAs
+         AdultBodyMass_g) |>
+  drop_na() |>                # Remove any rows that have NAs
  distinct()                   # Remove any duplicate rows
 ```

@ -105,7 +105,7 @@ Let's fix the TrophicLevel column using `mutate()`

 ```{r}
 # mutate allows us to add columns or modify existing ones
-pantheria <- pantheria %>%
+pantheria <- pantheria |>
  mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
 ```

@ -115,13 +115,13 @@ pantheria <- pantheria %>%
 Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.

 ```{r}
-pantheria %>%
-  group_by(TrophicLevel) %>%                            # Group observations by this column
+pantheria |>
+  group_by(TrophicLevel) |>                            # Group observations by this column
  summarize(Mean = mean(AdultBodyMass_g),               # Summarize will calculate these group wise
            `Standard Deviation` = sd(AdultBodyMass_g), # Quasi quotation lets us add spaces to column names
            Min = min(AdultBodyMass_g),
-            Max = max(AdultBodyMass_g)) %>%
-  ungroup() %>%
+            Max = max(AdultBodyMass_g)) |>
+  ungroup() |>
  arrange(desc(Mean))                                   # Order the data frame by descending mean body mass
 ```

@ -137,24 +137,24 @@ According to the table above, body masses have a really wide range across trophi
 # the "geom" functions

 ggplot(data = pantheria,                             # input data
-       mapping =  aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
+       mapping =  aes(x = log2(AdultBodyMass_g))) + # log2 transform adult body mass
  geom_histogram(fill = "#CE3274",                   # type of plot
                 bins = 40) + 
-  xlab(label = "log10 Adult Body Mass (g)") +        # x label
+  xlab(label = "log2 Adult Body Mass (g)") +        # x label
  ylab(label = "Frequency") +                        # y label
-  labs(title = "Histogram of log10 Adult Body Mass") # title 
+  labs(title = "Histogram of log2 Adult Body Mass") # title 
 ```

-The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
+The data looks skewed even after log2 transformation. Let's view the distribution by trophic level.

 ```{r}
-pantheria %>%
-  ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
+pantheria |>
+  ggplot(aes(x = log2(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
  geom_histogram(bins = 40) +
  facet_grid(rows = vars(TrophicLevel)) +                        # Split the plot into rows by trophic level
  ylab(label = "Frequency") + 
-  xlab(label = "log10 Adult Body Mass (g)") + 
-  labs(title = "Histograms of log10 Adult Body Mass") +
+  xlab(label = "log2 Adult Body Mass (g)") + 
+  labs(title = "Histograms of log2 Adult Body Mass") +
  theme(plot.title = element_text(hjust = 0.5))                  # Center the plot title
 ```

@ -166,9 +166,9 @@ An important caveat of the data is that some Orders of mammals are more biodiver

 ```{r}
 # Your code 
-pantheria %>%
-  group_by(Order) %>%
-  summarise(n = n()) %>%
+pantheria |>
+  group_by(Order) |>
+  summarise(n = n()) |>
  arrange(desc(n))

 ```
@ -210,7 +210,7 @@ pantheria %>%

 *Exercise 7.1 hint: group_by + summarize(n = n()) + arrange* 

-## Exercise 7.2: Hands on coding   
+## Exercise 7.2: Visualize Top Orders

 Now that we see what the over represented Orders are, we can plot their body masses by trophic level to see if they are skewing the overall distributions.

@ -219,40 +219,66 @@ Now that we see what the over represented Orders are, we can plot their body mas

 top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
 # filter uses a conditional to select rows from the data
-pantheria %>%
-  filter(Order %in% top_orders) %>%
-  ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
+pantheria |>
+  filter(Order %in% top_orders) |>
+  ggplot(aes(x = log2(AdultBodyMass_g), fill = Order)) +
  geom_histogram(bins = 40) +
  facet_grid(rows = vars(TrophicLevel),
             cols = vars(Order)) +
  ylab(label = "Frequency") + 
-  xlab(label = "log10 Adult Body Mass (g)") + 
+  xlab(label = "log2 Adult Body Mass (g)") + 
  theme(plot.title = element_text(hjust = 0.5))   
 ```

 It looks like one of them is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.

 ```{r}
-pantheria %>%
-  filter(Order != "Chiroptera") %>%
-  ggplot(aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
+pantheria |>
+  filter(Order != "Chiroptera") |>
+  ggplot(aes(x = log2(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
  geom_histogram(bins = 40) +
  facet_grid(rows = vars(TrophicLevel)) +                       # Split the plot into rows by trophic level
  ylab(label = "Frequency") + 
-  xlab(label = "log10 Adult Body Mass (g)") + 
-  labs(title = "Histograms of log10 Adult Body Mass")
+  xlab(label = "log2 Adult Body Mass (g)") + 
+  labs(title = "Histograms of log2 Adult Body Mass")
 ```

-We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can generate a full report that contains all of the text and code, click `Knit` to render the HTML report.   
+We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*.  
+
+
+## Exercise 7.3: Fit a linear model
+
+```{r}
+# Set the reference level to herbivore
+pantheria <- pantheria |>
+  mutate(TrophicLevel = factor(TrophicLevel,
+                               levels = c("herbivore","omnivore","carnivore")))
+
+lm(data = pantheria, log2(AdultBodyMass_g) ~ TrophicLevel) |>
+  tidy()
+```
+
+
+```{r}
+# Remove Chiroptera
+pantheria <- pantheria |>
+  filter(Order != "Chiroptera")
+
+lm(data = pantheria, log2(AdultBodyMass_g) ~ TrophicLevel) |>
+  tidy()
+```
+
+
+Now that we have generated these plots and results, we can generate a full report that contains all of the text and code, click `Knit` to render the HTML report.  

 ## End of workshop exercises   

 Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built in datasets you can use for practice:
 ```{r, echo=FALSE}
 available_datasets <- data() 
-available_datasets$results %>%
-  as_tibble() %>%
-  select(-LibPath) %>% knitr::kable()
+available_datasets$results |>
+  as_tibble() |>
+  select(-LibPath) |> knitr::kable()
 ```


--- a/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.html
+++ b/intro-r-data-analysis/Intro_to_R_workshop_materials/part_2_filled_out.html