finish up part 2

Natalie Elphick 2024-01-18 10:45:39 -08:00
parent b16ecd33a9
commit 6380a82826
12 changed files with 13664 additions and 225 deletions

View file

@ -2562,7 +2562,7 @@ class CountdownTimer {
<section>
<h1 class="title">Introduction to R Data Analysis - Part 1</h1>
<h2 class="author">Natalie Elphick</h2>
<h3 class="date">January 22nd</h3>
<h3 class="date">January 22nd, 2024</h3>
</section>
<section id="section" class="slide level2">
@ -2593,16 +2593,15 @@ Matlab etc.)</li>
<section id="part-1" class="slide level2">
<h2>Part 1:</h2>
<ol type="1">
<li><p>What is R and why should you use it?</p></li>
<li><p>The RStudio interface</p></li>
<li><p>File types</p></li>
<li><p>Error messages</p></li>
<li><p>Variables</p></li>
<li><p>Types &amp; data structures</p>
<p><em>10 min break</em></p></li>
<li><p>Math and logic operations</p></li>
<li><p>Functions and libraries</p></li>
<li><p>Reading data into R</p></li>
<li>What is R and why should you use it?</li>
<li>The RStudio interface</li>
<li>File types</li>
<li>Error messages</li>
<li>Variables</li>
<li>Types &amp; data structures</li>
<li>Math and logic operations</li>
<li>Functions and libraries</li>
<li>Reading data into R</li>
</ol>
</section>
<section>
@ -2732,13 +2731,13 @@ and periods</li>
one style of names</li>
</ul>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Snake case</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>dog_breeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>,<span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>dog_breeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>, <span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="co"># Period separated</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>dog.breeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>,<span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>dog.breeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>, <span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="co"># Camel case</span></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a>DogBreeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>,<span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span></code></pre></div>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a>DogBreeds <span class="ot">&lt;-</span> <span class="fu">c</span>(<span class="st">&quot;Labrador Retriever&quot;</span>, <span class="st">&quot;Akita&quot;</span>, <span class="st">&quot;Bulldog&quot;</span>)</span></code></pre></div>
</section>
<section id="poll-2" class="slide level2">
<h2>Poll 2</h2>
@ -2821,7 +2820,7 @@ types/structures</strong> (ex. nested lists)</li>
<section id="min-break" class="title-slide slide level1">
<h1>10 min break</h1>
<center>
<div class="countdown" id="timer_0b16aa2b" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:5%;padding:50px;font-size:5em;position: relative; width: min-content;">
<div class="countdown" id="timer_386e6f50" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:5%;padding:50px;font-size:5em;position: relative; width: min-content;">
<div class="countdown-controls"><button class="countdown-bump-down"></button><button class="countdown-bump-up">+</button></div>
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>
@ -2966,131 +2965,6 @@ RNA-Seq Analysis Using R</a>
<li>Feb 1, 2024 (9:30am-12:00pm)</li>
</ul></li>
</ol>
</section></section>
<section>
<section id="chatgpt-tips-for-r" class="title-slide slide level1">
<h1>ChatGPT Tips for R</h1>
</section>
<section id="general-tips" class="slide level2">
<h2>General Tips</h2>
<ul>
<li>Always confirm ChatGPT's outputs are correct</li>
<li>Provide as much detail as possible about the problem in the 1st
prompt</li>
<li>Use separate chats for separate tasks/projects</li>
<li>Try the Custom Instructions function that adds additional
information to every prompt</li>
<li>Can visit webpages (GPT 4 only), which can help get more specific
answers</li>
</ul>
</section>
<section id="code-tips" class="slide level2">
<h2>Code Tips</h2>
<ul>
<li>Commented R code yields better responses in my experience</li>
<li>Provide the code and error message in the same prompt</li>
<li>ChatGPT can work well to convert syntax and improve your code:
<ul>
<li>“Turn this loop into a function : [your code]”</li>
<li>“Is there a better way to do this : [your code]”</li>
</ul></li>
<li>Check out the file:
<code>example_code/1_convert_syntax_example.R</code> for an example use
case</li>
</ul>
</section></section>
<section>
<section id="finding-r-packages" class="title-slide slide level1">
<h1>Finding R Packages</h1>
</section>
<section id="key-questions" class="slide level2">
<h2>Key Questions</h2>
<ul>
<li>What assay was the package designed for?</li>
<li>When was the last release?</li>
<li>Is it maintained (frequent updates)?</li>
<li>Does it work on all operating systems?</li>
<li>Are other people using it? (citations)</li>
<li>Do they respond to github issues?</li>
<li>Is there a benchmarking paper?</li>
</ul>
</section>
<section id="bioconductor-and-cran" class="slide level2">
<h2>BioConductor and CRAN</h2>
<ul>
<li><p>Both of these have stringent requirements for packages they host
(eg. for BioConductor they have to run on all major operating
systems)</p></li>
<li><p>Prefer BioConductor packages if available over CRAN</p></li>
<li><p>Prefer CRAN packages over ones only hosted on GitHub</p></li>
</ul>
</section>
<section id="start-with-the-assay" class="slide level2">
<h2>Start with the Assay</h2>
<ul>
<li>Click <a href="https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing">here</a>
to go to BioC views</li>
<li>Pick the assay you want to analyse</li>
<li>Pick the type of analysis you want to do</li>
<li>Find a package that does it</li>
<li>Find benchmarking papers to narrow the list of packages down</li>
<li>Find the vignette on the package page and refer to the manual for
any questions not covered by it</li>
</ul>
</section></section>
<section>
<section id="additional-resources" class="title-slide slide level1">
<h1>Additional Resources</h1>
</section>
<section id="r-1" class="slide level2">
<h2>R</h2>
<ul>
<li><p><a href="https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html">R
Markdown: The Definitive Guide</a> : Excellent R markdown
reference</p></li>
<li><p><a href="https://r4ds.hadley.nz/">R for Data Science</a></p></li>
<li><p><a href="https://ggplot2-book.org/">ggplot2: elegant graphics for
data analysis</a></p></li>
<li><p><a href="https://adv-r.hadley.nz/">Advanced R</a></p></li>
</ul>
</section>
<section id="statistics" class="slide level2">
<h2>Statistics</h2>
<ul>
<li><a href="https://bookdown.org/steve_midway/DAR">Data Analysis in
R</a> : This book has more statistics details than <em>R for Data
Science</em></li>
<li><a href="https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html">Generalized
Linear Models</a><br />
</li>
<li><a href="https://bookdown.org/steve_midway/DAR/random-effects.html">Random
Effects</a></li>
</ul>
</section>
<section id="rna-seq-analysis" class="slide level2">
<h2>RNA-seq Analysis</h2>
<ul>
<li><a href="https://rnaseq.uoregon.edu/">RNA-seqlopedia</a> :
Everything you need to know about RNA-seq experiments</li>
<li><a href="https://luisvalesilva.com/datasimple/rna-seq_units.html">RNA-seq
Expression Units</a> : Blog post on understanding common units</li>
<li><a href="https://bioconductor.org/books/3.17/OSCA.intro/index.html">Introduction
to Single-Cell Analysis with Bioconductor</a> : Covers the basics of
scRNA-seq analysis in R</li>
</ul>
</section>
<section id="dimensional-reduction" class="slide level2">
<h2>Dimensional Reduction</h2>
<ul>
<li><a href="https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/">Tutorial
on PCA</a> : PCA explained with R code examples</li>
<li><a href="https://pair-code.github.io/understanding-umap/">Understanding
UMAP</a> : Short explanation with great visualizations, mainly useful
for scRNA-seq analysis</li>
</ul>
</section></section>
</div>
</div>

File diff suppressed because one or more lines are too long

View file

@ -8,6 +8,7 @@
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_Unix_Part_1.html">Introduction to Unix - Part 1</a></li>
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_Unix_Part_2.html">Introduction to Unix - Part 2</a></li>
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_R_data_analysis_part_1.html">Introduction to R Data Analysis - Part 1</a></li>
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_R_data_analysis_part_2.html">Introduction to R Data Analysis - Part 2</a></li>
</ul>
</body>
</html>

View file

@ -1,7 +1,7 @@
---
title: "Introduction to R Data Analysis - Part 1"
author: "Natalie Elphick"
date: "January 22nd"
date: "January 22nd, 2024"
knit: (function(input, ...) {
rmarkdown::render(
input,
@ -47,8 +47,8 @@ Bioinformatician I
5. Variables
6. Types & data structures
7. Math and logic operations
8. Functions and libraries
9. Reading data into R
8. Functions and packages
# What is R?
@ -142,13 +142,13 @@ to one style of names
```{r}
# Snake case
dog_breeds <- c("Labrador Retriever","Akita", "Bulldog")
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")
# Period separated
dog.breeds <- c("Labrador Retriever","Akita", "Bulldog")
dog.breeds <- c("Labrador Retriever", "Akita", "Bulldog")
# Camel case
DogBreeds <- c("Labrador Retriever","Akita", "Bulldog")
DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
```
## Poll 2
@ -329,81 +329,3 @@ packages
- Feb 1, 2024 (9:30am-12:00pm)
# ChatGPT Tips for R
## General Tips
- Always confirm ChatGPT's outputs are correct
- Provide as much detail as possible about the problem in the 1st prompt
- Use separate chats for separate tasks/projects
- Try the 'Custom Instructions' function that adds additional information to every prompt
- Can visit webpages (GPT 4 only), which can help get more specific answers
## Code Tips
- Commented R code yields better responses in my experience
- Provide the code and error message in the same prompt
- ChatGPT can work well to convert syntax and improve your code:
- "Turn this loop into a function : [your code]"
- "Is there a better way to do this : [your code]"
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
# Finding R Packages
## Key Questions
- What assay was the package designed for?
- When was the last release?
- Is it maintained (frequent updates)?
- Does it work on all operating systems?
- Are other people using it? (citations)
- Do they respond to github issues?
- Is there a benchmarking paper?
## BioConductor and CRAN
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
- Prefer BioConductor packages if available over CRAN
- Prefer CRAN packages over ones only hosted on GitHub
## Start with the Assay
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
- Pick the assay you want to analyse
- Pick the type of analysis you want to do
- Find a package that does it
- Find benchmarking papers to narrow the list of packages down
- Find the vignette on the package page and refer to the manual for any questions not covered by it
# Additional Resources
## R
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [R for Data Science](https://r4ds.hadley.nz/)
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
- [Advanced R](https://adv-r.hadley.nz/)
## Statistics
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\
- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html)
## RNA-seq Analysis
- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments
- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units
- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R
## Dimensional Reduction
- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis

View file

@ -0,0 +1,348 @@
---
title: "Introduction to R Data Analysis - Part 2"
author: "Natalie Elphick"
date: "January 23rd, 2024"
knit: (function(input, ...) {
rmarkdown::render(
input,
output_dir = "../docs"
)
})
output:
revealjs::revealjs_presentation:
theme: simple
css: style.css
---
```{r, setup, include=FALSE}
library(kableExtra)
library(tidyverse)
library(readxl)
theme_set(theme_grey(base_size = 16))
```
##
<center>*Press the ? key for tips on navigating these slides*</center>
# Schedule
1. Introduction to Tidyverse
2. Filtering and reformatting data
3. Plotting data
4. Hands on data analysis
# Introduction to Tidyverse
## Tidyverse
- The tidyverse packages work well together because they share
common data representations and design principles
- Rows = observations, columns = variables
- [ggplot2](), for data visualization.
- [dplyr](), for data manipulation.
- [tidyr](), for data tidying.
- [readr](), for data import.
- [purrr](), for iteration.
- and more..
## dplyr
- Offers a common “grammar” of functions for data manipulation
- [mutate()](https://dplyr.tidyverse.org/reference/mutate.html) adds new variables that are functions of existing
columns
- [select()](https://dplyr.tidyverse.org/reference/select.html) picks columns based on their names
- [filter()](https://dplyr.tidyverse.org/reference/filter.html) picks rows based on their values
- [summarise()](https://dplyr.tidyverse.org/reference/summarise.html) reduces multiple values down to a single summary
- [arrange()](https://dplyr.tidyverse.org/reference/arrange.html) changes the ordering of the rows
- [group_by()](https://dplyr.tidyverse.org/reference/group_by.html) allows any operation to be done “by group”
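
As a quick illustration of the first verb in the list above, here is a minimal sketch of `mutate()` on the `mpg` data frame from ggplot2. The names `mpg_ratio` and `hwy_cty_ratio` are made up for this example and are not part of the workshop materials.

```{r}
library(dplyr)
library(ggplot2) # provides the mpg data frame

# Add a new column computed from two existing columns
mpg_ratio <- mutate(.data = mpg,
                    hwy_cty_ratio = hwy / cty)

# Keep a few columns so the new one is easy to see
head(select(.data = mpg_ratio,
            manufacturer, model, cty, hwy, hwy_cty_ratio))
```

The original columns are left unchanged; `mutate()` returns a new data frame with the extra column appended.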
## Example Dataframe
- mpg is a dataframe built into the ggplot2 package
```{r, eval = FALSE}
head(mpg)
```
```{r, echo = FALSE}
head(mpg) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Select Columns
```{r, eval = FALSE}
select(.data = mpg,
year, cty, hwy, manufacturer)
```
```{r, echo = FALSE}
select(.data = mpg,
year, cty, hwy, manufacturer) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Filter Rows
```{r, eval = FALSE}
filter(.data = mpg,
year == 2008)
```
```{r, echo = FALSE}
filter(.data = mpg,
year == 2008) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Arrange Rows
- desc() arranges rows in descending order; the default is ascending
```{r, eval = FALSE}
arrange(.data = mpg,
desc(cyl))
```
```{r, echo = FALSE}
arrange(.data = mpg,
desc(cyl)) |>
head(n = 3) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Summarising data
- The dplyr **summarize()** function computes a table of
summaries for a data frame
- **group_by()** groups the input data frame by the specified
variable(s)
- Combining these two allows us to easily create summaries for
different categorical groupings
## Group and Summarise
```{r, eval = FALSE}
summarise(group_by(.data = mpg,
manufacturer),
mean_cty = mean(cty),
median_cty = median(cty))
```
```{r, echo = FALSE}
summarise(group_by(.data = mpg,
manufacturer),
mean_cty = mean(cty),
median_cty = median(cty)) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## The pipe operator |>
- Allows "chaining" of function calls to make code more readable
```{r, eval = FALSE}
mpg |>
group_by(manufacturer) |>
summarise(mean_cty = mean(cty),
median_cty = median(cty))
```
```{r, echo = FALSE}
mpg |>
group_by(manufacturer) |>
summarise(mean_cty = mean(cty),
median_cty = median(cty)) |>
head(n = 4) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
# Plotting
## ggplot2
- The most popular tidyverse package
- Create publication quality, highly customizable plots
- See the [R graph gallery](https://r-graph-gallery.com/index.html) for examples
- ggplots use “layers” to build, modify and overlap visualizations
- Layers are added using the + symbol and can be added to an existing ggplot
- Many popular packages output ggplots which can then be easily modified by adding layers
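
A minimal sketch of adding a layer to an existing ggplot, assuming the built-in `mpg` data frame. The object names `base_plot` and `with_trend` are made up for illustration.

```{r, fig.dim=c(6,4)}
library(ggplot2)

# Store a plot with a single point layer in a variable
base_plot <- ggplot(data = mpg,
                    mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Add a second layer (a fitted straight line) to the stored plot with +
with_trend <- base_plot +
  geom_smooth(method = "lm")

with_trend
```

The same pattern applies to ggplots returned by other packages: store the result in a variable, then add layers to it.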
## Creating ggplots
<br>
</br>
![Plotting](assets/plotting.png)
## Plot Example
```{r, fig.dim=c(6,4)}
ggplot(data = mpg, # Input dataframe
mapping = aes(x = cty, y = hwy)) + # Aesthetic mapping
geom_point() # Point graph
```
## Adding and Modifying Layers
```{r, fig.dim=c(10,4)}
ggplot(data = mpg,
mapping = aes(x = class, y = cty, fill = class)) +
geom_violin() +
geom_boxplot(width = 0.1,
fill = "white")
```
# 10 min break
<center>
```{r, echo=FALSE}
countdown::countdown(minutes = 10,
seconds = 0,
color_border = "black",
padding = "50px",
margin = "5%",
font_size = "5em",
style = "position: relative; width: min-content;")
```
</center>
# Hands-on Data Analysis
## Dataset Description
- PanTHERIA
- A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature
- Developed for macroecological and macroevolutionary research projects
- Data is organized by taxonomic rank
## Taxonomic Rank
![Taxonomy](assets/Taxonomic_Rank_Graph.svg)
## Data Preview
```{r, echo = FALSE}
read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Hands-on Analysis
- Open part_2.Rmd
# ChatGPT Tips for R
## General Tips
- Always confirm ChatGPT's outputs are correct
- Provide as much detail as possible about the problem in the 1st prompt
- Use separate chats for separate tasks/projects
- Try the 'Custom Instructions' function that adds additional information to every prompt
- Can visit webpages (GPT 4 only), which can help get more specific answers
## Code Tips
- Commented R code yields better responses in my experience
- Provide the code and error message in the same prompt
- ChatGPT can work well to convert syntax and improve your code:
- "Turn this loop into a function : [your code]"
- "Is there a better way to do this : [your code]"
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
# Finding R Packages
## Key Questions
- What assay was the package designed for?
- When was the last release?
- Is it maintained (frequent updates)?
- Does it work on all operating systems?
- Are other people using it? (citations)
- Do they respond to github issues?
- Is there a benchmarking paper?
## BioConductor and CRAN
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
- Prefer BioConductor packages if available over CRAN
- Prefer CRAN packages over ones only hosted on GitHub
## Start with the Assay
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
- Pick the assay you want to analyse
- Pick the type of analysis you want to do
- Find a package that does it
- Find benchmarking papers to narrow the list of packages down
- Find the vignette on the package page and refer to the manual for any questions not covered by it
# Additional Resources
## R
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [R for Data Science](https://r4ds.hadley.nz/)
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
- [Advanced R](https://adv-r.hadley.nz/)
## Statistics
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\
- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html)
## RNA-seq Analysis
- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments
- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units
- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R
## Dimensional Reduction
- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis
# End of Part 2
## Workshop survey
- Please fill out our [workshop survey](https://www.surveymonkey.com/r/F75J6VZ) so we can continue to improve these workshops
## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0)
- Jan 25, 2024 (Session 1 - 10am-12pm) (Session 2 - 1pm-3pm)
- Jan 26, 2024 (Session 3 - 10am-12pm)
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
- Feb 1, 2024 (9:30am-12:00pm)

File diff suppressed because one or more lines are too long

View file

@ -0,0 +1,259 @@
---
title: "Intro to R Data Analysis: Part 2"
output: html_document # knitr report document type
date: "`r Sys.Date()`" # This will update the date every time you knit the doc
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Load packages
library(dplyr) # tidyverse data frame manipulation package
library(tidyr) # functions to help clean data
library(magrittr) # this package provides the pipe operator %>%
library(readxl) # read excel files
library(ggplot2) # highly customizable plots
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. Guide to markdown syntax <https://www.markdownguide.org/basic-syntax/>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r}
# Simulates 100 observations from a normal distribution
# and plots a histogram
val <- rnorm(n = 100)
hist(val, breaks = 20)
```
```{r}
# The ggplot version of the same plot
ggplot(data = tibble(values = val),
mapping = aes(x = values))+
geom_histogram(bins = 20)
```
**Important: before running the code below, click Session -> Set Working Directory -> To Source File Location**
## Exercise 3: Reading in Data
The data we will be analyzing is from the PanTHERIA database which is "a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects."
```{r}
# The data is spread across 3 sheets in an excel file. We need to
# combine these data into one table/data frame.
# na = "NA" tells read_xlsx that missing values appear as "NA" in the data;
# by default only empty cells are treated as missing. Run "?read_xlsx" for more info
sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
# rbind (row-bind) combines data frames by row
pantheria <- rbind(sheet1, sheet2, sheet3)
```
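
To see what `rbind()` does on a smaller scale, here is a sketch with two tiny data frames. The names and values below are made up for illustration and are not taken from PanTHERIA.

```{r}
# Two small data frames with the same columns (made-up values)
part1 <- data.frame(Genus = c("GenusA", "GenusB"),
                    Mass_g = c(10, 25))
part2 <- data.frame(Genus = "GenusC",
                    Mass_g = 40)

# rbind stacks the rows; the column names must match
combined <- rbind(part1, part2)
nrow(combined) # 3 rows: 2 from part1 + 1 from part2
```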
```{r}
# How many rows and columns are there?
nrow(pantheria)
ncol(pantheria)
```
```{r}
# What does the data look like?
head(pantheria)
```
## Exercise 4: Filtering and Reformatting Data
We will be exploring adult body mass from these mammals as it relates to their trophic level using `dplyr` and `ggplot2`. Download the cheatsheets for these packages at the following links:
* [dplyr cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf)
* [ggplot2 cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf)
Let's start by subsetting the data with `select()`
```{r}
# Pipes (%>%) pass the data in front of the pipe to the first argument
# of the function after it. This avoids deeply nested function calls and makes
# code easier to read.
pantheria <- pantheria %>% # Passes pantheria as the first argument of select
select(Order,
Family, # select returns the specified columns
Genus,
Species,
TrophicLevel,
AdultBodyMass_g) %>%
drop_na() %>% # Remove any rows that have NAs
distinct() # Remove any duplicate rows
```
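
One way to check how the pipe rewrites a call is to compare it with the nested version. This sketch uses the `mpg` data frame from ggplot2 rather than `pantheria` so that it runs on its own; the object names `nested` and `piped` are made up for illustration.

```{r}
library(dplyr)
library(magrittr)
library(ggplot2) # only needed here for the mpg data frame

# Nested call: read from the inside out
nested <- distinct(select(mpg, manufacturer, class))

# Piped call: read from top to bottom
piped <- mpg %>%
  select(manufacturer, class) %>%
  distinct()

# Both versions should produce the same data frame
all.equal(nested, piped)
```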
Data is almost never clean. For example, there should be only 3 trophic levels:
```{r}
unique(pantheria$TrophicLevel) # unique elements of a vector
```
Let's fix the TrophicLevel column using `mutate()`
```{r}
# mutate allows us to add columns or modify existing ones
pantheria <- pantheria %>%
mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
```
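
To see why lowercasing removes the extra categories, here is a sketch on a made-up character vector. The actual values in `TrophicLevel` may differ from these.

```{r}
# Made-up labels with inconsistent capitalization
levels_raw <- c("herbivore", "Herbivore", "carnivore", "Carnivore", "omnivore")

length(unique(levels_raw))          # 5 distinct strings
length(unique(tolower(levels_raw))) # 3 after lowercasing
```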
## Exercise 5: Summarizing data
Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.
```{r}
pantheria %>%
group_by(TrophicLevel) %>% # Group observations by this column
summarize(Mean = mean(AdultBodyMass_g), # Summarize will calculate these group wise
`Standard Deviation` = sd(AdultBodyMass_g), # Backticks let us use column names that contain spaces
Min = min(AdultBodyMass_g),
Max = max(AdultBodyMass_g)) %>%
ungroup() %>%
arrange(desc(Mean)) # Order the data frame by descending mean body mass
```
## Exercise 6: Plotting Data
According to the table above, body masses have a really wide range across trophic levels. Let's visualize the distribution of adult body masses.
```{r}
# ggplot2 constructs graphics in layers, each layer is separated by "+"
# x and y values are supplied in aes(), the type of plot is specified using
# the "geom" functions
ggplot(data = pantheria, # input data
mapping = aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
geom_histogram(fill = "#CE3274", # type of plot
bins = 40) +
xlab(label = "log10 Adult Body Mass (g)") + # x label
ylab(label = "Frequency") + # y label
labs(title = "Histogram of log10 Adult Body Mass") # title
```
The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
```{r}
pantheria %>%
ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
ylab(label = "Frequency") +
xlab(label = "log10 Adult Body Mass (g)") +
labs(title = "Histograms of log10 Adult Body Mass") +
theme(plot.title = element_text(hjust = 0.5)) # Center the plot title
```
The plots suggest that trophic level affects the distribution of adult body mass: carnivores tend to be smaller. This makes sense because carnivores have higher metabolic demands, so there might be selection pressure towards smaller carnivores. If we wanted to confirm this by fitting a model, we could use the `lm()` function to fit a linear model.
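
Here is a minimal sketch of what such a model call looks like. It uses R's built-in `PlantGrowth` dataset (plant weight by treatment group) as a stand-in, because the workshop does not specify a model for `pantheria`. One possible analogue for our data, stated as an assumption rather than a recommendation, would be a formula like `log10(AdultBodyMass_g) ~ TrophicLevel`.

```{r}
# Fit a linear model of a numeric response on a categorical predictor
fit <- lm(weight ~ group, data = PlantGrowth)

# The intercept is the mean of the reference group (ctrl);
# the other coefficients are differences from that group
summary(fit)
```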
## Exercise 7.1: Hands on coding
An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore overrepresented in the dataset. Using the `dplyr` cheatsheet, write code that generates a table of Orders and the percentage of the data each one makes up. Scroll down to see the hint if you are having trouble.
```{r}
# Your code
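pantheria %>%
group_by(Order) %>%
summarise(n = n()) %>% # Number of rows in each Order
mutate(percent = 100 * n / sum(n)) %>% # Share of all rows in each Order
arrange(desc(n))
```
*Exercise 7.1 hint: group_by + summarize(n = n()) + mutate + arrange*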
## Exercise 7.2: Hands on coding
Now that we see which Orders are overrepresented, we can plot their body masses by trophic level to see if they are skewing the overall distributions.
```{r}
top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
# filter uses a conditional to select rows from the data
pantheria %>%
filter(Order %in% top_orders) %>%
ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel),
cols = vars(Order)) +
ylab(label = "Frequency") +
xlab(label = "log10 Adult Body Mass (g)") +
theme(plot.title = element_text(hjust = 0.5))
```
It looks like one of them is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.
```{r}
pantheria %>%
filter(Order != "Chiroptera") %>%
ggplot(aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) + # Color by trophic level
geom_histogram(bins = 40) +
facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
ylab(label = "Frequency") +
xlab(label = "log10 Adult Body Mass (g)") +
labs(title = "Histograms of log10 Adult Body Mass")
```
We can now see that the body mass of carnivorous mammals is much less skewed than the initial plots suggested. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can generate a full report that contains all of the text and code: click `Knit` to render the HTML report.
## End of workshop exercises
Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built-in datasets you can use for practice:
```{r, echo=FALSE}
available_datasets <- data()
available_datasets$results %>%
as_tibble() %>%
select(-LibPath) %>% knitr::kable()
```

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


Binary file not shown.


View file

@ -14,7 +14,7 @@
.reveal pre code {
background-color: #d5d5d5 !important;
color: #333 !important;
font-size: 1.5em !important;
font-size: 1.25em !important;
}
/* Left-align all code outputs */
.reveal pre code {