mirror of
https://github.com/gladstone-institutes/Bioinformatics-Workshops.git
synced 2025-11-30 09:45:43 -08:00
finish up part 2
This commit is contained in:
parent
b16ecd33a9
commit
6380a82826
12 changed files with 13664 additions and 225 deletions
|
|
@ -2562,7 +2562,7 @@ class CountdownTimer {
|
|||
<section>
|
||||
<h1 class="title">Introduction to R Data Analysis - Part 1</h1>
|
||||
<h2 class="author">Natalie Elphick</h2>
|
||||
<h3 class="date">January 22nd</h3>
|
||||
<h3 class="date">January 22nd, 2024</h3>
|
||||
</section>
|
||||
|
||||
<section id="section" class="slide level2">
|
||||
|
|
@ -2593,16 +2593,15 @@ Matlab etc.)</li>
|
|||
<section id="part-1" class="slide level2">
|
||||
<h2>Part 1:</h2>
|
||||
<ol type="1">
|
||||
<li><p>What is R and why should you use it?</p></li>
|
||||
<li><p>The RStudio interface</p></li>
|
||||
<li><p>File types</p></li>
|
||||
<li><p>Error messages</p></li>
|
||||
<li><p>Variables</p></li>
|
||||
<li><p>Types & data structures</p>
|
||||
<p><em>10 min break</em></p></li>
|
||||
<li><p>Math and logic operations</p></li>
|
||||
<li><p>Functions and libraries</p></li>
|
||||
<li><p>Reading data into R</p></li>
|
||||
<li>What is R and why should you use it?</li>
|
||||
<li>The RStudio interface</li>
|
||||
<li>File types</li>
|
||||
<li>Error messages</li>
|
||||
<li>Variables</li>
|
||||
<li>Types & data structures</li>
|
||||
<li>Math and logic operations</li>
|
||||
<li>Functions and libraries</li>
|
||||
<li>Reading data into R</li>
|
||||
</ol>
|
||||
</section>
|
||||
<section>
|
||||
|
|
@ -2821,7 +2820,7 @@ types/structures</strong> (ex. nested lists)</li>
|
|||
<section id="min-break" class="title-slide slide level1">
|
||||
<h1>10 min break</h1>
|
||||
<center>
|
||||
<div class="countdown" id="timer_0b16aa2b" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:5%;padding:50px;font-size:5em;position: relative; width: min-content;">
|
||||
<div class="countdown" id="timer_386e6f50" data-update-every="1" tabindex="0" style="right:0;bottom:0;margin:5%;padding:50px;font-size:5em;position: relative; width: min-content;">
|
||||
<div class="countdown-controls"><button class="countdown-bump-down">−</button><button class="countdown-bump-up">+</button></div>
|
||||
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
|
||||
</div>
|
||||
|
|
@ -2966,131 +2965,6 @@ RNA-Seq Analysis Using R</a>
|
|||
<li>Feb 1, 2024 (9:30am-12:00pm)</li>
|
||||
</ul></li>
|
||||
</ol>
|
||||
</section></section>
|
||||
<section>
|
||||
<section id="chatgpt-tips-for-r" class="title-slide slide level1">
|
||||
<h1>ChatGPT Tips for R</h1>
|
||||
|
||||
</section>
|
||||
<section id="general-tips" class="slide level2">
|
||||
<h2>General Tips</h2>
|
||||
<ul>
|
||||
<li>Always confirm ChatGPT’s outputs are correct</li>
|
||||
<li>Provide as much detail as possible about the problem in the 1st
|
||||
prompt</li>
|
||||
<li>Use separate chats for separate tasks/projects</li>
|
||||
<li>Try the ‘Custom Instructions’ function that adds additional
|
||||
information to every prompt</li>
|
||||
<li>Can visit webpages (GPT 4 only), which can help get more specific
|
||||
answers</li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="code-tips" class="slide level2">
|
||||
<h2>Code Tips</h2>
|
||||
<ul>
|
||||
<li>Commented R code yields better responses in my experience</li>
|
||||
<li>Provide the code and error message in the same prompt</li>
|
||||
<li>ChatGPT can work well to convert syntax and improve your code:
|
||||
<ul>
|
||||
<li>“Turn this loop into a function : [your code]”</li>
|
||||
<li>“Is there a better way to do this : [your code]”</li>
|
||||
</ul></li>
|
||||
<li>Check out the file:
|
||||
<code>example_code/1_convert_syntax_example.R</code> for an example use
|
||||
case</li>
|
||||
</ul>
|
||||
</section></section>
|
||||
<section>
|
||||
<section id="finding-r-packages" class="title-slide slide level1">
|
||||
<h1>Finding R Packages</h1>
|
||||
|
||||
</section>
|
||||
<section id="key-questions" class="slide level2">
|
||||
<h2>Key Questions</h2>
|
||||
<ul>
|
||||
<li>What assay was the package designed for?</li>
|
||||
<li>When was the last release?</li>
|
||||
<li>Is it maintained (frequent updates)?</li>
|
||||
<li>Does it work on all operating systems?</li>
|
||||
<li>Are other people using it? (citations)</li>
|
||||
<li>Do they respond to github issues?</li>
|
||||
<li>Is there a benchmarking paper?</li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="bioconductor-and-cran" class="slide level2">
|
||||
<h2>BioConductor and CRAN</h2>
|
||||
<ul>
|
||||
<li><p>Both of these have stringent requirements for packages they host
|
||||
(eg. for BioConductor they have to run on all major operating
|
||||
systems)</p></li>
|
||||
<li><p>Prefer BioConductor packages if available over CRAN</p></li>
|
||||
<li><p>Prefer CRAN packages over ones only hosted on GitHub</p></li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="start-with-the-assay" class="slide level2">
|
||||
<h2>Start with the Assay</h2>
|
||||
<ul>
|
||||
<li>Click <a href="https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing">here</a>
|
||||
to go to BioC views</li>
|
||||
<li>Pick the assay you want to analyse</li>
|
||||
<li>Pick the type of analysis you want to do</li>
|
||||
<li>Find a package that does it</li>
|
||||
<li>Find benchmarking papers to narrow the list of packages down</li>
|
||||
<li>Find the vignette on the package page and refer to the manual for
|
||||
any questions not covered by it</li>
|
||||
</ul>
|
||||
</section></section>
|
||||
<section>
|
||||
<section id="additional-resources" class="title-slide slide level1">
|
||||
<h1>Additional Resources</h1>
|
||||
|
||||
</section>
|
||||
<section id="r-1" class="slide level2">
|
||||
<h2>R</h2>
|
||||
<ul>
|
||||
<li><p><a href="https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html">R
|
||||
Markdown: The Definitive Guide</a> : Excellent R markdown
|
||||
reference</p></li>
|
||||
<li><p><a href="https://r4ds.hadley.nz/">R for Data Science</a></p></li>
|
||||
<li><p><a href="https://ggplot2-book.org/">ggplot2: elegant graphics for
|
||||
data analysis</a></p></li>
|
||||
<li><p><a href="https://adv-r.hadley.nz/">Advanced R</a></p></li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="statistics" class="slide level2">
|
||||
<h2>Statistics</h2>
|
||||
<ul>
|
||||
<li><a href="https://bookdown.org/steve_midway/DAR">Data Analysis in
|
||||
R</a> : This book has more statistics details than <em>R for Data
|
||||
Science</em></li>
|
||||
<li><a href="https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html">Generalized
|
||||
Linear Models</a><br />
|
||||
</li>
|
||||
<li><a href="https://bookdown.org/steve_midway/DAR/random-effects.html">Random
|
||||
Effects</a></li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="rna-seq-analysis" class="slide level2">
|
||||
<h2>RNA-seq Analysis</h2>
|
||||
<ul>
|
||||
<li><a href="https://rnaseq.uoregon.edu/">RNA-seqlopedia</a> :
|
||||
Everything you need to know about RNA-seq experiments</li>
|
||||
<li><a href="https://luisvalesilva.com/datasimple/rna-seq_units.html">RNA-seq
|
||||
Expression Units</a> : Blog post on understanding common units</li>
|
||||
<li><a href="https://bioconductor.org/books/3.17/OSCA.intro/index.html">Introduction
|
||||
to Single-Cell Analysis with Bioconductor</a> : Covers the basics of
|
||||
scRNA-seq analysis in R</li>
|
||||
</ul>
|
||||
</section>
|
||||
<section id="dimensional-reduction" class="slide level2">
|
||||
<h2>Dimensional Reduction</h2>
|
||||
<ul>
|
||||
<li><a href="https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/">Tutorial
|
||||
on PCA</a> : PCA explained with R code examples</li>
|
||||
<li><a href="https://pair-code.github.io/understanding-umap/">Understanding
|
||||
UMAP</a> : Short explanation with great visualizations, mainly useful
|
||||
for scRNA-seq analysis</li>
|
||||
</ul>
|
||||
</section></section>
|
||||
</div>
|
||||
</div>
|
||||
|
|
|
|||
10031
docs/Intro_to_R_data_analysis_part_2.html
Normal file
File diff suppressed because one or more lines are too long
|
|
@ -8,6 +8,7 @@
|
|||
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_Unix_Part_1.html">Introduction to Unix - Part 1</a></li>
|
||||
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_Unix_Part_2.html">Introduction to Unix - Part 2</a></li>
|
||||
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_R_data_analysis_part_1.html">Introduction to R Data Analysis - Part 1</a></li>
|
||||
<li><a href="https://gladstone-institutes.github.io/Bioinformatics-Workshops/Intro_to_R_data_analysis_part_2.html">Introduction to R Data Analysis - Part 2</a></li>
|
||||
</ul>
|
||||
</body>
|
||||
</html>
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
---
|
||||
title: "Introduction to R Data Analysis - Part 1"
|
||||
author: "Natalie Elphick"
|
||||
date: "January 22nd"
|
||||
date: "January 22nd, 2024"
|
||||
knit: (function(input, ...) {
|
||||
rmarkdown::render(
|
||||
input,
|
||||
|
|
@ -47,8 +47,8 @@ Bioinformatician I
|
|||
5. Variables
|
||||
6. Types & data structures
|
||||
7. Math and logic operations
|
||||
8. Functions and libraries
|
||||
9. Reading data into R
|
||||
8. Functions and packages
|
||||
|
||||
|
||||
# What is R?
|
||||
|
||||
|
|
@ -329,81 +329,3 @@ packages
|
|||
- Feb 1, 2024 (9:30am-12:00pm)
|
||||
|
||||
|
||||
# ChatGPT Tips for R
|
||||
|
||||
## General Tips
|
||||
|
||||
- Always confirm ChatGPT's outputs are correct
|
||||
- Provide as much detail as possible about the problem in the 1st prompt
|
||||
- Use separate chats for separate tasks/projects
|
||||
- Try the 'Custom Instructions' function that adds additional information to every prompt
|
||||
- Can visit webpages (GPT 4 only), which can help get more specific answers
|
||||
|
||||
## Code Tips
|
||||
|
||||
- Commented R code yields better responses in my experience
|
||||
- Provide the code and error message in the same prompt
|
||||
- ChatGPT can work well to convert syntax and improve your code:
|
||||
- "Turn this loop into a function : [your code]"
|
||||
- "Is there a better way to do this : [your code]"
|
||||
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
|
||||
|
||||
# Finding R Packages
|
||||
|
||||
## Key Questions
|
||||
|
||||
- What assay was the package designed for?
|
||||
- When was the last release?
|
||||
- Is it maintained (frequent updates)?
|
||||
- Does it work on all operating systems?
|
||||
- Are other people using it? (citations)
|
||||
- Do they respond to github issues?
|
||||
- Is there a benchmarking paper?
|
||||
|
||||
## BioConductor and CRAN
|
||||
|
||||
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
|
||||
|
||||
- Prefer BioConductor packages if available over CRAN
|
||||
|
||||
- Prefer CRAN packages over ones only hosted on GitHub
|
||||
|
||||
## Start with the Assay
|
||||
|
||||
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
|
||||
- Pick the assay you want to analyse
|
||||
- Pick the type of analysis you want to do
|
||||
- Find a package that does it
|
||||
- Find benchmarking papers to narrow the list of packages down
|
||||
- Find the vignette on the package page and refer to the manual for any questions not covered by it
|
||||
|
||||
|
||||
# Additional Resources
|
||||
|
||||
## R
|
||||
|
||||
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
|
||||
|
||||
- [R for Data Science](https://r4ds.hadley.nz/)
|
||||
|
||||
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
|
||||
|
||||
- [Advanced R](https://adv-r.hadley.nz/)
|
||||
|
||||
## Statistics
|
||||
|
||||
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
|
||||
- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\
|
||||
- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html)
|
||||
|
||||
## RNA-seq Analysis
|
||||
|
||||
- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments
|
||||
- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units
|
||||
- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R
|
||||
|
||||
## Dimensional Reduction
|
||||
|
||||
- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples
|
||||
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis
|
||||
|
||||
|
|
|
|||
348
intro-r-data-analysis/Intro_to_R_data_analysis_part_2.Rmd
Normal file
|
|
@ -0,0 +1,348 @@
|
|||
---
|
||||
title: "Introduction to R Data Analysis - Part 2"
|
||||
author: "Natalie Elphick"
|
||||
date: "January 23rd, 2024"
|
||||
knit: (function(input, ...) {
|
||||
rmarkdown::render(
|
||||
input,
|
||||
output_dir = "../docs"
|
||||
)
|
||||
})
|
||||
output:
|
||||
revealjs::revealjs_presentation:
|
||||
theme: simple
|
||||
css: style.css
|
||||
---
|
||||
|
||||
```{r, setup, include=FALSE}
|
||||
library(kableExtra)
|
||||
library(tidyverse)
|
||||
library(readxl)
|
||||
theme_set(theme_grey(base_size = 16))
|
||||
```
|
||||
|
||||
##
|
||||
|
||||
<center>*Press the ? key for tips on navigating these slides*</center>
|
||||
|
||||
# Schedule
|
||||
|
||||
1. Introduction to Tidyverse
|
||||
2. Filtering and reformatting data
|
||||
3. Plotting data
|
||||
4. Hands-on data analysis
|
||||
|
||||
# Introduction to Tidyverse
|
||||
|
||||
## Tidyverse
|
||||
|
||||
- The tidyverse packages work well together because they share
|
||||
common data representations and design principles
|
||||
- Rows = observations, columns = variables
|
||||
- [ggplot2](), for data visualization.
|
||||
- [dplyr](), for data manipulation.
|
||||
- [tidyr](), for data tidying.
|
||||
- [readr](), for data import.
|
||||
- [purrr](), for iteration.
|
||||
- and more..
|
||||
|
||||
## dplyr
|
||||
- Offers a common “grammar” of functions for data manipulation
|
||||
- [mutate()](https://dplyr.tidyverse.org/reference/mutate.html) adds new variables that are functions of existing
|
||||
columns
|
||||
- [select()](https://dplyr.tidyverse.org/reference/select.html) picks columns based on their names
|
||||
- [filter()](https://dplyr.tidyverse.org/reference/filter.html) picks rows based on their values
|
||||
- [summarise()](https://dplyr.tidyverse.org/reference/summarise.html) reduces multiple values down to a single summary
|
||||
- [arrange()](https://dplyr.tidyverse.org/reference/arrange.html) changes the ordering of the rows
|
||||
- [group_by()](https://dplyr.tidyverse.org/reference/group_by.html) allows any operation to be done “by group”
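
As a rough sketch of how these verbs combine, the example below chains `mutate()`, `filter()` and `arrange()`. The `toy_cars` data frame and its values are made up for illustration and are not part of the workshop materials.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy_cars <- tibble(model = c("a", "b", "c"),
                   cty = c(18, 21, 15),
                   hwy = c(29, 30, 22))

cars_ratio <- toy_cars |>
  mutate(hwy_to_cty = hwy / cty) |>   # add a column computed from existing ones
  filter(hwy_to_cty > 1.45) |>        # keep only rows that meet the condition
  arrange(desc(hwy_to_cty))           # sort from largest to smallest ratio

cars_ratio
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes this kind of chaining possible.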
|
||||
|
||||
|
||||
|
||||
## Example Dataframe
|
||||
- mpg is a dataframe built into the ggplot2 package
|
||||
```{r, eval = FALSE}
|
||||
head(mpg)
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
head(mpg) |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
|
||||
## Select Columns
|
||||
|
||||
```{r, eval = FALSE}
|
||||
select(.data = mpg,
|
||||
year, cty, hwy, manufacturer)
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
select(.data = mpg,
|
||||
year, cty, hwy, manufacturer) |>
|
||||
head() |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
|
||||
|
||||
## Filter Rows
|
||||
|
||||
|
||||
```{r, eval = FALSE}
|
||||
filter(.data = mpg,
|
||||
year == 2008)
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
filter(.data = mpg,
|
||||
year == 2008) |>
|
||||
head() |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
## Arrange Rows
|
||||
|
||||
- `desc()` arranges rows in descending order; the default is ascending
|
||||
```{r, eval = FALSE}
|
||||
arrange(.data = mpg,
|
||||
desc(cyl))
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
arrange(.data = mpg,
|
||||
desc(cyl)) |>
|
||||
head(n = 3) |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
## Summarising data
|
||||
- The dplyr **summarize()** function computes a table of
|
||||
summaries for a data frame
|
||||
- **group_by()** groups the input data frame by the specified
|
||||
variable(s)
|
||||
- Combining these two allows us to easily create summaries for
|
||||
different categorical groupings
|
||||
|
||||
## Group and Summarise
|
||||
```{r, eval = FALSE}
|
||||
summarise(group_by(.data = mpg,
|
||||
manufacturer),
|
||||
mean_cty = mean(cty),
|
||||
median_cty = median(cty))
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
summarise(group_by(.data = mpg,
|
||||
manufacturer),
|
||||
mean_cty = mean(cty),
|
||||
median_cty = median(cty)) |>
|
||||
head() |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
|
||||
## The pipe operator |>
|
||||
- Allows "chaining" of function calls to make code more readable
|
||||
```{r, eval = FALSE}
|
||||
mpg |>
|
||||
group_by(manufacturer) |>
|
||||
summarise(mean_cty = mean(cty),
|
||||
median_cty = median(cty))
|
||||
```
|
||||
|
||||
```{r, echo = FALSE}
|
||||
mpg |>
|
||||
group_by(manufacturer) |>
|
||||
summarise(mean_cty = mean(cty),
|
||||
median_cty = median(cty)) |>
|
||||
head(n = 4) |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
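
To make the "chaining" idea concrete, here is a minimal base R sketch comparing a nested call with its piped equivalent. It assumes R 4.1 or later, where the native pipe `|>` is available; the `values` vector is made up for illustration.

```r
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), digits = 1)

# The same computation with the native pipe reads top to bottom
piped <- values |>
  sqrt() |>
  mean() |>
  round(digits = 1)

identical(nested, piped)
```

Both versions compute the same value; the pipe only changes how the code is written, passing the left-hand side as the first argument of the next function.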
|
||||
|
||||
|
||||
# Plotting
|
||||
|
||||
## ggplot2
|
||||
- The most popular tidyverse package
|
||||
- Create publication quality, highly customizable plots
|
||||
- See the [R graph gallery](https://r-graph-gallery.com/index.html) for examples
|
||||
- ggplots use “layers” to build, modify and overlap visualizations
|
||||
- Layers are added using the + symbol and can be added to an existing ggplot
|
||||
- Many popular packages output ggplots which can then be easily modified by adding layers
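
A small sketch of adding layers to an existing ggplot, using the `mpg` data frame that ships with ggplot2. The variable names `base_plot` and `labelled_plot` and the label text are chosen here for illustration only.

```r
library(ggplot2)

# Store a plot in a variable
base_plot <- ggplot(data = mpg,
                    mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Add labels and a theme to the stored plot without rebuilding it
labelled_plot <- base_plot +
  labs(title = "City vs highway mileage",
       x = "City (mpg)",
       y = "Highway (mpg)") +
  theme_minimal()

labelled_plot
```

The same pattern should apply to a ggplot object returned by another package: store it, then add or override layers with `+`.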
|
||||
|
||||
|
||||
## Creating ggplots
|
||||
|
||||
<br>
|
||||
</br>
|
||||

|
||||
|
||||
|
||||
## Plot Example
|
||||
|
||||
```{r, fig.dim=c(6,4)}
|
||||
ggplot(data = mpg, # Input dataframe
|
||||
mapping = aes(x = cty, y = hwy)) + # Aesthetic mapping
|
||||
geom_point() # Point graph
|
||||
```
|
||||
|
||||
## Adding and Modifying Layers
|
||||
|
||||
```{r, fig.dim=c(10,4)}
|
||||
ggplot(data = mpg,
|
||||
mapping = aes(x = class, y = cty, fill = class)) +
|
||||
geom_violin() +
|
||||
geom_boxplot(width = 0.1,
|
||||
fill = "white")
|
||||
```
|
||||
|
||||
|
||||
# 10 min break
|
||||
|
||||
<center>
|
||||
|
||||
```{r, echo=FALSE}
|
||||
|
||||
countdown::countdown(minutes = 10,
|
||||
seconds = 0,
|
||||
color_border = "black",
|
||||
padding = "50px",
|
||||
margin = "5%",
|
||||
font_size = "5em",
|
||||
style = "position: relative; width: min-content;")
|
||||
```
|
||||
|
||||
</center>
|
||||
|
||||
|
||||
# Hands-on Data Analysis
|
||||
|
||||
## Dataset Description
|
||||
- PanTHERIA
|
||||
- A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature
|
||||
- Macroecological and macroevolutionary research projects
|
||||
- Data is organized by taxonomic rank
|
||||
|
||||
## Taxonomic Rank
|
||||
|
||||

|
||||
|
||||
## Data Preview
|
||||
|
||||
```{r, echo = FALSE}
|
||||
read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
|
||||
head() |>
|
||||
kable() |>
|
||||
kable_styling("striped") |>
|
||||
scroll_box(width = "100%")
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Hands-on Analysis
|
||||
- Open part_2.Rmd
|
||||
|
||||
|
||||
|
||||
# ChatGPT Tips for R
|
||||
|
||||
## General Tips
|
||||
|
||||
- Always confirm ChatGPT's outputs are correct
|
||||
- Provide as much detail as possible about the problem in the 1st prompt
|
||||
- Use separate chats for separate tasks/projects
|
||||
- Try the 'Custom Instructions' function that adds additional information to every prompt
|
||||
- Can visit webpages (GPT 4 only), which can help get more specific answers
|
||||
|
||||
## Code Tips
|
||||
|
||||
- Commented R code yields better responses in my experience
|
||||
- Provide the code and error message in the same prompt
|
||||
- ChatGPT can work well to convert syntax and improve your code:
|
||||
- "Turn this loop into a function : [your code]"
|
||||
- "Is there a better way to do this : [your code]"
|
||||
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
|
||||
|
||||
# Finding R Packages
|
||||
|
||||
## Key Questions
|
||||
|
||||
- What assay was the package designed for?
|
||||
- When was the last release?
|
||||
- Is it maintained (frequent updates)?
|
||||
- Does it work on all operating systems?
|
||||
- Are other people using it? (citations)
|
||||
- Do they respond to github issues?
|
||||
- Is there a benchmarking paper?
|
||||
|
||||
## BioConductor and CRAN
|
||||
|
||||
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
|
||||
|
||||
- Prefer BioConductor packages if available over CRAN
|
||||
|
||||
- Prefer CRAN packages over ones only hosted on GitHub
|
||||
|
||||
## Start with the Assay
|
||||
|
||||
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
|
||||
- Pick the assay you want to analyse
|
||||
- Pick the type of analysis you want to do
|
||||
- Find a package that does it
|
||||
- Find benchmarking papers to narrow the list of packages down
|
||||
- Find the vignette on the package page and refer to the manual for any questions not covered by it
|
||||
|
||||
|
||||
# Additional Resources
|
||||
|
||||
## R
|
||||
|
||||
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
|
||||
|
||||
- [R for Data Science](https://r4ds.hadley.nz/)
|
||||
|
||||
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
|
||||
|
||||
- [Advanced R](https://adv-r.hadley.nz/)
|
||||
|
||||
## Statistics
|
||||
|
||||
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
|
||||
- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\
|
||||
- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html)
|
||||
|
||||
## RNA-seq Analysis
|
||||
|
||||
- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments
|
||||
- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units
|
||||
- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R
|
||||
|
||||
## Dimensional Reduction
|
||||
|
||||
- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples
|
||||
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis
|
||||
|
||||
|
||||
|
||||
# End of Part 2
|
||||
|
||||
## Workshop survey
|
||||
- Please fill out our [workshop survey](https://www.surveymonkey.com/r/F75J6VZ) so we can continue to improve these workshops
|
||||
|
||||
## Upcoming Workshops
|
||||
|
||||
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0)
|
||||
- Jan 25, 2024 (Session 1 - 10am–12pm) (Session 2 - 1pm–3pm)
|
||||
- Jan 26, 2024 (Session 3 - 10am–12pm)
|
||||
|
||||
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
|
||||
- Feb 1, 2024 (9:30am-12:00pm)
|
||||
|
||||
Binary file not shown.
1341
intro-r-data-analysis/Intro_to_R_workshop_materials/part_2.html
Normal file
File diff suppressed because one or more lines are too long
|
|
@ -0,0 +1,259 @@
|
|||
---
|
||||
title: "Intro to R Data Analysis: Part 2"
|
||||
output: html_document # knitr report document type
|
||||
date: "`r Sys.Date()`" # This will update the date every time you knit the doc
|
||||
---
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE)
|
||||
|
||||
# Load packages
|
||||
library(dplyr) # tidyverse data frame manipulation package
|
||||
library(tidyr) # functions to help clean data
|
||||
library(magrittr) # this package provides the pipe operator %>%
|
||||
library(readxl) # read excel files
|
||||
library(ggplot2) # highly customizable plots
|
||||
```
|
||||
|
||||
## R Markdown
|
||||
|
||||
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. Guide to markdown syntax <https://www.markdownguide.org/basic-syntax/>.
|
||||
|
||||
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
|
||||
|
||||
```{r}
|
||||
# Simulates 100 observations from a normal distribution
|
||||
# and plots a histogram
|
||||
val <- rnorm(n = 100)
|
||||
hist(val, breaks = 20)
|
||||
```
|
||||
|
||||
```{r}
|
||||
# The ggplot version of the same plot
|
||||
ggplot(data = tibble(values = val),
|
||||
mapping = aes(x = values))+
|
||||
geom_histogram(bins = 20)
|
||||
```
|
||||
|
||||
|
||||
|
||||
**Important: before running the code below, click Session -> Set Working Directory -> To Source File Location**
|
||||
|
||||
|
||||
## Exercise 3: Reading in Data
|
||||
|
||||
The data we will be analyzing is from the PanTHERIA database which is "a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects."
|
||||
|
||||
```{r}
|
||||
# The data is spread across 3 sheets in an excel file. We need to
|
||||
# combine these data into one table/data frame.
|
||||
|
||||
# na = "NA" tells read_xlsx how missing values appear in the data
|
||||
# the default is empty cells. Run "?read_xlsx" for more info
|
||||
sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
|
||||
sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
|
||||
sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
|
||||
|
||||
# rbind (row-bind) combines data frames by row
|
||||
pantheria <- rbind(sheet1, sheet2, sheet3)
|
||||
```
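
As a minimal sketch of what `rbind()` does, the example below stacks two small data frames. The data frames and their values are hypothetical stand-ins for the Excel sheets, which are assumed here to share identical column names.

```r
# Hypothetical stand-ins for two sheets with identical columns
sheet_a <- data.frame(Species = c("sp1", "sp2"),
                      AdultBodyMass_g = c(100, 200))
sheet_b <- data.frame(Species = "sp3",
                      AdultBodyMass_g = 300)

combined <- rbind(sheet_a, sheet_b)
nrow(combined)   # number of rows from both inputs together

# rbind() stops with an error when the column names differ
sheet_c <- data.frame(species = "sp4", mass = 400)
mismatch <- tryCatch(rbind(sheet_a, sheet_c),
                     error = function(e) "columns do not match")
mismatch
```

If the sheets did not share the same columns, `dplyr::bind_rows()` would be one alternative, since it fills missing columns with `NA` rather than stopping.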
|
||||
|
||||
|
||||
```{r}
|
||||
# How many rows and columns are there?
|
||||
nrow(pantheria)
|
||||
ncol(pantheria)
|
||||
```
|
||||
|
||||
|
||||
```{r}
|
||||
# What does the data look like?
|
||||
head(pantheria)
|
||||
```
|
||||
|
||||
|
||||
## Exercise 4: Filtering and Reformatting Data
|
||||
|
||||
We will be exploring the adult body mass of these mammals as it relates to their trophic level using `dplyr` and `ggplot2`. Download the cheatsheets for these packages at the following links:
|
||||
|
||||
* [dplyr cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf)
|
||||
* [ggplot2 cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf)
|
||||
|
||||
Let's start by subsetting the data with `select()`
|
||||
```{r}
|
||||
# Pipes (%>%) work by passing the data in front of the pipe to the first argument
|
||||
# of the function after it, this prevents a lot of nested function calls and makes
|
||||
# code easier to read.
|
||||
|
||||
pantheria <- pantheria %>% # Passes pantheria as the first argument of select
|
||||
select(Order,
|
||||
Family, # select returns the specified columns
|
||||
Genus,
|
||||
Species,
|
||||
TrophicLevel,
|
||||
AdultBodyMass_g) %>%
|
||||
drop_na() %>% # Remove any rows that have NAs
|
||||
distinct() # Remove any duplicate rows
|
||||
```
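
To see what `drop_na()` and `distinct()` each remove, here is a small sketch on made-up data. The `toy` tibble is hypothetical and only mimics a few PanTHERIA column names.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing value and one duplicated row
toy <- tibble(Order = c("Rodentia", "Rodentia", "Carnivora", "Chiroptera"),
              TrophicLevel = c("herbivore", "herbivore", NA, "carnivore"),
              AdultBodyMass_g = c(20, 20, 5000, 12))

cleaned <- toy %>%
  drop_na() %>%    # drops the Carnivora row because TrophicLevel is NA
  distinct()       # collapses the two identical Rodentia rows into one

cleaned
```

Note that `drop_na()` with no arguments removes a row if any column is missing, so it is usually worth selecting the columns of interest first, as the chunk above does.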
|
||||
|
||||
Data is almost never clean. For example, there should be only 3 trophic levels:
|
||||
```{r}
|
||||
unique(pantheria$TrophicLevel) # unique elements of a vector
|
||||
```
|
||||
|
||||
Let's fix the TrophicLevel column using `mutate()`
|
||||
|
||||
```{r}
|
||||
# mutate allows us to add columns or modify existing ones
|
||||
pantheria <- pantheria %>%
|
||||
mutate(TrophicLevel = tolower(TrophicLevel)) # Make column lowercase
|
||||
```
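
The effect of `tolower()` can be checked on a plain character vector. The labels below are invented to illustrate inconsistent capitalization and are not necessarily the exact values found in the PanTHERIA file.

```r
# Hypothetical labels with inconsistent capitalization
levels_raw <- c("Herbivore", "herbivore", "CARNIVORE", "carnivore", "Omnivore")

length(unique(levels_raw))      # case variants are counted separately

levels_clean <- tolower(levels_raw)
unique(levels_clean)            # one entry per trophic level
```

Running `unique(pantheria$TrophicLevel)` again after the `mutate()` step is a quick way to confirm the fix worked on the real data.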
|
||||
|
||||
|
||||
## Exercise 5: Summarizing data
|
||||
|
||||
Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.
|
||||
|
||||
```{r}
|
||||
pantheria %>%
|
||||
group_by(TrophicLevel) %>% # Group observations by this column
|
||||
summarize(Mean = mean(AdultBodyMass_g), # Summarize will calculate these group wise
|
||||
`Standard Deviation` = sd(AdultBodyMass_g), # Backticks let us use column names that contain spaces
|
||||
Min = min(AdultBodyMass_g),
|
||||
Max = max(AdultBodyMass_g)) %>%
|
||||
ungroup() %>%
|
||||
arrange(desc(Mean)) # Order the data frame by descending mean body mass
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Exercise 6: Plotting Data
|
||||
|
||||
According to the table above, body masses have a really wide range across trophic levels. Let's visualize the distribution of adult body masses.
|
||||
|
||||
```{r}
|
||||
# ggplot2 constructs graphics in layers, each layer is separated by "+"
|
||||
# x and y values are supplied in aes(), the type of plot is specified using
|
||||
# the "geom" functions
|
||||
|
||||
ggplot(data = pantheria, # input data
|
||||
mapping = aes(x = log10(AdultBodyMass_g))) + # log10 transform adult body mass
|
||||
geom_histogram(fill = "#CE3274", # type of plot
|
||||
bins = 40) +
|
||||
xlab(label = "log10 Adult Body Mass (g)") + # x label
|
||||
ylab(label = "Frequency") + # y label
|
||||
labs(title = "Histogram of log10 Adult Body Mass") # title
|
||||
```
|
||||
|
||||
The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
|
||||
|
||||
```{r}
|
||||
pantheria %>%
|
||||
ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
|
||||
geom_histogram(bins = 40) +
|
||||
facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
|
||||
ylab(label = "Frequency") +
|
||||
xlab(label = "log10 Adult Body Mass (g)") +
|
||||
labs(title = "Histograms of log10 Adult Body Mass") +
|
||||
theme(plot.title = element_text(hjust = 0.5)) # Center the plot title
|
||||
```
|
||||
|
||||
The plots show that trophic level affects the distribution of adult body mass: carnivores tend to be smaller. This makes sense because carnivores have higher metabolic demands, so there might be selection pressure towards smaller carnivores. If we wanted to confirm this by fitting a model, we could use the `lm()` function to fit a linear model.
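
As a rough sketch of what such a model fit might look like, the example below uses `lm()` on simulated data. The data frame `sim`, its group means and its sample sizes are invented for illustration; they are not PanTHERIA values, and this is not a recommended analysis for the real data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data, not PanTHERIA
sim <- data.frame(
  TrophicLevel = rep(c("carnivore", "herbivore", "omnivore"), each = 50),
  log10_mass = c(rnorm(50, mean = 1.5),
                 rnorm(50, mean = 2.5),
                 rnorm(50, mean = 2.0))
)

# Model log10 body mass as a function of trophic level
fit <- lm(log10_mass ~ TrophicLevel, data = sim)

summary(fit)  # coefficients compare each level to the baseline (carnivore)
anova(fit)    # overall test for a trophic level effect
```

With a character predictor, `lm()` treats the alphabetically first level as the baseline, so the other coefficients are differences from the carnivore mean.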

## Exercise 7.1: Hands on coding

An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore overrepresented in the dataset. Using the `dplyr` cheatsheet, write code that generates a table of Orders and the percentage of the data each one makes up. Scroll down to see the hint if you are having trouble.

```{r}
# Your code
pantheria %>%
  group_by(Order) %>%
  summarise(n = n()) %>% # Number of rows in each Order
  mutate(percent = 100 * n / sum(n)) %>% # Percentage of all rows in each Order
  arrange(desc(n))

```




































*Exercise 7.1 hint: group_by + summarize(n = n()) + mutate + arrange*

## Exercise 7.2: Hands on coding

Now that we know which Orders are overrepresented, we can plot their body masses by trophic level to see whether they are skewing the overall distributions.

```{r}

top_orders <- c("Rodentia", "Chiroptera") # Character vector of the top 2 Orders from above
# filter uses a conditional to select rows from the data
pantheria %>%
  filter(Order %in% top_orders) %>%
  ggplot(aes(x = log10(AdultBodyMass_g), fill = Order)) +
  geom_histogram(bins = 40) +
  facet_grid(rows = vars(TrophicLevel),
             cols = vars(Order)) +
  ylab(label = "Frequency") +
  xlab(label = "log10 Adult Body Mass (g)") +
  theme(plot.title = element_text(hjust = 0.5))
```

It looks like one of them, *Chiroptera* (bats), is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.

```{r}
pantheria %>%
  filter(Order != "Chiroptera") %>%
  ggplot(aes(x = log10(AdultBodyMass_g), fill = TrophicLevel)) + # Color by trophic level
  geom_histogram(bins = 40) +
  facet_grid(rows = vars(TrophicLevel)) + # Split the plot into rows by trophic level
  ylab(label = "Frequency") +
  xlab(label = "log10 Adult Body Mass (g)") +
  labs(title = "Histograms of log10 Adult Body Mass")
```

We can now see that the body mass of carnivorous mammals is much less skewed than the initial plots suggested. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can produce a full report that contains all of the text and code: click `Knit` to render the HTML report.
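
The inline expression in the paragraph above counts rows by summing a logical vector, since each `TRUE` counts as 1 and each `FALSE` as 0. A minimal sketch on a made-up vector (`orders_example` is hypothetical and is not the workshop data):

```{r}
# Hypothetical vector of Orders, for illustration only
orders_example <- c("Rodentia", "Chiroptera", "Carnivora", "Chiroptera", "Primates")

is_bat <- orders_example == "Chiroptera" # Logical vector, one value per element
is_bat
sum(is_bat)  # Number of TRUE values
mean(is_bat) # Proportion of TRUE values
```

If the column could contain missing values, adding `na.rm = TRUE` to `sum()` avoids an `NA` result.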

## End of workshop exercises

Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built-in datasets you can use for practice:
```{r, echo=FALSE}
available_datasets <- data()
available_datasets$results %>%
  as_tibble() %>%
  select(-LibPath) %>%
  knitr::kable()
```
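
For example, the group-and-summarise pattern from the exercises can be practiced on the built-in `iris` dataset. This is only a sketch: the choice of `iris` and of its `Sepal.Length` column is arbitrary, and any dataset from the table above could be used in the same way.

```{r}
library(dplyr) # Assumed to be installed; already loaded if you ran the earlier chunks

iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(Mean = mean(Sepal.Length),
            Min = min(Sepal.Length),
            Max = max(Sepal.Length)) %>%
  arrange(desc(Mean)) # Order the data frame by descending mean sepal length

iris_summary
```

Swapping `Species` and `Sepal.Length` for the grouping and measurement columns of another dataset gives the same kind of table.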
File diff suppressed because one or more lines are too long
303
intro-r-data-analysis/assets/Taxonomic_Rank_Graph.svg
Normal file
File diff suppressed because one or more lines are too long
|
After Width: | Height: | Size: 122 KiB |
BIN
intro-r-data-analysis/assets/plotting.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 1.1 MiB |
@@ -14,7 +14,7 @@
 .reveal pre code {
   background-color: #d5d5d5 !important;
   color: #333 !important;
-  font-size: 1.5em !important;
+  font-size: 1.25em !important;
 }
 /* Left-align all code outputs */
 .reveal pre code {