Natalie Elphick 2024-05-18 08:18:12 -07:00
parent 094d41cd0d
commit fd2fc5b190
6 changed files with 1040 additions and 1502 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

@ -1,7 +1,8 @@
--- ---
title: "Introduction to R Data Analysis - Part 1" title: "Introduction to R Data Analysis"
subtitle: "Part 1"
author: "Natalie Elphick" author: "Natalie Elphick"
date: "January 22nd, 2024" date: "May 20th, 2024"
knit: (function(input, ...) { knit: (function(input, ...) {
rmarkdown::render( rmarkdown::render(
input, input,
@ -16,6 +17,7 @@ output:
```{r, setup, include=FALSE} ```{r, setup, include=FALSE}
library(tidyverse) library(tidyverse)
knitr::opts_chunk$set(comment = "")
``` ```
## ##
@ -27,8 +29,8 @@ library(tidyverse)
**Natalie Elphick** **Natalie Elphick**
Bioinformatician I Bioinformatician I
**Michela Traglia (TA)** **Yihang Xin (Online TA)**
Senior Statistician Software Engineer III
## Poll 1 ## Poll 1
@ -36,7 +38,7 @@ Senior Statistician
**What is your level of experience with coding/data analysis?** **What is your level of experience with coding/data analysis?**
1. I know another data analysis programming language (Python, Matlab etc.) 1. I know another data analysis programming language (Python, Matlab etc.)
2. I can use Excel to do linear regression 2. I can use Excel
3. I know some R 3. I know some R
4. All of the above 4. All of the above
5. None of the above 5. None of the above
@ -50,8 +52,8 @@ Senior Statistician
1. What is R and why should you use it? 1. What is R and why should you use it?
2. The RStudio interface 2. The RStudio interface
3. File types 3. File types
4. Error messages 4. Variables
5. Variables 5. Error and warning messages
6. Types & data structures 6. Types & data structures
7. Math and logic operations 7. Math and logic operations
8. Functions and packages 8. Functions and packages
@ -86,9 +88,9 @@ functionality
# RStudio # RStudio
## RStudio ## RStudio
- RStudio is an integrated development - RStudio is an integrated development
environment (IDE) environment (IDE)
- It makes R code easier to write by providing a - It is an app that makes R code easier to write by providing a
feature rich graphical user interface (GUI) feature rich graphical user interface (GUI)
<br> <br>
@ -131,8 +133,6 @@ feature rich graphical user interface (GUI)
## Variable definition ## Variable definition
- Variables store information that is referenced and manipulated - Variables store information that is referenced and manipulated
in a computer program in a computer program
- In contrast to the mathematical definition of a variable,
variables in computer science are _mutable_
- There are 3 ways to define variables in R, but one is preferred: - There are 3 ways to define variables in R, but one is preferred:
```{r} ```{r}
x <- 1 # Preferred way x <- 1 # Preferred way
@ -141,7 +141,59 @@ x = 1
print(x) print(x)
``` ```
## Variable naming ## Example
- Run the following in the R console:
```{r}
x <- 1
y <- 4
z <- y
x + y + z
```
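As a small illustrative sketch (not part of the original slides), the assignment `z <- y` in the example above copies the current value of `y`; reassigning `y` afterwards does not change `z`. The value 10 is arbitrary and only chosen for the illustration.

```r
x <- 1
y <- 4
z <- y               # z receives the current value of y (4)
y <- 10              # reassigning y afterwards ...
z                    # ... leaves z at 4
total <- x + y + z   # 1 + 10 + 4
total
```

Running the lines one at a time in the console and watching the Environment pane is an easy way to see each value change.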
# Error and Warning Messages
## Errors
- **Errors**: Stop the execution of your code and must be fixed for the code to run successfully
```{r, eval=FALSE}
x <- 5
y <- 10
z <- x + a
```
```{r,echo=FALSE}
message("Error: object 'a' not found")
```
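One possible way to look at an error without halting a script is base R's `tryCatch()`. The sketch below is an assumption about how one might demonstrate this, not something from the workshop materials; `undefined_value` is a hypothetical name that is deliberately never defined.

```r
x <- 5

# `undefined_value` is a hypothetical, deliberately undefined variable
error_text <- tryCatch(
  {
    z <- x + undefined_value   # this line raises the error
    "no error"                 # only reached if the line above succeeds
  },
  error = function(e) conditionMessage(e)  # return the message as text
)

error_text
```

Here `error_text` holds the "object ... not found" message as a character string, so the rest of the script keeps running.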
## Common Errors
- **Syntax Error:** Invalid R code syntax (e.g. misplaced parentheses)
```{r,echo=FALSE}
message('Error: unexpected ")"')
```
- **Object not found:** This variable is not defined (e.g. misspelled variables)
```{r,echo=FALSE}
message("Error: object 'a' not found")
```
See this [article](https://statsandr.com/blog/top-10-errors-in-r/) for more common errors and how to fix them.
## Warnings
- **Warnings**: Do not stop the execution of your code, but indicate potential issues that you should be aware of and might need to address
```{r}
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)
result <- a + b
```
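The warning in the chunk above comes from R recycling the shorter vector. The following sketch (an illustration added here, using base R's `withCallingHandlers()`) captures the warning text and shows that `result` is still computed; the variable `warning_text` is a name introduced only for this example.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)

warning_text <- NULL
result <- withCallingHandlers(
  a + b,   # b is recycled: 6, 7, 8, 9, 6
  warning = function(w) {
    warning_text <<- conditionMessage(w)  # keep the message for inspection
    invokeRestart("muffleWarning")        # continue without printing it
  }
)

result        # 7 9 11 13 11
warning_text
```

Because execution continues, a warning like this can silently produce values you did not intend, which is why it is worth checking vector lengths when it appears.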
## Variable Naming
- Variable names must start with a letter and can contain - Variable names must start with a letter and can contain
underscores and periods underscores and periods
@ -176,7 +228,7 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
## Data Types ## Data Types
- Integer - Integer
- Whole numbers (in R denote with L ex. 1L,2L) - Whole numbers (in R denoted with L ex. 1L,2L)
- Numeric - Numeric
- Decimal numbers - Decimal numbers
- Logical - Logical
@ -191,7 +243,7 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
**Which of these is not the correct data type for the value?** **Which of these is not the correct data type for the value?**
1. 1.5 - Numeric 1. 1.5 - Numeric
2. "Labrador Retriever" - Character 2. "1" - Character
3. NA - Logical 3. NA - Logical
4. 1 - Integer 4. 1 - Integer
@ -227,6 +279,8 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
countdown::countdown(minutes = 10, countdown::countdown(minutes = 10,
seconds = 0, seconds = 0,
color_border = "black", color_border = "black",
color_running_background = "#47d193",
color_finished_background = "#a3184e",
padding = "50px", padding = "50px",
margin = "5%", margin = "5%",
font_size = "5em", font_size = "5em",
@ -297,6 +351,7 @@ x & !y
execution of code execution of code
```{r} ```{r}
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")
if ("Akita" %in% dog_breeds) { if ("Akita" %in% dog_breeds) {
print("dog_breeds already contains Akita") print("dog_breeds already contains Akita")
} else { } else {
@ -379,11 +434,17 @@ library(ggplot2) # Makes all of the ggplot2 functions available
- The tidyverse is a collection of commonly used data analysis - The tidyverse is a collection of commonly used data analysis
packages packages
- Learning curve is less steep - Learning curve is less steep
- Lots of useful packages for data analysis - Lots of useful packages for cleaning and "wrangling" data into the correct format
## ## Why use Tidyverse Packages?
- Most of the work in data analysis is getting data into the correct format to create outputs
- The tidyverse collection of packages simplifies this process
- Intuitive syntax
- Comprehensive (data manipulation, cleaning, modeling and graphics)
- Consistent data structure
- Strong community support
![tidyverse](assets/tidyverse.png)
# End of Part 1 # End of Part 1
@ -392,11 +453,8 @@ packages
## Upcoming Workshops ## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0) [Single Cell ATAC-Seq Data Analysis Part 2](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-2-1)
- Jan 25, 2024 (Session 1 - 10am-12pm) (Session 2 - 1pm-3pm)
- Jan 26, 2024 (Session 3 - 10am-12pm)
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4) - Check [this link](https://gladstone.org/events?series=data-science-training-program) at the end of the summer for our fall workshop schedule
- Feb 1, 2024 (9:30am-12:00pm)

@ -1,7 +1,8 @@
--- ---
title: "Introduction to R Data Analysis - Part 2" title: "Introduction to R Data Analysis"
subtitle: "Part 2"
author: "Natalie Elphick" author: "Natalie Elphick"
date: "January 23rd, 2024" date: "May 21st, 2024"
knit: (function(input, ...) { knit: (function(input, ...) {
rmarkdown::render( rmarkdown::render(
input, input,
@ -19,6 +20,7 @@ library(kableExtra)
library(tidyverse) library(tidyverse)
library(readxl) library(readxl)
theme_set(theme_grey(base_size = 16)) theme_set(theme_grey(base_size = 16))
knitr::opts_chunk$set(comment = "")
``` ```
## ##
@ -29,7 +31,10 @@ theme_set(theme_grey(base_size = 16))
**Natalie Elphick** **Natalie Elphick**
Bioinformatician I Bioinformatician I
**Yihang Xin (TA)** **Michela Traglia (In Person TA)**
Senior Statistician
**Yihang Xin (Online TA)**
Software Engineer III Software Engineer III
# Schedule # Schedule
@ -46,11 +51,11 @@ Software Engineer III
- The tidyverse packages work well together because they share - The tidyverse packages work well together because they share
common data representations and design principles common data representations and design principles
- Rows = observations, columns = variables - Rows = observations, columns = variables
- [ggplot2](), for data visualization. - [ggplot2](https://ggplot2.tidyverse.org/), for data visualization.
- [dplyr](), for data manipulation. - [dplyr](https://dplyr.tidyverse.org/), for data manipulation.
- [tidyr](), for data tidying. - [tidyr](https://tidyr.tidyverse.org/), for data tidying.
- [readr](), for data import. - [readr](https://readr.tidyverse.org/), for data import.
- [purrr](), for iteration. - [purrr](https://purrr.tidyverse.org/), for iteration.
- and more.. - and more..
## dplyr ## dplyr
@ -67,66 +72,38 @@ common data representations and design principles
## Example Dataframe ## Example Dataframe
- mpg is a dataframe built into the ggplot2 package - mpg is a dataframe built into the ggplot2 package
```{r, eval = FALSE} ```{r}
head(mpg) head(mpg)
``` ```
```{r, echo = FALSE}
head(mpg) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Select Columns ## Select Columns
```{r, eval = FALSE} ```{r}
select(.data = mpg, select(.data = mpg,
year, cty, hwy, manufacturer) year, cty, hwy, manufacturer)
``` ```
```{r, echo = FALSE}
select(.data = mpg,
year, cty, hwy, manufacturer) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Filter Rows ## Filter Rows
```{r, eval = FALSE} ```{r}
filter(.data = mpg, filter(.data = mpg,
year == 2008) year == 2008)
``` ```
```{r, echo = FALSE}
filter(.data = mpg,
year == 2008) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Arrange Rows ## Arrange Rows
- desc() is used to arrange rows in descending order; the default is ascending - desc() is used to arrange rows in descending order; the default is ascending
```{r, eval = FALSE} ```{r}
arrange(.data = mpg, arrange(.data = mpg,
desc(cyl)) desc(cty))
``` ```
```{r, echo = FALSE}
arrange(.data = mpg,
desc(cyl)) |>
head(n = 3) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Summarising data ## Summarising data
- The dplyr **summarise()** function computes a table of - The dplyr **summarise()** function computes a table of
summaries for a data frame summaries for a data frame
@ -136,6 +113,9 @@ variable(s)
different categorical groupings different categorical groupings
## Group and Summarise ## Group and Summarise
- Get the mean and median city mileage for each manufacturer
```{r, eval = FALSE} ```{r, eval = FALSE}
summarise(group_by(.data = mpg, summarise(group_by(.data = mpg,
manufacturer), manufacturer),
@ -144,37 +124,27 @@ summarise(group_by(.data = mpg,
``` ```
```{r, echo = FALSE} ```{r, echo = FALSE}
summarise(group_by(.data = mpg, summarise(.data = group_by(.data = mpg,
manufacturer), manufacturer),
mean_cty = mean(cty), mean_cty = mean(cty),
median_cty = median(cty)) |> median_cty = median(cty)) |>
head() |> head(10)
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
``` ```
## The pipe operator |> ## The pipe operator |>
- Allows "chaining" of function calls to make code more readable - Allows "chaining" of function calls to make code more readable
```{r, eval = FALSE} ```{r}
mpg |>
group_by(manufacturer) |>
summarise(mean_cty = mean(cty),
median_cty = median(cty))
```
```{r, echo = FALSE}
mpg |> mpg |>
group_by(manufacturer) |> group_by(manufacturer) |>
summarise(mean_cty = mean(cty), summarise(mean_cty = mean(cty),
median_cty = median(cty)) |> median_cty = median(cty)) |>
head(n = 4) |> head(5)
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
``` ```
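To make the "chaining" idea concrete, the sketch below (an added illustration, assuming R 4.1 or later for the native `|>` pipe) writes the same summary once as nested calls and once as a pipeline, then checks that the two results agree. The names `nested` and `piped` are introduced only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data frame

# Nested form: read from the inside out
nested <- summarise(group_by(mpg, manufacturer),
                    mean_cty = mean(cty),
                    median_cty = median(cty))

# Piped form: read top to bottom
piped <- mpg |>
  group_by(manufacturer) |>
  summarise(mean_cty = mean(cty),
            median_cty = median(cty))

# Both forms should produce the same table
isTRUE(all.equal(nested, piped))
```

The pipe passes the left-hand result as the first argument of the next function, which is why each dplyr verb takes the data frame first.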
# Plotting # Plotting
## ggplot2 ## ggplot2
@ -205,10 +175,9 @@ ggplot(data = mpg, # Input dataframe
```{r, fig.dim=c(10,4)} ```{r, fig.dim=c(10,4)}
ggplot(data = mpg, ggplot(data = mpg,
mapping = aes(x = class, y = cty, fill = class)) + mapping = aes(x = cty, y = hwy)) +
geom_violin() + geom_point() +
geom_boxplot(width = 0.1, geom_smooth(formula = y ~ x, method = "lm")
fill = "white")
``` ```
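The `geom_smooth(method = "lm")` layer in the new plot draws a straight-line fit of `hwy` on `cty`. As a hedged sketch, the same kind of model can be fit directly with base R's `lm()` to inspect its coefficients; this is an added illustration and not necessarily how the workshop proceeds.

```r
library(ggplot2)  # provides the mpg data frame

# Same formula as the smoothing layer: y = hwy, x = cty
fit <- lm(hwy ~ cty, data = mpg)

coef(fit)      # intercept and slope of the fitted line
summary(fit)   # standard errors, R-squared, and p-values
```

The slope reported by `coef(fit)` corresponds to the line drawn on the scatter plot.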
@ -221,6 +190,8 @@ ggplot(data = mpg,
countdown::countdown(minutes = 10, countdown::countdown(minutes = 10,
seconds = 0, seconds = 0,
color_border = "black", color_border = "black",
color_running_background = "#47d193",
color_finished_background = "#a3184e",
padding = "50px", padding = "50px",
margin = "5%", margin = "5%",
font_size = "5em", font_size = "5em",
@ -234,8 +205,8 @@ countdown::countdown(minutes = 10,
## Dataset Description ## Dataset Description
- PanTHERIA - PanTHERIA
- A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature - A global species-level data set of key traits of all known extant and recently extinct mammals compiled from literature
- Macroecological and macroevolutionary research projects - Used in macroecological and macroevolutionary research projects
- Data is organized by taxonomic rank - Data is organized by taxonomic rank
## Taxonomic Rank ## Taxonomic Rank
@ -252,10 +223,20 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
scroll_box(width = "100%") scroll_box(width = "100%")
``` ```
## Hands-on Analysis
- We will read in the data and explore whether trophic level has a significant effect on the adult body mass of mammals
Steps:
1. Combine and clean the data
2. Visualize adult body mass by trophic level
3. Check for overrepresented groups
4. Fit a simple linear model
## Hands-on Analysis ## Hands-on Analysis
- Open part_2.Rmd - Open part_2.Rmd
- If you just want to follow along and not run code, open part2_filled_out.html
@ -263,63 +244,34 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
## General Tips ## General Tips
- Follow any relevant institutional guidelines on using LLMs
- Always confirm ChatGPT's outputs are correct - Always confirm ChatGPT's outputs are correct
- Provide as much detail as possible about the problem in the 1st prompt - Provide as much detail as possible about the problem in the 1st prompt
- Use separate chats for separate tasks/projects - Use separate chats for separate tasks/projects
- Try the 'Custom Instructions' function that adds additional information to every prompt - Try the 'Custom Instructions' function
- Can visit webpages (GPT 4 only), which can help get more specific answers
## Code Tips ## Code Tips
- Commented R code yields better responses in my experience - Commented R code yields better responses
- Provide the code and error message in the same prompt - Provide the code and error message in the same prompt
- ChatGPT can work well to convert syntax and improve your code: - ChatGPT can work well to convert syntax and improve your code:
- "Turn this loop into a function : [your code]" - "Turn this loop into a function : [your code]"
- "Is there a better way to do this : [your code]" - "Is there a better way to do this : [your code]"
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case - Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
# Finding R Packages
## Key Questions
- What assay was the package designed for?
- When was the last release?
- Is it maintained (frequent updates)?
- Does it work on all operating systems?
- Are other people using it? (citations)
- Do they respond to github issues?
- Is there a benchmarking paper?
## BioConductor and CRAN
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
- Prefer BioConductor packages if available over CRAN
- Prefer CRAN packages over ones only hosted on GitHub
## Start with the Assay
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
- Pick the assay you want to analyse
- Pick the type of analysis you want to do
- Find a package that does it
- Find benchmarking papers to narrow the list of packages down
- Find the vignette on the package page and refer to the manual for any questions not covered by it
# Additional Resources # Additional Resources
## R ## R
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [R for Data Science](https://r4ds.hadley.nz/) - [R for Data Science](https://r4ds.hadley.nz/)
- [Top 10 R Errors and How to Fix them](https://statsandr.com/blog/top-10-errors-in-r/)
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/) - [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
- [Advanced R](https://adv-r.hadley.nz/) - [Advanced R](https://adv-r.hadley.nz/)
## Statistics ## Statistics
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science* - [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
@ -346,10 +298,9 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
## Upcoming Workshops ## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0) [Single Cell ATAC-Seq Data Analysis Part 2](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-2-1)
- Jan 25, 2024 (Session 1 - 10am-12pm) (Session 2 - 1pm-3pm)
- Jan 26, 2024 (Session 3 - 10am-12pm) - Check [this link](https://gladstone.org/events?series=data-science-training-program) at the end of the summer for our fall workshop schedule
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
- Feb 1, 2024 (9:30am-12:00pm)

@ -22,7 +22,7 @@
}, },
"MASS": { "MASS": {
"Package": "MASS", "Package": "MASS",
"Version": "7.3-60", "Version": "7.3-60.0.1",
"Source": "Repository", "Source": "Repository",
"Repository": "CRAN", "Repository": "CRAN",
"Requirements": [ "Requirements": [
@ -33,11 +33,11 @@
"stats", "stats",
"utils" "utils"
], ],
"Hash": "a56a6365b3fa73293ea8d084be0d9bb0" "Hash": "b765b28387acc8ec9e9c1530713cb19c"
}, },
"Matrix": { "Matrix": {
"Package": "Matrix", "Package": "Matrix",
"Version": "1.6-1.1", "Version": "1.6-5",
"Source": "Repository", "Source": "Repository",
"Repository": "CRAN", "Repository": "CRAN",
"Requirements": [ "Requirements": [
@ -50,7 +50,7 @@
"stats", "stats",
"utils" "utils"
], ],
"Hash": "1a00d4828f33a9d690806e98bd17150c" "Hash": "8c7115cd3a0e048bda2a7cd110549f7a"
}, },
"R6": { "R6": {
"Package": "R6", "Package": "R6",
@ -872,7 +872,7 @@
}, },
"lattice": { "lattice": {
"Package": "lattice", "Package": "lattice",
"Version": "0.21-9", "Version": "0.22-6",
"Source": "Repository", "Source": "Repository",
"Repository": "CRAN", "Repository": "CRAN",
"Requirements": [ "Requirements": [
@ -883,7 +883,7 @@
"stats", "stats",
"utils" "utils"
], ],
"Hash": "5558c61e0136e247252f5f952cdaad6a" "Hash": "cc5ac1ba4c238c7ca9fa6a87ca11a7e2"
}, },
"learnr": { "learnr": {
"Package": "learnr", "Package": "learnr",
@ -977,7 +977,7 @@
}, },
"mgcv": { "mgcv": {
"Package": "mgcv", "Package": "mgcv",
"Version": "1.9-0", "Version": "1.9-1",
"Source": "Repository", "Source": "Repository",
"Repository": "CRAN", "Repository": "CRAN",
"Requirements": [ "Requirements": [
@ -990,7 +990,7 @@
"stats", "stats",
"utils" "utils"
], ],
"Hash": "086028ca0460d0c368028d3bda58f31b" "Hash": "110ee9d83b496279960e162ac97764ce"
}, },
"mime": { "mime": {
"Package": "mime", "Package": "mime",
@ -1033,7 +1033,7 @@
}, },
"nlme": { "nlme": {
"Package": "nlme", "Package": "nlme",
"Version": "3.1-163", "Version": "3.1-164",
"Source": "Repository", "Source": "Repository",
"Repository": "CRAN", "Repository": "CRAN",
"Requirements": [ "Requirements": [
@ -1043,7 +1043,7 @@
"stats", "stats",
"utils" "utils"
], ],
"Hash": "8d1938040a05566f4f7a14af4feadd6b" "Hash": "a623a2239e642806158bc4dc3f51565d"
}, },
"openssl": { "openssl": {
"Package": "openssl", "Package": "openssl",

@ -130,3 +130,13 @@ small {
max-width: 70%; max-width: 70%;
border: 1px solid black !important; border: 1px solid black !important;
} }
/* Change link color to sky blue */
.reveal a {
color: #0c74dc;
}
/* Change link color to magenta on hover */
.reveal a:hover {
color: #9c0366 !important;
}