This commit is contained in:
Natalie Elphick 2024-05-18 08:18:12 -07:00
parent 094d41cd0d
commit fd2fc5b190
6 changed files with 1040 additions and 1502 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


@ -1,7 +1,8 @@
---
title: "Introduction to R Data Analysis - Part 1"
title: "Introduction to R Data Analysis"
subtitle: "Part 1"
author: "Natalie Elphick"
date: "January 22nd, 2024"
date: "May 20th, 2024"
knit: (function(input, ...) {
rmarkdown::render(
input,
@ -16,6 +17,7 @@ output:
```{r, setup, include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(comment = "")
```
##
@ -25,10 +27,10 @@ library(tidyverse)
## Introductions
**Natalie Elphick**
Bioinformatician I
Bioinformatician I
**Michela Traglia (TA)**
Senior Statistician
**Yihang Xin (Online TA)**
Software Engineer III
## Poll 1
@ -36,7 +38,7 @@ Senior Statistician
**What is your level of experience with coding/data analysis?**
1. I know another data analysis programming language (Python, Matlab etc.)
2. I can use Excel to do linear regression
2. I can use Excel
3. I know some R
4. All of the above
5. None of the above
@ -50,8 +52,8 @@ Senior Statistician
1. What is R and why should you use it?
2. The RStudio interface
3. File types
4. Error messages
5. Variables
4. Variables
5. Error and warning messages
6. Types & data structures
7. Math and logic operations
8. Functions and packages
@ -86,9 +88,9 @@ functionality
# RStudio
## RStudio
- RStudio is an integrated development
- RStudio is an integrated development
environment (IDE)
- It makes R code easier to write by providing a
- It is an app that makes R code easier to write by providing a
feature-rich graphical user interface (GUI)
<br>
@ -131,8 +133,6 @@ feature rich graphical user interface (GUI)
## Variable definition
- Variables store information that is referenced and manipulated
in a computer program
- In contrast to the mathematical definition of a variable,
variables in computer science are _mutable_
- There are 3 ways to define variables in R, but one is preferred:
```{r}
x <- 1 # Preferred way
@ -141,7 +141,59 @@ x = 1
print(x)
```
## Variable naming
## Example
- Run the following in the R console:
```{r}
x <- 1
y <- 4
z <- y
x + y + z
```
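As a small illustrative sketch (not part of the original exercise), the chunk below suggests why `z` keeps its value: `z <- y` copies the value `y` holds at that moment, so reassigning `y` afterwards should leave `z` unchanged.

```{r}
x <- 1
y <- 4
z <- y        # z receives the current value of y (4)
y <- 10       # reassigning y does not change z
z             # still 4
x + y + z     # 1 + 10 + 4 = 15
```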
# Error and Warning Messages
## Errors
- **Errors**: Stop the execution of your code and must be fixed for the code to run successfully
```{r, eval=FALSE}
x <- 5
y <- 10
z <- x + a
```
```{r,echo=FALSE}
message("Error: object 'a' not found")
```
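A minimal sketch, assuming you want to inspect the error text without halting a script: base R's `tryCatch()` can catch the error and return its message instead. The name `not_defined_anywhere` is a placeholder chosen for illustration and is deliberately never created.

```{r}
x <- 5
# `not_defined_anywhere` does not exist, so the addition raises an error
msg <- tryCatch(
  x + not_defined_anywhere,
  error = function(e) conditionMessage(e)  # return the error text instead of stopping
)
print(msg)
```

Catching an error this way only reports it; the underlying problem (here, the undefined variable) still has to be fixed.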
## Common Errors
- **Syntax Error:** Invalid R code syntax (e.g. misplaced parentheses)
```{r,echo=FALSE}
message('Error: unexpected ")"')
```
- **Object not found:** This variable is not defined (e.g. misspelled variables)
```{r,echo=FALSE}
message("Error: object 'a' not found")
```
See this [article](https://statsandr.com/blog/top-10-errors-in-r/) for more common errors and how to fix them.
## Warnings
- **Warnings**: Do not stop execution, but flag potential issues that you should be aware of and may need to address
```{r}
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)
result <- a + b
```
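The warning above arises because `a` and `b` have different lengths, so R recycles the shorter vector. The following is a sketch, assuming you want to confirm that the result is still computed while capturing the warning text; it uses base R's `withCallingHandlers()`, and `warning_text` is a name introduced only for this example.

```{r}
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9)

# withCallingHandlers() lets the addition finish while recording the warning text
warning_text <- NULL
result <- withCallingHandlers(
  a + b,
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")  # suppress the default warning printout
  }
)
result        # b is recycled, so the 5th element is 5 + 6
warning_text
```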
## Variable Naming
- Variable names must start with a letter and can contain
underscores and periods
@ -176,7 +228,7 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
## Data Types
- Integer
- Whole numbers (in R denote with L ex. 1L,2L)
- Whole numbers (in R, denoted with an L suffix, e.g. 1L, 2L)
- Numeric
- Decimal numbers
- Logical
@ -191,7 +243,7 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
**Which of these is not the correct data type for the value?**
1. 1.5 - Numeric
2. "Labrador Retriever" - Character
2. "1" - Character
3. NA - Logical
4. 1 - Integer
@ -227,6 +279,8 @@ DogBreeds <- c("Labrador Retriever", "Akita", "Bulldog")
countdown::countdown(minutes = 10,
seconds = 0,
color_border = "black",
color_running_background = "#47d193",
color_finished_background = "#a3184e",
padding = "50px",
margin = "5%",
font_size = "5em",
@ -297,6 +351,7 @@ x & !y
execution of code
```{r}
dog_breeds <- c("Labrador Retriever", "Akita", "Bulldog")
if ("Akita" %in% dog_breeds) {
print("dog_breeds already contains Akita")
} else {
@ -379,11 +434,17 @@ library(ggplot2) # Makes all of the ggplot2 functions available
- The tidyverse is a collection of commonly used data analysis
packages
- Learning curve is less steep
- Lots of useful packages for data analysis
- Lots of useful packages for cleaning and "wrangling" data into the correct format
##
## Why use Tidyverse Packages?
![tidyverse](assets/tidyverse.png)
- Most of the work in data analysis is getting data into the correct format to create outputs
- The tidyverse collection of packages simplifies this process
- Intuitive syntax
- Comprehensive (data manipulation, cleaning, modeling and graphics)
- Consistent data structure
- Strong community support
# End of Part 1
@ -392,11 +453,8 @@ packages
## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0)
- Jan 25, 2024 (Session 1 - 10am-12pm) (Session 2 - 1pm-3pm)
- Jan 26, 2024 (Session 3 - 10am-12pm)
[Single Cell ATAC-Seq Data Analysis Part 2](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-2-1)
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
- Feb 1, 2024 (9:30am-12:00pm)
- Check [this link](https://gladstone.org/events?series=data-science-training-program) at the end of the summer for our fall workshop schedule


@ -1,7 +1,8 @@
---
title: "Introduction to R Data Analysis - Part 2"
title: "Introduction to R Data Analysis"
subtitle: "Part 2"
author: "Natalie Elphick"
date: "January 23rd, 2024"
date: "May 21st, 2024"
knit: (function(input, ...) {
rmarkdown::render(
input,
@ -19,6 +20,7 @@ library(kableExtra)
library(tidyverse)
library(readxl)
theme_set(theme_grey(base_size = 16))
knitr::opts_chunk$set(comment = "")
```
##
@ -29,7 +31,10 @@ theme_set(theme_grey(base_size = 16))
**Natalie Elphick**
Bioinformatician I
**Yihang Xin (TA)**
**Michela Traglia (In Person TA)**
Senior Statistician
**Yihang Xin (Online TA)**
Software Engineer III
# Schedule
@ -46,11 +51,11 @@ Software Engineer III
- The tidyverse packages work well together because they share
common data representations and design principles
- Rows = observations, columns = variables
- [ggplot2](), for data visualization.
- [dplyr](), for data manipulation.
- [tidyr](), for data tidying.
- [readr](), for data import.
- [purrr](), for iteration.
- [ggplot2](https://ggplot2.tidyverse.org/), for data visualization.
- [dplyr](https://dplyr.tidyverse.org/), for data manipulation.
- [tidyr](https://tidyr.tidyverse.org/), for data tidying.
- [readr](https://readr.tidyverse.org/), for data import.
- [purrr](https://purrr.tidyverse.org/), for iteration.
- and more..
## dplyr
@ -67,66 +72,38 @@ common data representations and design principles
## Example Dataframe
- mpg is a dataframe built into the ggplot2 package
```{r, eval = FALSE}
```{r}
head(mpg)
```
```{r, echo = FALSE}
head(mpg) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Select Columns
```{r, eval = FALSE}
```{r}
select(.data = mpg,
year, cty, hwy, manufacturer)
```
```{r, echo = FALSE}
select(.data = mpg,
year, cty, hwy, manufacturer) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Filter Rows
```{r, eval = FALSE}
```{r}
filter(.data = mpg,
year == 2008)
```
```{r, echo = FALSE}
filter(.data = mpg,
year == 2008) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Arrange Rows
- desc() arranges rows in descending order; the default is ascending
```{r, eval = FALSE}
```{r}
arrange(.data = mpg,
desc(cyl))
desc(cty))
```
```{r, echo = FALSE}
arrange(.data = mpg,
desc(cyl)) |>
head(n = 3) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
```
## Summarising data
- The dplyr **summarise()** function computes a table of
summaries for a data frame
@ -136,6 +113,9 @@ variable(s)
different categorical groupings
## Group and Summarise
- Get the mean and median city mileage for each manufacturer
```{r, eval = FALSE}
summarise(group_by(.data = mpg,
manufacturer),
@ -144,37 +124,27 @@ summarise(group_by(.data = mpg,
```
```{r, echo = FALSE}
summarise(group_by(.data = mpg,
summarise(.data = group_by(.data = mpg,
manufacturer),
mean_cty = mean(cty),
median_cty = median(cty)) |>
head() |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
head(10)
```
## The pipe operator |>
- Allows "chaining" of function calls to make code more readable
```{r, eval = FALSE}
mpg |>
group_by(manufacturer) |>
summarise(mean_cty = mean(cty),
median_cty = median(cty))
```
```{r, echo = FALSE}
```{r}
mpg |>
group_by(manufacturer) |>
summarise(mean_cty = mean(cty),
median_cty = median(cty)) |>
head(n = 4) |>
kable() |>
kable_styling("striped") |>
scroll_box(width = "100%")
head(5)
```
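As a rough illustration using only base R functions (no tidyverse), `x |> f()` is another way of writing `f(x)`, so a nested call and a piped chain should return the same value. The vector `values` is made up for this example.

```{r}
values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), digits = 1)

# The same steps with the pipe are read top to bottom
piped <- values |>
  sqrt() |>
  mean() |>
  round(digits = 1)

identical(nested, piped)
```

The native pipe `|>` requires R 4.1 or later.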
# Plotting
## ggplot2
@ -204,11 +174,10 @@ ggplot(data = mpg, # Input dataframe
## Adding and Modifying Layers
```{r, fig.dim=c(10,4)}
ggplot(data = mpg,
mapping = aes(x = class, y = cty, fill = class)) +
geom_violin() +
geom_boxplot(width = 0.1,
fill = "white")
ggplot(data = mpg,
mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_smooth(formula = y ~ x, method = "lm")
```
@ -221,6 +190,8 @@ ggplot(data = mpg,
countdown::countdown(minutes = 10,
seconds = 0,
color_border = "black",
color_running_background = "#47d193",
color_finished_background = "#a3184e",
padding = "50px",
margin = "5%",
font_size = "5em",
@ -234,8 +205,8 @@ countdown::countdown(minutes = 10,
## Dataset Description
- PanTHERIA
- A global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals compiled from the literature
- Macroecological and macroevolutionary research projects
- A global species-level data set of key traits of all known extant and recently extinct mammals, compiled from the literature
- Used in macroecological and macroevolutionary research projects
- Data is organized by taxonomic rank
## Taxonomic Rank
@ -252,10 +223,20 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
scroll_box(width = "100%")
```
## Hands-on Analysis
- We will read in the data and test whether trophic level has a significant effect on the adult body mass of mammals
Steps:
1. Combine and clean the data
2. Visualize adult body mass by trophic level
3. Check for overrepresented groups
4. Fit a simple linear model
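
The PanTHERIA column names are not shown on this slide, so the sketch below uses R's built-in `PlantGrowth` data (plant `weight` by treatment `group`) as a stand-in to illustrate the shape of steps 2 to 4. It is an outline under that assumption, not the workshop's actual analysis.

```{r}
# Stand-in data: PlantGrowth ships with base R (columns: weight, group)
data(PlantGrowth)

# Step 2 (analogue): visualize the numeric outcome by group
boxplot(weight ~ group, data = PlantGrowth)

# Step 3 (analogue): count observations per group to spot overrepresented groups
group_counts <- table(PlantGrowth$group)
group_counts

# Step 4 (analogue): fit a simple linear model of the outcome on the group
fit <- lm(weight ~ group, data = PlantGrowth)
summary(fit)
```

With the real data, the outcome would be adult body mass and the grouping variable would be trophic level.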
## Hands-on Analysis
- Open part_2.Rmd
- If you just want to follow along and not run code, open part2_filled_out.html
@ -263,63 +244,34 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
## General Tips
- Follow any relevant institutional guidelines on using LLMs
- Always confirm ChatGPT's outputs are correct
- Provide as much detail as possible about the problem in the 1st prompt
- Use separate chats for separate tasks/projects
- Try the 'Custom Instructions' function that adds additional information to every prompt
- Can visit webpages (GPT 4 only), which can help get more specific answers
- Try the 'Custom Instructions' function
## Code Tips
- Commented R code yields better responses in my experience
- Commented R code yields better responses
- Provide the code and error message in the same prompt
- ChatGPT can work well to convert syntax and improve your code:
- "Turn this loop into a function : [your code]"
- "Is there a better way to do this : [your code]"
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
# Finding R Packages
## Key Questions
- What assay was the package designed for?
- When was the last release?
- Is it maintained (frequent updates)?
- Does it work on all operating systems?
- Are other people using it? (citations)
- Do they respond to github issues?
- Is there a benchmarking paper?
## BioConductor and CRAN
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
- Prefer BioConductor packages if available over CRAN
- Prefer CRAN packages over ones only hosted on GitHub
## Start with the Assay
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
- Pick the assay you want to analyse
- Pick the type of analysis you want to do
- Find a package that does it
- Find benchmarking papers to narrow the list of packages down
- Find the vignette on the package page and refer to the manual for any questions not covered by it
# Additional Resources
## R
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [R for Data Science](https://r4ds.hadley.nz/)
- [Top 10 R Errors and How to Fix them](https://statsandr.com/blog/top-10-errors-in-r/)
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
- [Advanced R](https://adv-r.hadley.nz/)
## Statistics
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
@ -346,10 +298,9 @@ read_xlsx("Intro_to_R_workshop_materials/PanTHERIA.xlsx") |>
## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0)
- Jan 25, 2024 (Session 1 - 10am-12pm) (Session 2 - 1pm-3pm)
- Jan 26, 2024 (Session 3 - 10am-12pm)
[Single Cell ATAC-Seq Data Analysis Part 2](https://gladstone.org/events/single-cell-atac-seq-data-analysis-part-2-1)
- Check [this link](https://gladstone.org/events?series=data-science-training-program) at the end of the summer for our fall workshop schedule
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
- Feb 1, 2024 (9:30am-12:00pm)


@ -22,7 +22,7 @@
},
"MASS": {
"Package": "MASS",
"Version": "7.3-60",
"Version": "7.3-60.0.1",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
@ -33,11 +33,11 @@
"stats",
"utils"
],
"Hash": "a56a6365b3fa73293ea8d084be0d9bb0"
"Hash": "b765b28387acc8ec9e9c1530713cb19c"
},
"Matrix": {
"Package": "Matrix",
"Version": "1.6-1.1",
"Version": "1.6-5",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
@ -50,7 +50,7 @@
"stats",
"utils"
],
"Hash": "1a00d4828f33a9d690806e98bd17150c"
"Hash": "8c7115cd3a0e048bda2a7cd110549f7a"
},
"R6": {
"Package": "R6",
@ -872,7 +872,7 @@
},
"lattice": {
"Package": "lattice",
"Version": "0.21-9",
"Version": "0.22-6",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
@ -883,7 +883,7 @@
"stats",
"utils"
],
"Hash": "5558c61e0136e247252f5f952cdaad6a"
"Hash": "cc5ac1ba4c238c7ca9fa6a87ca11a7e2"
},
"learnr": {
"Package": "learnr",
@ -977,7 +977,7 @@
},
"mgcv": {
"Package": "mgcv",
"Version": "1.9-0",
"Version": "1.9-1",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
@ -990,7 +990,7 @@
"stats",
"utils"
],
"Hash": "086028ca0460d0c368028d3bda58f31b"
"Hash": "110ee9d83b496279960e162ac97764ce"
},
"mime": {
"Package": "mime",
@ -1033,7 +1033,7 @@
},
"nlme": {
"Package": "nlme",
"Version": "3.1-163",
"Version": "3.1-164",
"Source": "Repository",
"Repository": "CRAN",
"Requirements": [
@ -1043,7 +1043,7 @@
"stats",
"utils"
],
"Hash": "8d1938040a05566f4f7a14af4feadd6b"
"Hash": "a623a2239e642806158bc4dc3f51565d"
},
"openssl": {
"Package": "openssl",


@ -129,4 +129,14 @@ small {
.big-picture img{
max-width: 70%;
border: 1px solid black !important;
}
}
/* Change link color to sky blue */
.reveal a {
color: #0c74dc;
}
/* Change link color to magenta on hover */
.reveal a:hover {
color: #9c0366 !important;
}