finish up part 1 slides

This commit is contained in:
Natalie Elphick 2024-01-17 19:11:06 -08:00
parent 7794d7323e
commit d468ded0ee
36 changed files with 12432 additions and 43198 deletions

View file

@ -0,0 +1 @@
source("renv/activate.R")

View file

@ -0,0 +1,412 @@
---
title: "Introduction to R Data Analysis - Part 1"
author: "Natalie Elphick"
date: "January 22nd"
knit: (function(input, ...) {
rmarkdown::render(
input,
output_dir = "../docs"
)
})
output:
revealjs::revealjs_presentation:
theme: simple
css: style.css
---
```{r, setup, include=FALSE}
library(tidyverse)
```
##
<center>*Press the ? key for tips on navigating these slides*</center>
## Introductions
**Natalie Elphick**
Bioinformatician I
*Bioinformatics Core*
## Poll 1
**What is your level of experience with coding/data analysis?**
1. I am fluent in another data analysis programming language (Python, Matlab etc.)
2. I am use Excel to do linear regression
3. I know some R
4. All of the above
5. None of the above
## Part 1:
1. What is R and why should you use it?
2. The RStudio interface
3. File types
4. Error messages
5. Variables
6. Types & data structures
*10 min break*
7. Math and logic operations
8. Functions and libraries
9. Reading data into R
# What is R?
## R
- An open source language developed for statistical computing by **R**oss Ihaka and **R**obert Gentleman
- Inspired by the **S** language developed at Bell labs in 1976 to make interactive data analysis easier
- The first official version was released in 2000
## Why use R for data analysis?
- R is and will always be free
- Can easily implement any statistical analysis
- Code serves as a record which enables reproducibility
with minimal effort
- As of March 2023, there were over 19,000 open source packages to extend its
functionality
- Highly customizable graphics ([ggplot2](https://ggplot2-book.org/))
- Analysis reports ([knitr](https://cran.r-project.org/web/packages/knitr/index.html))
- RNA-seq analysis ([DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html))
## How does it work?
<section class="shrink">
![Programming](assets/R_lang_hierarchy.png)
</section>
# RStudio
## RStudio
- RStudio is an integrated development
environment (IDE)
- It makes R code easier to write by providing a
feature rich graphical user interface (GUI)
<br>
</br>
<section class="shrink">
![R and RStudio](assets/R_and_RStudio.png)
</section>
## Layout
![Layout](assets/RStudio_layout.png)
## File types
- **Rscript** files that end in `.R`
- The most basic, a file that contains R code
- **RMarkdown** files that end in `.Rmd`
- Let's create a blank Rscript to see how they work, open RStudio and click:
- File -\> New File -\> R Script
## R Markdown
- A file format combining `R` code with [Markdown](https://www.markdownguide.org/basic-syntax/) for text formatting.
- Designed for creating reproducible research reports in various formats (HTML, PDF, Word).
- Let's create an `Rmd` file in `RStudio` to explore the basics of how they work:
- File -\> New File -\> R Markdown
## R Markdown Advanced Usage
- **Presentations:** Creating slides (like these) with [revealjs](https://github.com/rstudio/revealjs).
- **Publications:** Authoring online books that combine narrative, code, and output with [bookdown](https://bookdown.org/).
- **Interactive Documents:** Developing interactive tutorials or dashboards with [learnR](https://rstudio.github.io/learnr/) and other embedded applications.
# Variables
## Variable definition
- Variables store information that is referenced and manipulated
in a computer program
- In contrast to the mathematical definition of a variable,
variables in computer science are _mutable_
- There are 3 ways to define variables in R, but one is preferred:
```{r}
x <- 1 # Preferred way
x = 1
1 -> x
print(x)
```
## Variable naming
- Variables names must start with a letter and can contain
underscores and periods
- It is best practice to use descriptive variable names and stick
to one style of names
```{r}
# Snake case
dog_breeds <- c("Labrador Retriever","Akita", "Bulldog")
# Period separated
dog.breeds <- c("Labrador Retriever","Akita", "Bulldog")
# Camel case
DogBreeds <- c("Labrador Retriever","Akita", "Bulldog")
```
## Poll 2
**Which variable name is not valid in R?**
1. cat_dog
2. CatDOG
3. cat.dog
4. catD0g
## Excercise 1
- Open Rscript file part_1.R in Rstudio
# Data Types and Structures
## Data Types
- Integer
- Whole numbers
- Numeric
- Decimal numbers
- Logical
- Boolean (TRUE, FALSE)
- NA
- Character
- Letters and strings of letters
- "A", "Labrador Retriever"
## Data Structures
- Vectors
- Atomic vectors - one dimensional lists that store values of
the **same type**
- Lists - can be multidimensional and contain **different types/structures** (ex. nested lists)
- Factors
- Ordered list with assigned levels
- Matrix
- Columns and rows of the **same type**
- Data frames
- Columns and rows of **mixed types**
##
![Data structures](assets/data structures.png)
## Exercise 2: Data Types and Structures
- Reopen Rscript file part_1.R in Rstudio
# 10 min break
<center>
```{r, echo=FALSE}
countdown::countdown(minutes = 10,
seconds = 0,
color_border = "black",
padding = "50px",
margin = "5%",
font_size = "5em",
style = "position: relative; width: min-content;")
```
</center>
# Math and Logic Operations
## Math & Logic
- Built in functions to get common mathematical summaries of data (eg. mean( ), median( ), mode( ) )
- Relational comparison operators to compare values
```{r, eval=FALSE}
x == y # Equal to
x != y # Not equal to
x < y # Less than
x > y # Greater than
x <= y # Less than or equal to
x >= y # Greater than or equal to
x %in% y # Is x in this vector y?
```
## Logical Operators
- Logical operators can compare TRUE or FALSE values
```{r, eval=FALSE}
x <- TRUE
y <- FALSE
!x # Not x
x | y # x or y
x & y # x and y
```
## Conditional execution
- Relational and logical operations allow for conditional
execution of code
```{r}
if ("Akita" %in% dog_breeds) {
print("dog_breeds already contains Akita")
} else {
dog_breeds <- c("Akita", dog_breeds)
}
```
# Functions
## Functions
- A function is block of organized, reusable code that is used to
perform a single action
- R has many built in functions, these are called **base R** functions
- Not all arguments are required and some have default values
![Functions](assets/functions.png)
## Defining a function
- To define a function we use the function keyword, the output is specified with the **return** keyword:
```{r}
add_dog <- function(dog_to_add,
input_vector) {
if (dog_to_add %in% input_vector) {
message("Already contains this dog")
} else {
output <- c(dog_to_add, input_vector)
return(output)
}
}
```
## Example
```{r}
add_dog(dog_to_add = "Akita",
input_vector = dog_breeds)
```
```{r}
add_dog(dog_to_add = "German Shepard",
input_vector = dog_breeds)
```
# Packages
## Packages
- Packages are collections of functions that are specialized to a specific task (plotting, data manipulation etc.)
- The tidyverse is a collection of commonly used data analysis
packages
- Learning curve is less steep
- Lots of useful packages for data analysis
##
![tidyverse](assets/tidyverse.png)
# End of Part 1
## Workshop survey
- Please fill out our [workshop survey](https://www.surveymonkey.com/r/F75J6VZ) so we can continue to improve these workshops
## Upcoming Workshops
1. [Introduction to Statistics, Experimental Design, and Hypothesis Testing](https://gladstone.org/index.php/events/introduction-statistics-experimental-design-and-hypothesis-testing-0)
- Jan 25, 2024 (Session 1 - 10am12pm) (Session 2 - 1pm3pm)
- Jan 26, 2024 (Session 3 - 10am12pm)
2. [Intermediate RNA-Seq Analysis Using R](https://gladstone.org/index.php/events/intermediate-rna-seq-analysis-using-r-4)
- Feb 1, 2024 (9:30am-12:00pm)
# ChatGPT Tips for R
## General Tips
- Always confirm ChatGPT's outputs are correct
- Provide as much detail as possible about the problem in the 1st prompt
- Use separate chats for separate tasks/projects
- Try the 'Custom Instructions' function that adds additional information to every prompt
- Can visit webpages (GPT 4 only), which can help get more specific answers
## Code Tips
- Commented R code yields better responses in my experience
- Provide the code and error message in the same prompt
- ChatGPT can work well to convert syntax and improve your code:
- "Turn this loop into a function : [your code]"
- "Is there a better way to do this : [your code]"
- Check out the file: `example_code/1_convert_syntax_example.R` for an example use case
# Finding R Packages
## Key Questions
- What assay was the package designed for?
- When was the last release?
- Is it maintained (frequent updates)?
- Does it work on all operating systems?
- Are other people using it? (citations)
- Do they respond to github issues?
- Is there a benchmarking paper?
## BioConductor and CRAN
- Both of these have stringent requirements for packages they host (eg. for BioConductor they have to run on all major operating systems)
- Prefer BioConductor packages if available over CRAN
- Prefer CRAN packages over ones only hosted on GitHub
## Start with the Assay
- Click [here](https://www.bioconductor.org/packages/release/BiocViews.html#___Sequencing) to go to BioC views
- Pick the assay you want to analyse
- Pick the type of analysis you want to do
- Find a package that does it
- Find benchmarking papers to narrow the list of packages down
- Find the vignette on the package page and refer to the manual for any questions not covered by it
# Additional Resources
## R
- [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/how-to-read-this-book.html) : Excellent R markdown reference
- [R for Data Science](https://r4ds.hadley.nz/)
- [ggplot2: elegant graphics for data analysis](https://ggplot2-book.org/)
- [Advanced R](https://adv-r.hadley.nz/)
## Statistics
- [Data Analysis in R](https://bookdown.org/steve_midway/DAR) : This book has more statistics details than *R for Data Science*
- [Generalized Linear Models](https://bookdown.org/steve_midway/DAR/glms-generalized-linear-models.html)\
- [Random Effects](https://bookdown.org/steve_midway/DAR/random-effects.html)
## RNA-seq Analysis
- [RNA-seqlopedia](https://rnaseq.uoregon.edu/) : Everything you need to know about RNA-seq experiments
- [RNA-seq Expression Units](https://luisvalesilva.com/datasimple/rna-seq_units.html) : Blog post on understanding common units
- [Introduction to Single-Cell Analysis with Bioconductor](https://bioconductor.org/books/3.17/OSCA.intro/index.html) : Covers the basics of scRNA-seq analysis in R
## Dimensional Reduction
- [Tutorial on PCA](https://uw.pressbooks.pub/appliedmultivariatestatistics/chapter/pca/) : PCA explained with R code examples
- [Understanding UMAP](https://pair-code.github.io/understanding-umap/) : Short explanation with great visualizations, mainly useful for scRNA-seq analysis

View file

@ -0,0 +1,118 @@
# Converting loops to map with ChatGPT ------------------------------------
# Author: Natalie Elphick
# ** Important: **
# Before running this code, set your working directory to the same location as this file:
# Click Session -> Set Working Directory -> To Source File Location
# Run the code line by line with Ctrl+Enter or Cmd+Enter on Mac
# First we will create a simulated example and then use ChatGPT to convert the
# loop into tidyverse style code that uses map()
library(tidyverse)
# Gene names
gene_names <- paste("Gene", 1:10)
# Simulate some data
data <- data.frame(
Gene = rep(gene_names, each = 20),
Treatment = rep(rep(c("Treatment", "Control"), each = 10), times = length(gene_names)),
Sample = rep(1:20, times = length(gene_names)),
Log_CPM = c(
rnorm(10, mean = 5, sd = 1),
rnorm(10, mean = 6, sd = 1)
) + rnorm(20 * length(gene_names), mean = 0, sd = 1) # Add sample-specific variation
)
# Create a directory to save plots
dir.create("gene_boxplots", showWarnings = FALSE)
# Iterate through gene names and create boxplots
for (gene_name in gene_names) {
# Subset data for the current gene
gene_data <- data[data$Gene == gene_name, ]
# Create a boxplot comparing treatment vs. control for Log CPM using ggplot2
plot_filename <- paste("gene_boxplots/", gene_name, "_boxplot.png", sep = "")
p <- ggplot(data = gene_data, aes(x = Treatment, y = Log_CPM, fill = Treatment)) +
geom_boxplot() +
labs(
title = paste("Log CPM for", gene_name),
x = "Treatment",
y = "Log CPM"
) +
scale_fill_manual(values = c("#009E73", "#D55E00")) +
theme_minimal()
ggsave(plot = p,plot_filename, width = 6, height = 4)
# Print the plot filename
cat("Boxplot saved as:", plot_filename, "\n")
}
# Converting the syntax ---------------------------------------------------
# Paste the following prompt into ChatGPT:
# Convert the following code into tidyverse style code that uses purr::map() :
# <paste in the code above>
# Put the result below and run it
# Here is what I got ------------------------------------------------------
library(purrr) # *This part was incorrect, it should still be library(tidyverse)*
# Gene names
gene_names <- paste("Gene", 1:10)
# Simulate some data
data <- tibble(
Gene = rep(gene_names, each = 20),
Treatment = rep(rep(c("Treatment", "Control"), each = 10), times = length(gene_names)),
Sample = rep(1:20, times = length(gene_names)),
Log_CPM = c(
rnorm(10, mean = 5, sd = 1),
rnorm(10, mean = 6, sd = 1)
) + rnorm(20 * length(gene_names), mean = 0, sd = 1) # Add sample-specific variation
)
# Create a directory to save plots
dir.create("gene_boxplots", showWarnings = FALSE)
# Function to create and save a boxplot for a given gene
save_gene_plot <- function(gene_name) {
gene_data <- data %>% filter(Gene == gene_name)
plot_filename <- paste("gene_boxplots/", gene_name, "_boxplot.png", sep = "")
p <- ggplot(data = gene_data, aes(x = Treatment, y = Log_CPM, fill = Treatment)) +
geom_boxplot() +
labs(
title = paste("Log CPM for", gene_name),
x = "Treatment",
y = "Log CPM"
) +
scale_fill_manual(values = c("#009E73", "#D55E00")) +
theme_minimal()
ggsave(plot = p, filename = plot_filename, width = 6, height = 4)
# Print the plot filename
cat("Boxplot saved as:", plot_filename, "\n")
}
# Iterate through gene names and create boxplots using map
map(gene_names, save_gene_plot)

View file

@ -0,0 +1,169 @@
# Exercise 1: Rscript Files -----------------------------------------------
# This is an Rscript file
# - It contains lines of code that can be executed using Ctrl+Enter
#
# - Lines that begin with # are called comments and are ignored by the compiler,
# these can be used to document what the code does
#
# - The whole file can be executed by pressing Source above
# *This will error unless you comment out line 49*
#
# - Refer back to the slides at the end of each Exercise
# Use Ctrl+Enter to execute each of these lines
1 + 5 # Addition
70 - 23 # Subtraction
64 / 8 # Division
8 * 8 # Multiplication
sqrt(25) # Square root
5 ^ 2 # Exponent
pi # Built in variable for pi
# Rstudio's tab completion feature is very useful, type "sq" and press tab
# **Check your understanding**
# Calculate the area of a circle with radius = 5cm
# End of Exercise 1 -------------------------------------------------------
# Exercise 2: Data Types and Structures -----------------------------------
##### 2.1 Data Types #####
# Different types and structures have can have different operations and
# functions done to them. Use class() to check the type of a variable
# Integer
x <- 4L
class(x)
# Numeric
y <- 1.5
class(y)
# When possible R will coerce types, for example:
class(x + y)
# It will not coerce all types though, this will throw an error
1 + "2"
# You can change the type of a variable using "as" functions
1 + as.numeric("2")
# What do you think the output of this line will be?
4 + TRUE
# What about this line?
class(c(TRUE, FALSE, TRUE, 4))
# NAs are a special case of logical, used most often to represent missing data
class(NA)
# Math operations with NAs return NA
2 + NA
mean(c(1,2,3,NA))
##### 2.2 Data Structures #####
# Defining an atomic vector
abc <- c("a", "b", "c")
# Accessing elements of a vector
abc[2]
# Range of elements
abc[1:2]
# Position of an element
which(abc == "b")
# Easily specify a vector of numbers
num <- 1:10
num
# Defining an unnamed list
abc <- list("a", "b", 3)
# Accessing elements of unnamed list
abc[[1]]
# Defining a named list
abc <- list(a = 1, b = "2", c = 3)
# Elements of a named list can be accessed 3 ways
abc$b # Name with dollar sign
abc[["b"]] # Name with brackets
abc[[2]] # Numeric index with brackets
# Range of elements
abc[2:3]
# Defining a matrix
numeric_matrix <- matrix(1:9, nrow = 3, ncol = 3)
numeric_matrix
# Elements are accessed using row,column indexes
numeric_matrix[2, 3]
numeric_matrix[2, ] # Whole 2nd row
numeric_matrix[1:2, ] # First 2 rows
# Rows and columns of a matrix can have names
numeric_matrix <- matrix(data = 1:9,
nrow = 3,
dimnames = list(
c("X", "Y", "Z"),
c("A", "B", "C")
)
)
numeric_matrix
# Elements can be accessed both by numeric index and column and row names
numeric_matrix[2, 3]
numeric_matrix["Y", "C"]
numeric_matrix[1:2, "C"]
# Defining a data frame
mixed_df <- data.frame(
numbers = 1:3,
letters = c("A", "B", "C"),
bool = c(TRUE, FALSE, NA)
)
mixed_df
# Accessing elements of a data frame
mixed_df$numbers # Returns the vector
mixed_df[["numbers"]]
mixed_df["letters"] # Returns the column as a single column data frame
# Count rows and columns in data frames or matrices
nrow(mixed_df)
ncol(mixed_df)
# Similar to matrices, elements in a data frame can also be accessed by
# combinations of numeric index and column and row names.
# **Check your understanding**
# Write 1 line of code that accesses only elements B, C, FALSE and NA of
# mixed_df
#
# Hint: line 134

View file

@ -0,0 +1,281 @@
---
title: "Intro to R Data Analysis: Part 2"
output: html_document # knitr report document type
date: "`r Sys.Date()`" # This will update the date everytime you knit the doc
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Load packages
library(dplyr) # tidyverse data frame manipulation package
library(tidyr) # functions to help clean data
library(magrittr) # this package provides the pipe operator %>%
library(readxl) # read excel files
library(ggplot2) # highly customizable plots
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>. Guide to markdown syntax <https://www.markdownguide.org/basic-syntax/>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r}
# Simulates 100 observations from a normal distribution
# and plots a histogram
val <- rnorm(n = 100)
hist(val, breaks = 20)
```
```{r}
# The ggplot version of the same plot
ggplot(data = tibble(values = val),
mapping = aes(x = values))+
geom_histogram(bins = 20)
```
**Important: before running the code below, click Session -> Set Working Directory -> To Source File Location**
## Exercise 3: Reading in Data
The data we will be analyzing is from the PanTHERIA database which is "a global species-level data set of key life-history, ecological and geographical traits of all known extant and recently extinct mammals (PanTHERIA) developed for a number of macroecological and macroevolutionary research projects."
```{r}
# The data is spread across 3 sheets in an excel file. We need to
# combine these data into one table/data frame.
# na = "NA" tells read_xlsx how missing values appear in the data
# the default is empty cells. Run "?read_xlsx" for more info
sheet1 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 1, na = "NA")
sheet2 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 2, na = "NA")
sheet3 <- read_xlsx(path = "PanTHERIA.xlsx", sheet = 3, na = "NA")
colnames(sheet1) == colnames(sheet2) & colnames(sheet2) == colnames(sheet3)
# rbind (row-bind) combines data frames by row
pantheria <- rbind(sheet1, sheet2, sheet3)
```
```{r}
# How many rows and columns are there?
nrow(pantheria)
ncol(pantheria)
```
```{r}
# What does the data look like?
head(pantheria)
```
## Exercise 4: Filtering and Reformatting Data
We will exploring adult body mass from these mammals as it relates to their trophic level using `dpylr` and `ggplot2`. Download the cheatsheets for these packages at the following links:
* [dplyr cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf)
* [ggplot2 cheatsheet](https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf)
Let's start by subsetting the data with `select()`
```{r}
# Pipes (%>%) work by passing the data in front of the pipe to the first argument
# of the function after it, this prevents a lot of nested function calls and makes
# code easier to read.
pantheria <- pantheria %>%
select(Order,
Family,
Genus,
Species,
TrophicLevel,
AdultBodyMass_g) %>%
drop_na() %>%
distinct()
pantheria <- select(pantheria,
Order,
Family,
Genus,
Species,
TrophicLevel,
AdultBodyMass_g)
pantheria <- drop_na(pantheria)
pantheria <- distinct(pantheria)
```
Data is almost never clean, for example there should be only 3 trophic levels:
```{r}
# unique elements of a vector
unique(pantheria$TrophicLevel)
```
Let's fix the TrophicLevel column using `mutate()`
```{r}
# mutate allows us to add columns or modify existing ones
pantheria <- mutate(pantheria,TrophicLevel = tolower(TrophicLevel))
pantheria %>%
mutate(AdultBodyMass_g = gsub("\\.", "-",AdultBodyMass_g))
```
## Exercise 5: Summarizing data
Now we can summarize the adult body mass by trophic level by computing standard metrics like mean and standard deviation.
```{r}
pantheria %>%
group_by(TrophicLevel) %>%
summarize(Mean = mean(AdultBodyMass_g),
Median = median(AdultBodyMass_g),
`Standard Deviation` = sd(AdultBodyMass_g),
Min = min(AdultBodyMass_g),
Max = max(AdultBodyMass_g))
```
## Exercise 6: Plotting Data
According to the table above, body masses have a really wide range across trophic levels. Let's visualize the distribution of adult body masses.
```{r}
# ggplot2 constructs graphics in layers, each layer is separated by "+"
# x and y values are supplied in aes(), the type of plot is specified using
# the "geom" functions
ggplot(data = pantheria, mapping = aes(x = log10(AdultBodyMass_g))) +
geom_histogram(bins = 50, fill ="#CE3274") +
xlab(label = "log10 Adult Body Mass") +
ylab(label = "Frequency") +
ggtitle(label = "Histogram of Adult Body Mass") +
theme_classic() +
theme(plot.title = element_text(hjust = .5, face = "bold"),
axis.title.x = element_text(face = "bold"))
```
The data looks skewed even after log10 transformation. Let's view the distribution by trophic level.
```{r}
pantheria %>%
ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) +
geom_histogram(bins = 50) +
facet_grid(rows = vars(TrophicLevel)) +
xlab(label = "log10 Adult Body Mass") +
ylab(label = "Frequency") +
ggtitle(label = "Histogram of Adult Body Mass") +
theme_classic() +
theme(plot.title = element_text(hjust = .5))
```
It is clear that trophic level does have an impact on the distribution of adult body mass, carnivores tend to be smaller which makes sense because carnivores have higher metabolic demands and so there might be a selection pressure towards smaller carnivores. If we wanted to confirm this by fitting a model, we could use the `lm()` function to fit a linear model.
## Exercise 7.1: Hands on coding
An important caveat of the data is that some Orders of mammals are more biodiverse than others and are therefore over represented in the dataset. Using the `dplyr` cheatsheet, write code generates a table of Orders and what percentage of the data they are. Scroll down to see the hint if you are having trouble.
```{r}
# Your code
pantheria %>%
group_by(Order) %>%
summarise(n = (n()/nrow(pantheria)) * 100) %>%
arrange(desc(n))
```
*Exercise 7.1 hint: group_by + summarize(n = n()) + arrange*
## Exercise 7.2: Hands on coding
Now that we see what the over represented Orders are, we can plot their body masses by trophic level to see if they are skewing the overall distributions.
```{r}
# Character vector of the top 2 Orders from above
top_orders <- c("Rodentia", "Chiroptera")
# filter uses a conditional to select rows from the data
pantheria %>%
filter(Order %in% top_orders) %>%
ggplot(mapping = aes(x = log10(AdultBodyMass_g), fill = Order)) +
geom_histogram(bins = ) +
facet_grid(rows = vars(TrophicLevel),
cols = vars(Order))
```
It looks like one of them is mostly made up of small carnivores. Let's remove it and redo the plot of body mass distribution by trophic level.
```{r}
pantheria %>%
filter(Order != "Chiroptera") %>%
ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel)) +
geom_histogram(bins = 40 ) +
facet_grid(rows = vars(TrophicLevel))
pantheria %>%
ggplot(mapping = aes(x = log10(AdultBodyMass_g),fill = TrophicLevel))
```
We can see now that body mass of carnivorous mammals is much less skewed than the initial plots show. There is still an effect of trophic level on body mass, but the effect size is likely much smaller than we would estimate by including all `r sum(pantheria$Order == "Chiroptera")` *Chiroptera*. Now that we have generated these plots, we can generate a full report that contains all of the text and code, click `Knit` to render the HTML report.
## End of workshop exercises
Hopefully this workshop has provided a good foundation for you to learn R. If you would like some additional practice, check out the resources on the [workshop wiki](https://github.com/gladstone-institutes/Bioinformatics-Workshops/wiki/Introduction-to-R-for-Data-Analysis). R also contains many built in datasets you can use for practice:
```{r, echo=FALSE}
available_datasets <- data()
available_datasets$results %>%
as_tibble() %>%
select(-LibPath) %>% knitr::kable()
```

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.8 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 81 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 281 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 111 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 737 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 223 KiB

View file

@ -0,0 +1,13 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX

View file

@ -0,0 +1 @@
*.html

Binary file not shown.

After

Width:  |  Height:  |  Size: 41 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

View file

@ -0,0 +1,86 @@
---
title: "Pre-Workshop Setup"
tutorial:
id: "intro-r-data-analysis_lesson_0"
version: 1.0
output:
learnr::tutorial:
theme: "lumen"
progressive: true
allow_skip: true
runtime: shiny_prerendered
description: >
Learn how to set up R and RStudio on your machine. We will also demonstrate
how to install R packages from CRAN, and install the tidyverse package.
---
```{r setup, include=FALSE}
library(learnr)
learnr::tutorial_options(exercise.timelimit = 10)
```
## Introduction
This guide will help you get set up for <ins>Intro to R Data Analysis</ins>. There are just a few steps to make sure you'll have the necessary software installed and ready to go on day 1. **Please ensure that you've completed each step by running the validation test prior to the start of the workshop**.
This guide will walk you through how to install R, RStudio, and some additional tools that well be using in the course. By rough analogy to a car, R is like the cars engine and RStudio is like the dashboard. More precisely, R is a programming language and Rstudio is an integrated development environment (IDE), which is basically a nice software interface for interacting with R. For our purposes, you will only ever interact directly with RStudio, but it needs to have R installed to work (like a car needing its engine).
Please complete the following steps (must be done in this order). If you already have R and Rstudio installed you can skip ahead. Make sure you complete step 5 though!
1. [Install R](#install-r)
2. [Install RStudio](#install-rstudio)
3. [Check you have recent versions of R](#check-you-have-a-recent-version-of-r)
4. [Install required packages](#install-required-packages)
5. [Run verification test](#run-verification-test)
Please consult the links provided for additional tips, and feel free to reach out for help by email [me](mailto:natalie.elphick@gladstone.ucsf.edu) if you get stuck.
## Install R
Please watch this quick video guide on how to install R, and then use the link below.
**NOTE MacOS users: With the new MacOS updates, updating R might require you to re-install your packages. While not in issue for many people, you have been warned**
![](https://vimeo.com/203516510)
https://cloud.r-project.org/
## Install RStudio
Please watch this quick video on how to install Rstudio, and then use the link below.
![](https://vimeo.com/203516968)
https://www.rstudio.com/products/rstudio/download/
## Install Required Packages
Many of the tools we will want to use do not come prepackaged with R, but rather need to be installed as packages. There are a few key packages we will be using. Watch the following video on how to install packages in Rstudio:
![](https://vimeo.com/203516241)
You can also refer to [this site](https://moderndive.netlify.app/1-getting-started.html#packages) for more info on what packages are and how to install them (including a GUI installation method if you prefer that).
Please install the [`tidyverse`](https://www.tidyverse.org/) R package, which well be relying on extensively throughout the course. No need to worry about what exactly this is yet, but you can read more [here](https://www.tidyverse.org/) if you like.
## Run verification test
Now its time to make sure you have everything installed properly!
First, open Rstudio (remember, you want to open Rstudio, not R). You should see a window that looks something like this:
<center>
<img src="images/rstudio_screenshot.png" width="40%">
</center>
The pane with the > symbol is the Console. This is where you enter R commands.
Copy and paste the following code into your Rstudio console and hit return.
```{r, eval=F}
R_version <- as.numeric(R.version['major']$major)
if (R_version >= 4) {
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
geom_point()
} else {
print('R version is too old')
}
```
You should see a plot that looks like this appear:
<center>
<img src="images/ggplot_output.png" width="40%">
</center>
If you see an error that says “R version is too old” that means you need to update your R version. The update process is the same as the installation process. It will update your R installation. If you see an error that says “There is no package called ggplot2” that means you need to install the tidyverse package (see above).

View file

@ -0,0 +1,10 @@
name: lesson_0
title:
username: natalie-elphick
account: natalie-elphick
server: shinyapps.io
hostUrl: https://api.shinyapps.io/v1
appId: 10952169
bundleId: 8154915
url: https://natalie-elphick.shinyapps.io/lesson_0/
version: 1

File diff suppressed because it is too large Load diff

7
intro-r-data-analysis/renv/.gitignore vendored Normal file
View file

@ -0,0 +1,7 @@
library/
local/
cellar/
lock/
python/
sandbox/
staging/

File diff suppressed because it is too large Load diff

View file

@ -0,0 +1,19 @@
{
"bioconductor.version": null,
"external.libraries": [],
"ignored.packages": [],
"package.dependency.fields": [
"Imports",
"Depends",
"LinkingTo"
],
"ppm.enabled": null,
"ppm.ignored.urls": [],
"r.version": null,
"snapshot.type": "implicit",
"use.cache": true,
"vcs.ignore.cellar": true,
"vcs.ignore.library": true,
"vcs.ignore.local": true,
"vcs.manage.ignores": true
}

View file

@ -1,34 +0,0 @@
#This script perform some basic calculations in R.
#To run this script you may select all and hit the Run button on top right of this pane ...
#... or go Cmd+A followed by Cmd+Enter on Mac. If you use Windows, ...
#... you can also go Ctrl+A followed by Ctrl+Enter.
#The following command will add 2 to 3 and store the value in variable named 'a'.
a <- sum(2,3)
#The following command will get the product of 2 and 3 and store it in 'b'.
b <- prod(2,3)
#Next, we check if a and b have equal values.
a == b
#Next, we check if a and b are not equal.
a != b
#Conclusion: Summing numbers is not the same as multiplying them!
#Time to write a paper? I think we can go to Nature or Science with this discovery.
#Check if two conditions are simultaneously true.
(a == b) & (sqrt(3) == 5)
#Check if any one of given two conditions are true.
#Vertical line is how we say 'or' in R.
(a == b) | (sqrt(3) == 5)
#Variables can also store characters.
name <- "Homo sapiens"
#Is Homo sapiens equal to human being?
name == "Human being"
#Apparently not?

View file

@ -1,50 +0,0 @@
#Vectors.
#This is a numeric vector.
a <- c(3, 5, 2.5, 0, 9)
mean(a)
#Vectors (as well as other data objects) may contain NA.
#NA stands for Not Available.
a <- c(3, 5, 6.8, NA)
mean(a)
#To ignore NA while taking mean.
mean(a, na.rm = TRUE)
#Vectors are ordered sequences.
#This is a character vector.
b <- c("Homo sapiens", "Martians", "Blue.Whales", "Homo sapiens")
table(b)
#----------
#Data frame
#Tabular data possibly with mix of numeric, character, factor or logical type entries.
#Example: Iris setosa data that we worked with.
df <- data.frame(a = 1:4,
name = b)
#---------
#Matrix
#Matrices are tabular like data frames but store only one type of entries.
df_matrix <- as.matrix(df)
#May convert between data structures.
df_numeric <- as.numeric(df_matrix)
df_character <- as.character(df_numeric)
#----------
#List
#Lists are flexible data structures.
#Lists have named fields which can contain arbitrary data—vectors, other lists, strings, functions, and anything else.
#We may wish to keep related data together in one place.
#Say you search for a sequence in genome using an R library.
#Your ideal output might have mixed entries. For example,
#if sequence is found on a chromosome, you want table containing start and end loci on the
#chromosome and matched length.
#If not found, you want a string that says "Not found."
result <- list(chr1 = data.frame(Start = c(5, 100, 200),
End = c(70, 150, 230),
Length = c(66, 50, 30)
),
chr2 = "Not found."
)

View file

@ -1,24 +0,0 @@
>J01859.1 Escherichia coli 16S ribosomal RNA, complete sequence
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGT
AACAGGAAGAAGCTTGCTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATG
GAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCG
GGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACG
ATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGG
CAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTT
CGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCG
CAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAAT
TACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAAC
TGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGT
AGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCG
TGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCC
TTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACT
CAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCT
TACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGC
TGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCT
TTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGA
CGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGA
CCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATG
AAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCG
CCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTT
TGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTT
A

File diff suppressed because it is too large Load diff

View file

@ -1,151 +0,0 @@
"Sepal.Length";"Sepal.Width";"Petal.Length";"Petal.Width";"Species"
"1";5.1;3.5;1.4;0.2;"setosa"
"2";4.9;3;1.4;0.2;"setosa"
"3";4.7;3.2;1.3;0.2;"setosa"
"4";4.6;3.1;1.5;0.2;"setosa"
"5";5;3.6;1.4;0.2;"setosa"
"6";5.4;3.9;1.7;0.4;"setosa"
"7";4.6;3.4;1.4;0.3;"setosa"
"8";5;3.4;1.5;0.2;"setosa"
"9";4.4;2.9;1.4;0.2;"setosa"
"10";4.9;3.1;1.5;0.1;"setosa"
"11";5.4;3.7;1.5;0.2;"setosa"
"12";4.8;3.4;1.6;0.2;"setosa"
"13";4.8;3;1.4;0.1;"setosa"
"14";4.3;3;1.1;0.1;"setosa"
"15";5.8;4;1.2;0.2;"setosa"
"16";5.7;4.4;1.5;0.4;"setosa"
"17";5.4;3.9;1.3;0.4;"setosa"
"18";5.1;3.5;1.4;0.3;"setosa"
"19";5.7;3.8;1.7;0.3;"setosa"
"20";5.1;3.8;1.5;0.3;"setosa"
"21";5.4;3.4;1.7;0.2;"setosa"
"22";5.1;3.7;1.5;0.4;"setosa"
"23";4.6;3.6;1;0.2;"setosa"
"24";5.1;3.3;1.7;0.5;"setosa"
"25";4.8;3.4;1.9;0.2;"setosa"
"26";5;3;1.6;0.2;"setosa"
"27";5;3.4;1.6;0.4;"setosa"
"28";5.2;3.5;1.5;0.2;"setosa"
"29";5.2;3.4;1.4;0.2;"setosa"
"30";4.7;3.2;1.6;0.2;"setosa"
"31";4.8;3.1;1.6;0.2;"setosa"
"32";5.4;3.4;1.5;0.4;"setosa"
"33";5.2;4.1;1.5;0.1;"setosa"
"34";5.5;4.2;1.4;0.2;"setosa"
"35";4.9;3.1;1.5;0.2;"setosa"
"36";5;3.2;1.2;0.2;"setosa"
"37";5.5;3.5;1.3;0.2;"setosa"
"38";4.9;3.6;1.4;0.1;"setosa"
"39";4.4;3;1.3;0.2;"setosa"
"40";5.1;3.4;1.5;0.2;"setosa"
"41";5;3.5;1.3;0.3;"setosa"
"42";4.5;2.3;1.3;0.3;"setosa"
"43";4.4;3.2;1.3;0.2;"setosa"
"44";5;3.5;1.6;0.6;"setosa"
"45";5.1;3.8;1.9;0.4;"setosa"
"46";4.8;3;1.4;0.3;"setosa"
"47";5.1;3.8;1.6;0.2;"setosa"
"48";4.6;3.2;1.4;0.2;"setosa"
"49";5.3;3.7;1.5;0.2;"setosa"
"50";5;3.3;1.4;0.2;"setosa"
"51";7;3.2;4.7;1.4;"versicolor"
"52";6.4;3.2;4.5;1.5;"versicolor"
"53";6.9;3.1;4.9;1.5;"versicolor"
"54";5.5;2.3;4;1.3;"versicolor"
"55";6.5;2.8;4.6;1.5;"versicolor"
"56";5.7;2.8;4.5;1.3;"versicolor"
"57";6.3;3.3;4.7;1.6;"versicolor"
"58";4.9;2.4;3.3;1;"versicolor"
"59";6.6;2.9;4.6;1.3;"versicolor"
"60";5.2;2.7;3.9;1.4;"versicolor"
"61";5;2;3.5;1;"versicolor"
"62";5.9;3;4.2;1.5;"versicolor"
"63";6;2.2;4;1;"versicolor"
"64";6.1;2.9;4.7;1.4;"versicolor"
"65";5.6;2.9;3.6;1.3;"versicolor"
"66";6.7;3.1;4.4;1.4;"versicolor"
"67";5.6;3;4.5;1.5;"versicolor"
"68";5.8;2.7;4.1;1;"versicolor"
"69";6.2;2.2;4.5;1.5;"versicolor"
"70";5.6;2.5;3.9;1.1;"versicolor"
"71";5.9;3.2;4.8;1.8;"versicolor"
"72";6.1;2.8;4;1.3;"versicolor"
"73";6.3;2.5;4.9;1.5;"versicolor"
"74";6.1;2.8;4.7;1.2;"versicolor"
"75";6.4;2.9;4.3;1.3;"versicolor"
"76";6.6;3;4.4;1.4;"versicolor"
"77";6.8;2.8;4.8;1.4;"versicolor"
"78";6.7;3;5;1.7;"versicolor"
"79";6;2.9;4.5;1.5;"versicolor"
"80";5.7;2.6;3.5;1;"versicolor"
"81";5.5;2.4;3.8;1.1;"versicolor"
"82";5.5;2.4;3.7;1;"versicolor"
"83";5.8;2.7;3.9;1.2;"versicolor"
"84";6;2.7;5.1;1.6;"versicolor"
"85";5.4;3;4.5;1.5;"versicolor"
"86";6;3.4;4.5;1.6;"versicolor"
"87";6.7;3.1;4.7;1.5;"versicolor"
"88";6.3;2.3;4.4;1.3;"versicolor"
"89";5.6;3;4.1;1.3;"versicolor"
"90";5.5;2.5;4;1.3;"versicolor"
"91";5.5;2.6;4.4;1.2;"versicolor"
"92";6.1;3;4.6;1.4;"versicolor"
"93";5.8;2.6;4;1.2;"versicolor"
"94";5;2.3;3.3;1;"versicolor"
"95";5.6;2.7;4.2;1.3;"versicolor"
"96";5.7;3;4.2;1.2;"versicolor"
"97";5.7;2.9;4.2;1.3;"versicolor"
"98";6.2;2.9;4.3;1.3;"versicolor"
"99";5.1;2.5;3;1.1;"versicolor"
"100";5.7;2.8;4.1;1.3;"versicolor"
"101";6.3;3.3;6;2.5;"virginica"
"102";5.8;2.7;5.1;1.9;"virginica"
"103";7.1;3;5.9;2.1;"virginica"
"104";6.3;2.9;5.6;1.8;"virginica"
"105";6.5;3;5.8;2.2;"virginica"
"106";7.6;3;6.6;2.1;"virginica"
"107";4.9;2.5;4.5;1.7;"virginica"
"108";7.3;2.9;6.3;1.8;"virginica"
"109";6.7;2.5;5.8;1.8;"virginica"
"110";7.2;3.6;6.1;2.5;"virginica"
"111";6.5;3.2;5.1;2;"virginica"
"112";6.4;2.7;5.3;1.9;"virginica"
"113";6.8;3;5.5;2.1;"virginica"
"114";5.7;2.5;5;2;"virginica"
"115";5.8;2.8;5.1;2.4;"virginica"
"116";6.4;3.2;5.3;2.3;"virginica"
"117";6.5;3;5.5;1.8;"virginica"
"118";7.7;3.8;6.7;2.2;"virginica"
"119";7.7;2.6;6.9;2.3;"virginica"
"120";6;2.2;5;1.5;"virginica"
"121";6.9;3.2;5.7;2.3;"virginica"
"122";5.6;2.8;4.9;2;"virginica"
"123";7.7;2.8;6.7;2;"virginica"
"124";6.3;2.7;4.9;1.8;"virginica"
"125";6.7;3.3;5.7;2.1;"virginica"
"126";7.2;3.2;6;1.8;"virginica"
"127";6.2;2.8;4.8;1.8;"virginica"
"128";6.1;3;4.9;1.8;"virginica"
"129";6.4;2.8;5.6;2.1;"virginica"
"130";7.2;3;5.8;1.6;"virginica"
"131";7.4;2.8;6.1;1.9;"virginica"
"132";7.9;3.8;6.4;2;"virginica"
"133";6.4;2.8;5.6;2.2;"virginica"
"134";6.3;2.8;5.1;1.5;"virginica"
"135";6.1;2.6;5.6;1.4;"virginica"
"136";7.7;3;6.1;2.3;"virginica"
"137";6.3;3.4;5.6;2.4;"virginica"
"138";6.4;3.1;5.5;1.8;"virginica"
"139";6;3;4.8;1.8;"virginica"
"140";6.9;3.1;5.4;2.1;"virginica"
"141";6.7;3.1;5.6;2.4;"virginica"
"142";6.9;3.1;5.1;2.3;"virginica"
"143";5.8;2.7;5.1;1.9;"virginica"
"144";6.8;3.2;5.9;2.3;"virginica"
"145";6.7;3.3;5.7;2.5;"virginica"
"146";6.7;3;5.2;2.3;"virginica"
"147";6.3;2.5;5;1.9;"virginica"
"148";6.5;3;5.2;2;"virginica"
"149";6.2;3.4;5.4;2.3;"virginica"
"150";5.9;3;5.1;1.8;"virginica"

View file

@ -1,172 +0,0 @@
#Reading data file.
dat <- read.table("GSE60450_Lactation-GenewiseCounts.txt")
#Let us examine how our data looks.
View(dat)
#Seems like all data points are there. Can we improve appearance?
#Examine the details of read.table command.
?read.table
#Looks like we can inform read.table about separator type and presence of header.
dat <- read.table("GSE60450_Lactation-GenewiseCounts.txt",
header= TRUE, sep = "\t")
#Let us examine how data looks now.
View(dat)
#What are the observations represented in our data?
#colnames gives the names of columns of data.
colnames(dat)
#To check the number of rows and columns in table.
dim(dat)
#To check first few rows of table.
head(dat)
#To check last few rows of table.
tail(dat)
#To check basic stats for each column.
summary(dat)
#Let us extract a column of data.
#For example, EntrezGeneID.
geneIds <- dat$EntrezGeneID
#Check what kind of variable geneIds is.
class(geneIds)
#geneIds should be string type.
dat$EntrezGeneID <- as.character(dat$EntrezGeneID)
#Check the class of gene ids again
class(dat$EntrezGeneID)
#Information about samples is in another file.
phenotype_info <- read.table("targets.csv",
header = TRUE,
sep = ",")
#The column named GEO in the table represents sample id on GEO.
#Status and CellType are factor levels for statistical analysis.
phenotype_info$CellType <- as.factor(phenotype_info$CellType)
#Check class of CellType. It is a factor variable now.
class(phenotype_info$CellType)
#Currently Status and CellType has repetition of the same values.
#Get unique values.
celltypes <- unique(phenotype_info$CellType)
#Perhaps, no point in keeping spcs as factor in the above object.
#Convert celltypes to character variable.
celltypes <- as.character(celltypes)
#Checking if a text is present in a character variable?
"cardiomyocyte" %in% celltypes
#Let us say we want subset of data corresponding to B cells.
which_B <- phenotype_info$CellType == "B"
phenotype_info_B <- phenotype_info[which_B, ]
#Class of which_B? Logical
class(which_B)
#Alternative way to subset data.
phenotype_info_B <- subset(phenotype_info,
CellType == "B")
#Check mean counts for a sample.
mean(dat$MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1)
#Check mean counts for all genes in the B cell samples.
clnames_dat <- colnames(dat)[-2:-1]
clnames_dat <- strsplit(clnames_dat, split = "_")
clnames_dat <- data.frame(clnames_dat)
clnames_dat <- t(clnames_dat)
rownames(clnames_dat) <- NULL
clnames_dat <- clnames_dat[, 1]
which_B <- which(clnames_dat %in% phenotype_info_B$X)
cnts_B <- dat[, which_B + 2]
#Estimate median counts for casein protein.
median(unlist(dat[dat$EntrezGeneID == "12992", -2:-1]))
median(unlist(dat[dat$EntrezGeneID == "12992", which_B + 2]))
#Estimate standard deviation.
sd(unlist(dat[dat$EntrezGeneID == "12992", -2:-1]))
sd(unlist(dat[dat$EntrezGeneID == "12992", which_B + 2]))
#Check histograms of data.
hist(log2(1 + dat$MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1))
hist(log2(1 + dat$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1))
#These histograms are not easy to compare.
#Perhaps, we can fix the axis limits.
hist(log2(1 + dat$MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1),
xlim = c(0, 20), ylim = c(0, 15000))
hist(log2(1 + dat$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1),
xlim = c(0, 20), ylim = c(0, 15000))
#Let us look at boxplots.
boxplot(log2(1 + dat$MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1),
ylim = c(0, 25))
boxplot(log2(1 + dat$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1),
ylim = c(0, 25))
#Scatter plots.
plot(x= log2(1 + dat$MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1),
y= log2(1 + dat$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1))
#Are higher counts associated with longer genes?
#Can we color data points based on length?
#install.packages("ggplot2")
library(ggplot2)
qplot(x = log2(1 + MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1),
y = log2(1 + MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1),
data = dat[1:1000, ], color = log10(Length))
#Color scale does not give clear insight.
#Can we use discrete color scale of points instead?
dat_subset <- dat[1:1000, ]
dat_subset$genetype <- "smallGene"
dat_subset$genetype[dat_subset$Length > median(dat_subset$Length)] <- "longGene"
qplot(x = log2(1 + MCL1.DG_BC2CTUACXX_ACTTGA_L002_R1),
y = log2(1 + MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1),
data = dat_subset, color = genetype)
#Check the boxplots for long and small genes simultaneously.
qplot(x = genetype,
y = log2(1 + MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1),
data = dat_subset, geom = "boxplot")
#Additional stuff.
#Hypothesis test.
#Are counts for long genes significantly different from the small genes?
dat_small <- subset(dat_subset, genetype == "smallGene")
dat_long <- subset(dat_subset, genetype == "longGene")
test <- t.test(dat_small$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1,
dat_long$MCL1.LC_BC2CTUACXX_GCCAAT_L001_R1)
test$p.value
#Conditional statements take a condition and perform steps depending on validitiy of the statement.
if (test$p.value < 0.05) {
print("Counts depend on gene length.")
} else {
print("Counts don't depend on gene length.")
}
#Looping for repeating the same task but on different data.
for (item in c("smallGene", "longGene")) {
y <- subset(dat_subset, genetype == item)
smry <- summary(y)
print(item)
print(smry)
}

View file

@ -1,13 +0,0 @@
,GEO,SRA,CellType,Status
MCL1.DG,GSM1480297,SRR1552450,B,virgin
MCL1.DH,GSM1480298,SRR1552451,B,virgin
MCL1.DI,GSM1480299,SRR1552452,B,pregnant
MCL1.DJ,GSM1480300,SRR1552453,B,pregnant
MCL1.DK,GSM1480301,SRR1552454,B,lactating
MCL1.DL,GSM1480302,SRR1552455,B,lactating
MCL1.LA,GSM1480291,SRR1552444,L,virgin
MCL1.LB,GSM1480292,SRR1552445,L,virgin
MCL1.LC,GSM1480293,SRR1552446,L,pregnant
MCL1.LD,GSM1480294,SRR1552447,L,pregnant
MCL1.LE,GSM1480295,SRR1552448,L,lactating
MCL1.LF,GSM1480296,SRR1552449,L,lactating
1 GEO SRA CellType Status
2 MCL1.DG GSM1480297 SRR1552450 B virgin
3 MCL1.DH GSM1480298 SRR1552451 B virgin
4 MCL1.DI GSM1480299 SRR1552452 B pregnant
5 MCL1.DJ GSM1480300 SRR1552453 B pregnant
6 MCL1.DK GSM1480301 SRR1552454 B lactating
7 MCL1.DL GSM1480302 SRR1552455 B lactating
8 MCL1.LA GSM1480291 SRR1552444 L virgin
9 MCL1.LB GSM1480292 SRR1552445 L virgin
10 MCL1.LC GSM1480293 SRR1552446 L pregnant
11 MCL1.LD GSM1480294 SRR1552447 L pregnant
12 MCL1.LE GSM1480295 SRR1552448 L lactating
13 MCL1.LF GSM1480296 SRR1552449 L lactating

View file

@ -0,0 +1,132 @@
/* Set font size and alignment for the message */
.bottom-message {
font-size: 0.8em !important;
font-style: italic !important;
text-align: center;
position: relative;
bottom: 0 !important;
left: 0;
right: 0;
}
/* Background color for code chunks */
.reveal pre code {
background-color: #d5d5d5 !important;
color: #333 !important;
font-size: 1.5em !important;
}
/* Left-align all code outputs */
.reveal pre code {
text-align: left !important;
}
/* Add horizontal scrolling to all code outputs */
.reveal pre code.output {
white-space: pre !important;
overflow-x: auto !important;
}
/* Change the font family used for code blocks */
pre, code, kbd, samp {
font-family: "Courier New", Courier, monospace;
}
/* Add horizontal scrolling to all code chunks */
.reveal pre code {
white-space: pre !important;
overflow-x: auto !important;
}
/* Bold slide titles and change color */
.reveal h2 {
font-weight: bold !important;
color: #9c0366;
}
/* Bold slide titles and change color */
.reveal h1 {
font-weight: bold !important;
color: #9c0366;
}
.reveal .slides>section:first-child h2 {
color: #333;
font-weight: normal !important;
}
/* Custom slide title */
.my-title-slide h1 {
font-weight: bold;
color: #9c0366;
}
.my-title-slide h2 {
color: #333;
font-weight: normal !important;
}
.reveal .slides>section:first-child h1 {
font-weight: bold !important;
color: #9c0366;
}
/* Increase the spacing between list items */
.reveal p {
text-align: left;
margin-left: 20px !important;
}
.reveal ul {
display: block;
margin-left: 75px !important;
margin-right: 50px !important;
}
.reveal ol {
display: block;
margin-left: 75px !important;
margin-right: 50px !important;
}
/* Increase the spacing between unordered list items */
.reveal ul li {
margin-bottom: 15px;
}
/* Increase the spacing between ordered list items */
.reveal ol li {
margin-bottom: 15px;
}
/* Make images fit in the slides, remove border, shadow and center align*/
.reveal img {
max-width: 100% !important;
max-height: 100% !important;
border: none !important;
box-shadow: none !important;
display: block !important;
margin: 0 auto !important;
margin-bottom: 10px !important;
}
/* Make a class for images that should be shrunk a little*/
.shrink {
max-width: 70% !important;
max-height: 70% !important;
margin: 0 auto !important;
}
small {
font-size: 70%;
}
.small-bullets ul {
font-size: 70%;
}
.less-small-bullets ul {
font-size: 80%;
}
.big-picture img{
max-width: 70%;
border: 1px solid black !important;
}

View file

@ -0,0 +1,273 @@
/* -------------------------------------------------------
*
* !! This file was generated by xaringanthemer !!
*
* Changes made to this file directly will be overwritten
* if you used xaringanthemer in your xaringan slides Rmd
*
* Issues or likes?
* - https://github.com/gadenbuie/xaringanthemer
* - https://www.garrickadenbuie.com
*
* Need help? Try:
* - vignette(package = "xaringanthemer")
* - ?xaringanthemer::style_xaringan
* - xaringan wiki: https://github.com/yihui/xaringan/wiki
* - remarkjs wiki: https://github.com/gnab/remark/wiki
*
* Version: 0.4.2
*
* ------------------------------------------------------- */
@import url(https://fonts.googleapis.com/css?family=Noto+Sans:400,400i,700,700i&display=swap);
@import url(https://fonts.googleapis.com/css?family=Cabin:600,600i&display=swap);
@import url(https://fonts.googleapis.com/css?family=Source+Code+Pro:400,700&display=swap);
:root {
/* Fonts */
--text-font-family: 'Noto Sans';
--text-font-is-google: 1;
--text-font-family-fallback: -apple-system, BlinkMacSystemFont, avenir next, avenir, helvetica neue, helvetica, Ubuntu, roboto, noto, segoe ui, arial;
--text-font-base: sans-serif;
--header-font-family: Cabin;
--header-font-is-google: 1;
--header-font-family-fallback: Georgia, serif;
--code-font-family: 'Source Code Pro';
--code-font-is-google: 1;
--base-font-size: 20px;
--text-font-size: 1rem;
--code-font-size: 0.9rem;
--code-inline-font-size: 1em;
--header-h1-font-size: 2.75rem;
--header-h2-font-size: 2.25rem;
--header-h3-font-size: 1.75rem;
/* Colors */
--text-color: #839496;
--header-color: #dc322f;
--background-color: #002b36;
--link-color: #b58900;
--text-bold-color: #d33682;
--code-highlight-color: #268bd240;
--inverse-text-color: #002b36;
--inverse-background-color: #fdf6e3;
--inverse-header-color: #002b36;
--inverse-link-color: #b58900;
--title-slide-background-color: #fdf6e3;
--title-slide-text-color: #002b36;
--header-background-color: #dc322f;
--header-background-text-color: #002b36;
}
html {
font-size: var(--base-font-size);
}
body {
font-family: var(--text-font-family), var(--text-font-family-fallback), var(--text-font-base);
font-weight: normal;
color: var(--text-color);
}
h1, h2, h3 {
font-family: var(--header-font-family), var(--header-font-family-fallback);
font-weight: 600;
color: var(--header-color);
}
.remark-slide-content {
background-color: var(--background-color);
font-size: 1rem;
padding: 16px 64px 16px 64px;
width: 100%;
height: 100%;
}
.remark-slide-content h1 {
font-size: var(--header-h1-font-size);
}
.remark-slide-content h2 {
font-size: var(--header-h2-font-size);
}
.remark-slide-content h3 {
font-size: var(--header-h3-font-size);
}
.remark-code, .remark-inline-code {
font-family: var(--code-font-family), Menlo, Consolas, Monaco, Liberation Mono, Lucida Console, monospace;
}
.remark-code {
font-size: var(--code-font-size);
}
.remark-inline-code {
font-size: var(--code-inline-font-size);
color: #6c71c4;
}
.remark-slide-number {
color: #586e75;
opacity: 1;
font-size: 0.9rem;
}
strong {
font-weight: bold;
color: var(--text-bold-color);
}
a, a > code {
color: var(--link-color);
text-decoration: none;
}
.footnote {
position: absolute;
bottom: 60px;
padding-right: 4em;
font-size: 0.9em;
}
.remark-code-line-highlighted {
background-color: var(--code-highlight-color);
}
.inverse {
background-color: var(--inverse-background-color);
color: var(--inverse-text-color);
}
.inverse h1, .inverse h2, .inverse h3 {
color: var(--inverse-header-color);
}
.inverse a, .inverse a > code {
color: var(--inverse-link-color);
}
.title-slide, .title-slide h1, .title-slide h2, .title-slide h3 {
color: var(--title-slide-text-color);
}
.title-slide {
background-color: var(--title-slide-background-color);
}
.title-slide .remark-slide-number {
display: none;
}
/* Two-column layout */
.left-column {
width: 20%;
height: 92%;
float: left;
}
.left-column h2, .left-column h3 {
color: #586e75;
}
.left-column h2:last-of-type, .left-column h3:last-child {
color: #93a1a1;
}
.right-column {
width: 75%;
float: right;
padding-top: 1em;
}
.pull-left {
float: left;
width: 47%;
}
.pull-right {
float: right;
width: 47%;
}
.pull-right + * {
clear: both;
}
img, video, iframe {
max-width: 100%;
}
blockquote {
border-left: solid 5px #cb4b16;
padding-left: 1em;
}
.remark-slide table {
margin: auto;
border-top: 1px solid #657b83;
border-bottom: 1px solid #657b83;
}
.remark-slide table thead th {
border-bottom: 1px solid #657b83;
}
th, td {
padding: 5px;
}
.remark-slide table:not(.table-unshaded) thead,
.remark-slide table:not(.table-unshaded) tfoot,
.remark-slide table:not(.table-unshaded) tr:nth-child(even) {
background: #073642;
}
table.dataTable tbody {
background-color: var(--background-color);
color: var(--text-color);
}
table.dataTable.display tbody tr.odd {
background-color: var(--background-color);
}
table.dataTable.display tbody tr.even {
background-color: #073642;
}
table.dataTable.hover tbody tr:hover, table.dataTable.display tbody tr:hover {
background-color: rgba(255, 255, 255, 0.5);
}
.dataTables_wrapper .dataTables_length, .dataTables_wrapper .dataTables_filter, .dataTables_wrapper .dataTables_info, .dataTables_wrapper .dataTables_processing, .dataTables_wrapper .dataTables_paginate {
color: var(--text-color);
}
.dataTables_wrapper .dataTables_paginate .paginate_button {
color: var(--text-color) !important;
}
/* Horizontal alignment of code blocks */
.remark-slide-content.left pre,
.remark-slide-content.center pre,
.remark-slide-content.right pre {
text-align: start;
width: max-content;
max-width: 100%;
}
.remark-slide-content.left pre,
.remark-slide-content.right pre {
min-width: 50%;
min-width: min(40ch, 100%);
}
.remark-slide-content.center pre {
min-width: 66%;
min-width: min(50ch, 100%);
}
.remark-slide-content.left pre {
margin-left: unset;
margin-right: auto;
}
.remark-slide-content.center pre {
margin-left: auto;
margin-right: auto;
}
.remark-slide-content.right pre {
margin-left: auto;
margin-right: unset;
}
/* Slide Header Background for h1 elements */
.remark-slide-content.header_background > h1 {
display: block;
position: absolute;
top: 0;
left: 0;
width: 100%;
background: var(--header-background-color);
color: var(--header-background-text-color);
padding: 2rem 64px 1.5rem 64px;
margin-top: 0;
box-sizing: border-box;
}
.remark-slide-content.header_background {
padding-top: 7rem;
}
@page { margin: 0; }
@media print {
.remark-slide-scaler {
width: 100% !important;
height: 100% !important;
transform: scale(1) !important;
top: 0 !important;
left: 0 !important;
}
}