Jan 2023 update

This commit is contained in:
reubenthomas 2023-01-27 07:39:02 -08:00
parent 4d830c1de6
commit f7cfab4033

View file

@ -1,7 +1,7 @@
---
title: 'single-cell RNA-seq workshop: Session 3'
author: "Reuben Thomas"
date: "11/10/2020"
date: "10/25/2021"
output: html_document
---
@ -11,11 +11,10 @@ knitr::opts_chunk$set(echo = TRUE)
## Background
Krishna went over a series of steps involved in processing scRNA-seq data in general. The aims of these steps included loading, filtering, normalizing for differences between cells, visualizing and clustering the data. He used the data from this study that aimed to understand the effects of stromal cells in developing tumors.
We went over a series of steps involved in processing scRNA-seq data in general. The aims of these steps included loading, filtering, normalizing for differences between cells, visualizing and clustering the data. He used the data from this study that aimed to understand the effects of stromal cells in developing tumors.
To further illustrate and develop the ideas, methods and steps, I will use a subset of cells from this melanoma data - cells marked by CD45- GFP+ CD31- or inferred as cancer-associated fibroblasts at the 5 day and 11 day time-points. The main reason for choosing this subset is so that we have data of a managable size to perform the planned analyses in this section in the alloted time. You should be able to extend the methods/code to data with larger number of cells and/or variables.
**Biological question:**
1. Identify the main cell-types (or clusters) in the data using marker genes
@ -27,8 +26,7 @@ To further illustrate and develop the ideas, methods and steps, I will use a sub
We are interested in the effect of time on gene expression in cancer-associated fibroblasts. However, the expression of gene in a cell is variable not just because of biological reasons like cell-to-cell (intra-animal) and animal-to-animal (inter-animal) variability but also due to technical reasons like the differences in sequencing depth from cell-to-cell, library preparation, animal handling etc. If we don't fully account for these sources of variation then our results/interpretation may be incorrect. For example, the clustering of cells may be driven by some techninal factors.
Ideally, the claim we would like to make would be as generalizable as possible, i.e., if somebody else were to repeat the experiment above, go back and randomly sample animals, randomly sample cells from each of these animals at two time-points and sequence the RNA in these cells they would make similar claims. So we would like to demonstrate to a sceptical reviewer that despite all the variability in expression we can claim that the fact that we observe mean expression of a gene at day 11 is _x_ times higher than its expression at day 5 is unlikely to driven by random chance. Therefore, in arriving at our conclusions we would formally need to account for the different sources of variation. We will do go over these four steps in the following code:
Ideally, the claim we would like to make would be as generalizable as possible, i.e., if somebody else were to repeat the experiment above, go back and randomly sample animals, randomly sample cells from each of these animals at two time-points and sequence the RNA in these cells they would make similar claims. So we would like to demonstrate to a sceptical reviewer that despite all the variability in expression we can claim that the fact that we observe mean expression of a gene at day 11 is *x* times higher than its expression at day 5 is unlikely to driven by random chance. Therefore, in arriving at our conclusions we would formally need to account for the different sources of variation. We will do go over these four steps in the following code:
1. Normalization
2. Identification of marker genes
@ -44,9 +42,8 @@ suppressMessages(library(muscat))
suppressMessages(library(SummarizedExperiment))
suppressMessages(library(tidyverse))
suppressMessages(library(magrittr))
#suppressMessages(library(snifter))
#source a function I slightly modified from the muscat package for the within-cluster multi-sample multi-condition comparison
source("pbDS_update.R")
raw_data <- read.csv("rawCounts.csv", header = T)
pheno_data <- read.csv("sub_pheno_data.csv", header = T)
@ -111,12 +108,15 @@ data <- subset(x=data, subset=nFeature_RNA > 200 & nCount_RNA > quantnCountRNA &
print(sprintf("After filtering outliers: %d cells and %d genes", ncol(data), nrow(data)))
```
## Normalization
Now, we will perform sctranform based normalization and visualize the results
```{r}
data <- SCTransform(data, method="qpoisson", vars.to.regress = NULL)
data <- RunPCA(data, verbose = FALSE)
data <- RunTSNE(data, dims = 1:30, verbose = FALSE)
data <- RunTSNE(data, dims = 1:30, verbose = FALSE, method="Flt-SNE", initialization="pca", learning_rate=2400, perplexity=30)
data <- FindNeighbors(data, dims = 1:30, verbose = FALSE)
data <- FindClusters(data, verbose = FALSE)
@ -130,9 +130,11 @@ FeatureScatter(object=data, feature1="nFeature_RNA", feature2="PC_1")
FeaturePlot(object = data, features = "nFeature_RNA")
FeaturePlot(object = data, features = "PC_1")
```
##Find marker genes for each cluster
We will find markers using the Wilcoxon two-sample test.
##Find marker genes for each cluster We will find markers using the Wilcoxon two-sample test.
```{r}
## Find markers for each of the 5 clusters
# MarkersRes <- FindAllMarkers(data, assay = "SCT", slot = "data", test.use = "wilcox", return.thresh = 1e-6)
@ -149,22 +151,6 @@ VlnPlot(cluster0_data, features = "Tac1-17705", group.by = "Time")
VlnPlot(cluster0_data, features = "Gm10116-5887", group.by = "Time")
VlnPlot(cluster0_data, features = "Hspa1l-9322", group.by = "Time")
# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Tac1-17705",]
# MarkersRes0[row.names(MarkersRes0)=="Tac1-17705",]
# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Gm10116-5887",]
# temp_ClusterResult <- tbl[[1]]
# temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < 1, ]
# temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
# temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
# temp_topClusterResult[row.names(temp_topClusterResult)=="Gm10116-5887", ]
#
# pb@assays@data[[1]][row.names(pb@assays@data[[1]]) == "Hspa1l-9322",]
# temp_ClusterResult <- tbl[[1]]
# temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < 1, ]
# temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
# temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
# temp_topClusterResult[row.names(temp_topClusterResult)=="Hspa1l-9322", ]
## Find genes associated with time in cluster 1.
## Note: this is not the recommended approach to do this since instead of individual mice being treated as replicates, the cells are being treated as such
@ -174,9 +160,11 @@ VlnPlot(cluster1_data, features = "Il1rl1-185", group.by = "Time")
```
## Multi-sample multi-condition comparison
We will aggregate the counts across cells from each mouse within each cluster so that now we will be able to perform a pseudo-bulk RNA-seq differential expression separtely within each cluster. Note, we will be able to treat individual mice as replicates with this analyses. For the analyses in this section, we are moving away from Seurat. Hence we need to create a new single-cell RNA-seq object that the typical bioconductor package will recognise.
### Within -cluster comparison
The bioconductor package **muscat** will help us with this analysis. I had to slightly modify the primary function in this package (*pbDS*) to work with this data. The modified version (that you have downloaded) is what I sourced at the start of this document. I illustrate below how to set up the design matrix to perform the differential expression analyses. I prefer this approach as opposed to typical approach to assuming a two condition comparison. This way you have a lot of flexibility in modeling more complex design with more than one variable along with interactions between variables of interest.
```{r}
@ -229,45 +217,49 @@ pb <- aggregateData(sce,
group_id <- colData(pb)[,1]
mm <- model.matrix(~ group_id)
colnames(mm) <- levels(pb$group_id)
row.names(mm) <- row.names(colData(pb))
# run DS analysis; This is only two group/sample comparison, so we are interested in testing the significance of the second coefficient (that represent the log FC of gene expression between 5 day time-point and the 11 day time-point)
##Note, here I am using the modified version of the function included in the muscat package
res <- pbDS_update(pb, design=mm, coef=c(2),method="edgeR", verbose=TRUE, min_cells = 3)
res <- pbDS(pb, design=mm, coef=2,method="edgeR", verbose=TRUE, min_cells = 3)
##The results are going to be placed in topClusterResults object
tbl <- res$table
tbl <- res$table$X5_day
ClusterNo <- 0
p_thresh <- 0.05
topClusterResults <- NULL
for(i in 1:length(tbl)) {
temp_ClusterResult <- tbl[[i]]
temp_topClusterResult <- temp_ClusterResult[temp_ClusterResult$p_adj.loc < p_thresh, ]
temp_topClusterResult <- temp_topClusterResult[order(temp_topClusterResult$p_adj.loc),]
temp_topClusterResult <- data.frame(Gene=row.names(temp_topClusterResult), temp_topClusterResult)
topClusterResults <- rbind(topClusterResults, temp_topClusterResult)
temp_ClusterResult %<>% filter(p_val < p_thresh)
temp_ClusterResult %<>% slice(order(p_val))
temp_ClusterResult %<>% rownames_to_column(., var = "Gene")
topClusterResults %<>% rbind(., temp_ClusterResult)
}
head(topClusterResults)
```
### Between -cluster comparison
For this analyses, we will use the **lme4** package in R to fit generalized linear mixed effects models. We are going the model the change in the chance (or more formally the odds) of cells from a given mouse belonging to a given cluster from the 5 day to the 11 day time-point. The random effects part of these models captures the inherent correlation between the cells coming from the same mouse
```{r}
##the number of cells in each mouse in each cluster
Ncells <- as.data.frame(metadata(pb)$n_cells)
print(Ncells)
Ncells <- do.call("rbind",(pb@int_colData$n_cells))
TotalCells <- rowSums(Ncells, na.rm = T)
names(TotalCells) <- row.names(Ncells)
##determine the total number of cells per mouse
Ncells <- Ncells %>%
filter(Freq > 0)
short_Ncells <- Ncells %>%
spread(sample_id, Freq)
TotalCells <- colSums(short_Ncells[,-1], na.rm = T)
names(TotalCells) <- colnames(short_Ncells)[-1]
Ncells %<>% as.data.frame() %>% rownames_to_column(., var = "sample_id")
Ncells %<>% gather(.,"cluster_id", "Freq", -"sample_id")
Clusters <- unique(Ncells$cluster_id)
Nclusters <- length(Clusters)
Ncells %<>% mutate(TotalCells = rep(TotalCells, Nclusters))
##load the cluster info and the meta-data per mouse
Clusters <- unique(Ncells$cluster_id)
SampleInfo <- colData(pb)
SampleInfo %<>% as.data.frame() %>%
rownames_to_column(var = "sample_id")
Ncells %<>% merge(., SampleInfo)
##function to estimate the change in the odds of cluster membership from the 5 day to the 11 day time-point
estimateCellStateChange <- function(k, Ncells, TotalCells, SampleInfo) {
require(lme4)
@ -275,24 +267,30 @@ estimateCellStateChange <- function(k, Ncells, TotalCells, SampleInfo) {
print(paste("Cluster", k))
Ncells_sub <- Ncells %>%
filter(cluster_id==k)
FitData <- NULL
TempIndices <- match(Ncells_sub$sample_id, names(TotalCells))
TotalCells_sub <- TotalCells[TempIndices]
for(i in 1:length(TotalCells_sub)) {
InCluster=c(rep(1, Ncells_sub$Freq[i]), rep(0, (TotalCells_sub[i]-Ncells_sub$Freq[i])))
Time=rep(SampleInfo$group_id, TotalCells_sub[i])
ID=rep(row.names(SampleInfo)[i], TotalCells_sub[i])
TempData <- data.frame(ID, InCluster, Time)
FitData <- rbind(FitData, TempData)
}
FitData$Time <- relevel(FitData$Time, ref="X5_day")
FitData$InCluster <- as.factor(FitData$InCluster)
glmerFit1 <- glmer(InCluster ~ (1|ID) + Time, data=FitData, family = "binomial")
sglmerFit1 <- summary(glmerFit1)
TempRes1 <- (sglmerFit1$coefficients[-1,])
glmerFit <- glmer(cbind(Freq, (TotalCells - Freq)) ~ (1|sample_id) + group_id, data=Ncells_sub, family = "binomial", control = glmerControl(optimizer = "bobyqa"))
sglmerFit <- summary(glmerFit)
TempRes <- (sglmerFit$coefficients[-1,])
return(TempRes1)
#
# FitData <- NULL
# TempIndices <- match(Ncells_sub$sample_id, names(TotalCells))
# TotalCells_sub <- TotalCells[TempIndices]
# for(i in 1:length(TotalCells_sub)) {
# InCluster=c(rep(1, Ncells_sub$Freq[i]), rep(0, (TotalCells_sub[i]-Ncells_sub$Freq[i])))
# Time=rep(SampleInfo$group_id, TotalCells_sub[i])
# ID=rep(row.names(SampleInfo)[i], TotalCells_sub[i])
# TempData <- data.frame(ID, InCluster, Time)
# FitData <- rbind(FitData, TempData)
# }
# FitData$Time <- relevel(FitData$Time, ref="X5_day")
# FitData$InCluster <- as.factor(FitData$InCluster)
# glmerFit1 <- glmer(InCluster ~ (1|ID) + Time, data=FitData, family = "binomial")
# sglmerFit1 <- summary(glmerFit1)
# TempRes1 <- (sglmerFit1$coefficients[-1,])
#
#
return(TempRes)
}
ClusterRes <- sapply(Clusters, estimateCellStateChange, Ncells, TotalCells, SampleInfo)
@ -304,15 +302,20 @@ ClusterRes <- data.frame(ClusterRes)
colnames(ClusterRes)[c(1,4)] <- c("logOddsRatio_11day_vs_5day","pvalue")
##perform multiple-testing correction
ClusterRes <- data.frame(ClusterRes, p.adjust= p.adjust(ClusterRes$pvalue, method = "BH"))
##output the results
print(ClusterRes)
ClusterRes %<>% mutate(p.adjust = p.adjust(pvalue, method = "BH"))
ggplot(Ncells, aes(x=group_id, y=(Freq+1)/TotalCells)) +
geom_boxplot() +
geom_point() +
scale_y_log10() +
ylab("proportion of cells") +
facet_grid(cols = vars(cluster_id))
```
## Batch correction
Below, we will go over code to remove potential batch effects in the data using the IntegrateData function in Seurat 3. We will use the same data as before, assuming that the data was generated in two batches. It was not, :). But lets pretend anyway.
```{r}
## Let us visualize the uncorrected data
DimPlot(data, reduction = "tsne", group.by = "Individual")
@ -350,7 +353,7 @@ DimPlot(batch.integrated, reduction = "tsne", group.by = "Time")
DimPlot(batch.integrated, reduction = "tsne", group.by = "batch")
```
```{r}
sessionInfo()
```