Gladstone-Bioinformatics-Wo.../intro-sc-atac-seq/ArchR_demo_2_analysis.Rmd

---
title: 'Hands-on component of Single-cell ATAC-seq workshop'
author: "Ayushi Agrawal"
date: "2/9/2023"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Introduction

We'll review the current practices for some of the common steps in scATAC-seq data analysis in the Sessions 1-2. We'll discuss different practices for each step and the assumptions underlying various tools and their limitations. This document provides an exposure to one of the popular tools for such analysis. As we'll discuss, the right choice of method in any application depends on a number of factors, including the biological systems under study and the characteristics of the data in hand. For the purpose of our workshop, we'll limit the hands-on component to the ArchR package in R. In general, analysis might require multiple tools in different languages and/or novel development. For more, please see the slide deck in materials.

The following is based on [this vignette](https://www.archrproject.com//articles/Articles/tutorial.html) from the ArchR developers. Please note that ArchR is designed to be run on Unix-based operating systems such as macOS and linux. ArchR is NOT supported on Windows or other operating systems.

## Setup the working environment

```{r message=FALSE, warning=FALSE}
#load the ArchR library
library(ArchR)

#set a seed to facilitate replication of operations requiring randomization
set.seed(1)

#default number of Parallel threads is 16
# working on a local computer, 1 thread works best
addArchRThreads(1)

#Before we begin, we need add a reference genome annotation for ArchR
#to have access to chromosome and gene information. ArchR supports hg19, hg38, mm9, and mm10.
addArchRGenome("hg38")
```

## Create a vector with Arrow file names
Please refer to ArchR_demo_1_create_arrow_files.Rmd notebook for the steps to create arrow files.

Strict quality control (QC) of scATAC-seq data is essential to remove the contribution of low-quality cells. In ArchR, we consider three characteristics of data:

  1. The number of unique nuclear fragments (i.e. not mapping to mitochondrial DNA).
  2. The signal-to-background ratio. Low signal-to-background ratio is often attributed to dead or dying cells which have
  de-chromatinzed DNA which allows for random transposition genome-wide.
  3. The fragment size distribution. Due to nucleosomal periodicity, we expect to see depletion of fragments that are the
  length of DNA wrapped around a nucleosome (approximately 147 bp).

```{r}
# create a character vector of Arrow file paths
ArrowFiles <- list.files(path = "arrow_files", pattern = "3k.arrow", full.names = TRUE)

ArrowFiles
```


## Creating an ArchRProject
With our Arrow files in hand, we are now ready to create an ArchRProject. An ArchRProject is associated with a set of Arrow files and is the backbone of nearly all ArchR analyses.

```{r}
#Create ArchR project
demo_proj <- ArchRProject(
  ArrowFiles = ArrowFiles,
  outputDirectory = "results_singularity",
  copyArrows = TRUE #This is recommened so that you maintain an unaltered copy for later usage.
)

#We can also ask which data matrices are available within the ArchRProject
#which will be useful downstream once we start adding to this project.
getAvailableMatrices(demo_proj)
```
## Inferring Doublets
After Arrow file creation, we can infer potential doublets (a single droplet containing multiple cells) that can confound downstream results. This is done using the addDoubletScores() function.

```{r}
doubScores <- addDoubletScores(
  input = demo_proj,
  k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
  knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search with doublet projection.
  LSIMethod = 1
)
```

## Filter out the doublets
Now we can filter putative doublets based on the previously determined doublet scores using the filterDoublets() function. This doesn’t physically remove data from the Arrow files but rather tells the ArchRProject to ignore these cells for downstream analysis.

```{r}
#filtering based on QC plots generated by ArchR
#demo_proj<- demo_proj[demo_proj$TSSEnrichment > 10 & demo_proj$nFrags > 10000 ]

#filter out doublets
demo_proj <- filterDoublets(demo_proj)

#save the ArchR Project
saveArchRProject(ArchRProj = demo_proj,  load = FALSE)
```

## Dimensionality reduction & clustering
ArchR implements an iterative LSI dimensionality reduction via the addIterativeLSI() function.

```{r}
demo_proj <- addIterativeLSI(
  ArchRProj = demo_proj,
  useMatrix = "TileMatrix",
  name = "IterativeLSI",
  clusterParams = list(
    resolution = 0.2, #might want to start with lower resolution. This parameter is passed to Seurat's FindClusters
    sampleCells = 10000,
    n.start = 10
  ),
  varFeatures = 25000,
  dimsToUse = 1:30
)
```

To call clusters in this reduced dimension sub-space, we use the addClusters() function which uses Seurat’s graph clustering as the default clustering method.

```{r}
demo_proj <- addClusters(
  input = demo_proj,
  reducedDims = "IterativeLSI",
  method = "Seurat",
  name = "Clusters",
  resolution = 0.2 #The final clustering is done in the LSI space. Once LSI is finalized, tune this parameter to control number of clusters. It is passed to Seurat's FindClusters.
)
```

## Visualizing in a 2D UMAP Embedding
We can visualize our scATAC-seq data using a 2-dimensional representation such as Uniform Manifold Approximation and Projection (UMAP). To do this, we add a UMAP embedding to our ArchRProject object with the addUMAP() function. This function uses the uwot package to perform UMAP.

```{r}
demo_proj <- addUMAP(
  ArchRProj = demo_proj,
  reducedDims = "IterativeLSI",
  name = "UMAP",
  minDist = 0.8,
  force = TRUE
)

#Save ArchR project
saveArchRProject(ArchRProj = demo_proj, load = TRUE)
```

Using this UMAP, we can visualize various attributes of our cells which are stored in a matrix called cellColData in our ArchRProject. To do this, we use the plotEmbedding() function and we specify the variable to use for coloration via a combination of the colorBy and name parameters.

```{r}
#color UMAP by “Sample”
p1 <- plotEmbedding(ArchRProj = demo_proj, colorBy = "cellColData",
                    name = "Sample", embedding = "UMAP")

#color UMAP by “Clusters”
p2 <- plotEmbedding(ArchRProj = demo_proj, colorBy = "cellColData",
                    name = "Clusters", embedding = "UMAP")

#view the plots
ggAlignPlots(p1, p2, type = "h")

#save an editable vectorized version of this plot
plotPDF(p1,p2, name = "Plot-UMAP-Sample-Clusters.pdf",
        ArchRProj = demo_proj, addDOC = FALSE, width = 5, height = 5)
```

##Harmony-based batch correction
```{r}
demo_proj <- addHarmony(
    ArchRProj = demo_proj,
    reducedDims = "IterativeLSI",
    name = "Harmony",
    groupBy = "Sample"
)

demo_proj <- addUMAP(
  ArchRProj = demo_proj,
  reducedDims = "Harmony",
  name = "UMAP",
  minDist = 0.8,
  force = TRUE
)

demo_proj <- addClusters(
  input = demo_proj,
  reducedDims = "Harmony",
  method = "Seurat",
  name = "Clusters",
  force = TRUE,
  resolution = 0.2 #The final clustering is done in the LSI space. Once LSI is finalized, tune this parameter to control number of clusters. It is passed to Seurat's FindClusters.
)

#color UMAP by “Sample”
p1 <- plotEmbedding(ArchRProj = demo_proj, colorBy = "cellColData",
                    name = "Sample", embedding = "UMAP")

#color UMAP by “Clusters”
p2 <- plotEmbedding(ArchRProj = demo_proj, colorBy = "cellColData",
                    name = "Clusters", embedding = "UMAP")

#view the plots
ggAlignPlots(p1, p2, type = "h")

plotPDF(p1,p2, name = "Plot-UMAP-Sample-Clusters-Batch-Corrected.pdf",
        ArchRProj = demo_proj, addDOC = FALSE, width = 5, height = 5)

```

## Identifying Marker Genes
ArchR enables unbiased identification of marker features for any given cell groupings (for example, clusters). These features can be anything - peaks, genes (based on gene scores), or transcription factor motifs (based on chromVAR deviations). ArchR does this using the getMarkerFeatures() function which can take as input any matrix via the useMatrix parameter and it identifies features unique to the groups indicated by the groupBy parameter. If the useMatrix parameter is set to “GeneScoreMatrix”, then the function will identify the genes that appear to be uniquely active in each cell type. This provides an unbiased way of seeing which genes are predicted to be active in each cluster and can aid in cluster annotation.

```{r}
markersGS <- getMarkerFeatures(
  ArchRProj = demo_proj,
  useMatrix = "GeneScoreMatrix",
  groupBy = "Clusters",
  bias = c("TSSEnrichment", "log10(nFrags)"),
  testMethod = "wilcoxon"
)

#We can get a list of DataFrame objects, one for each of our clusters,
#containing the relevant marker features using the getMarkers() function.
markerList <- getMarkers(markersGS, cutOff = "FDR <= 0.01 & Log2FC >= 1.25")
markerList$C6

#save the marker genes object
saveRDS(markersGS, file = paste0("results_singularity/", "markerGS.rds"))
```

## Calling Peaks with TileMatrix
Calling peaks is one of the most fundamental processes in ATAC-seq data analysis. Because per-cell scATAC-seq data is essentially binary (accessible or not accessible), we cannot call peaks on an individual cell basis. For this reason, we defined groups of cells, typically clusters, in the previous steps.

By default ArchR attempts to call peaks using MACS2; however, ArchR also implements its own native peak caller which could be used when MACS2 cannot be installed. While we have benchmarked this peak caller against MACS2 and note very similar performances, we do not recommend using this native peak caller unless absolutely necessary. The ArchR native peak caller calls peaks on the 500-bp TileMatrix and we indicate to addReproduciblePeakSet() that we want to use this peak caller via the peakMethod parameter.

```{r}
demo_proj <- addGroupCoverages(ArchRProj = demo_proj, groupBy = "Clusters")
demo_proj <- addReproduciblePeakSet(
    ArchRProj = demo_proj,
    groupBy = "Clusters",
    pathToMacs2 = findMacs2(),
    method = "q"
)

# ##if you don't have macs2 installed on your computer
# demo_proj <- addReproduciblePeakSet(
#     ArchRProj = demo_proj,
#     groupBy = "Clusters",
#     peakMethod =  "Tiles",
#     method = "q"
# )

#We can similarly examine this merged peak set.
getPeakSet(demo_proj)

```
##Marker peaks
```{r}
demo_proj <- addPeakMatrix(demo_proj)
getAvailableMatrices(demo_proj)
markersPeaks <- getMarkerFeatures(
    ArchRProj = demo_proj,
    useMatrix = "PeakMatrix",
    groupBy = "Clusters",
  bias = c("TSSEnrichment", "log10(nFrags)"),
  testMethod = "wilcoxon"
)

markerList <- getMarkers(markersPeaks, cutOff = "FDR <= 0.01 & Log2FC >= 1")
markerList[[1]]

```

```{r}
heatmapPeaks <- markerHeatmap(
  seMarker = markersPeaks,
  cutOff = "FDR <= 0.1 & Log2FC >= 0.5",
  transpose = TRUE
)
draw(heatmapPeaks, heatmap_legend_side = "bot", annotation_legend_side = "bot")
plotPDF(heatmapPeaks, name = "Peak-Marker-Heatmap", width = 8, height = 6, ArchRProj = demo_proj, addDOC = FALSE)
```
## Motif Enrichment
```{r}
##need to install chromVARmotifs
## devtools::install_github("GreenleafLab/chromVARmotifs")
require(BSgenome.Hsapiens.UCSC.hg38)
demo_proj <- addMotifAnnotations(ArchRProj = demo_proj, motifSet = "cisbp", name = "Motif")

enrichMotifs <- peakAnnoEnrichment(
    seMarker = markersPeaks,
    ArchRProj = demo_proj,
    peakAnnotation = "Motif",
    cutOff = "FDR <= 0.1 & Log2FC >= 0.5"
  )

heatmapEM <- plotEnrichHeatmap(enrichMotifs, n = 7, transpose = TRUE)
ComplexHeatmap::draw(heatmapEM, heatmap_legend_side = "bot", annotation_legend_side = "bot")

plotPDF(heatmapEM, name = "Motifs-Enriched-Marker-Heatmap", width = 8, height = 6, ArchRProj = demo_proj, addDOC = FALSE)

```
##Chrom var deviations
```{r}
demo_proj <- addBgdPeaks(demo_proj)

demo_proj <- addDeviationsMatrix(
  ArchRProj = demo_proj,
  peakAnnotation = "Motif",
  force = TRUE
)

plotVarDev <- getVarDeviations(demo_proj, name = "MotifMatrix", plot = TRUE)
plotVarDev
```
```{r}
motifs <- c("GATA1", "CEBPA",  "IRF4", "TBX21", "PAX5")
markerMotifs <- getFeatures(demo_proj, select = paste(motifs, collapse="|"), useMatrix = "MotifMatrix")
markerMotifs

markerMotifs <- grep("z:", markerMotifs, value = TRUE)
markerMotifs

demo_proj <- addImputeWeights(demo_proj)
p <- plotGroups(ArchRProj = demo_proj,
  groupBy = "Clusters",
  colorBy = "MotifMatrix",
  name = markerMotifs,
  imputeWeights = getImputeWeights(demo_proj)
)

p2 <- lapply(seq_along(p), function(x){
  if(x != 1){
    p[[x]] + guides(color = FALSE, fill = FALSE) +
    theme_ArchR(baseSize = 6) +
    theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "cm")) +
    theme(
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank()
    ) + ylab("")
  }else{
    p[[x]] + guides(color = FALSE, fill = FALSE) +
    theme_ArchR(baseSize = 6) +
    theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "cm")) +
    theme(
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank()
    ) + ylab("")
  }
})
do.call(cowplot::plot_grid, c(list(nrow = 1, rel_widths = c(2, rep(1, length(p2) - 1))),p2))
```

```{r}
sessionInfo()
```

## THE END.