Let us load the libraries required for the various analyses described in this document.
##libraries for "tidy" manipulation of data
suppressMessages(library(tidyverse))
##library providing the pipe operators (%>% and %<>%) used throughout
suppressMessages(library(magrittr))
##library used for normalizing gene expression data and then testing the association of gene expression with the tumor vs normal comparison in bladder cancer samples
suppressMessages(library(DESeq2))
##library used for generating a Volcano Plot
suppressMessages(library(EnhancedVolcano))
##library to illustrate the use of Over Representation Analyses (ORA) and Gene Set Enrichment Analyses (GSEA) with gene permutation
suppressMessages(library(clusterProfiler))
##library to illustrate the use of Simultaneous Enrichment Analyses (SEA)
suppressMessages(library(rSEA))
##library to illustrate the use of Significance Analysis of Function and Expression (SAFE), Pathway Analysis with Down-weighting of Overlapping Genes (PADOG) and Gene Set Enrichment Analyses (GSEA) with sample permutation
suppressMessages(library(GSEABenchmarkeR))
What are the biological pathways/gene sets differentially regulated between the tumor and normal tissues in bladder cancer patients?
The gene expression data we will work with were assayed using RNA-seq in the tumor and normal tissues drawn from 19 subjects with bladder cancer. These data are derived from The Cancer Genome Atlas (TCGA).
The methods we will use to answer the scientific question are described below:
Load the gene expression data and understand the study design
Perform differential expression analyses
Run six different enrichment analysis methods.
Note: In normal practice we would run only one or at most two methods to answer our question. However, our purpose here is to illustrate the use of different methods and to highlight and interpret their results in the context of the assumptions of each method. The choice of method will depend on …
… the nature of our hypothesis, i.e., are we interested in a very specific biochemical pathway, or
… are we agnostic about which biochemical pathways turn out to be associated with what we are studying?
… whether we want to interpret the resulting p-values as measures of how reproducible our enriched pathways would be for other research groups using data derived from new bladder cancer patient samples
… whether the assay we are using is a genome-wide assay or a very targeted assay focusing on a specific group of genes or proteins
The gene expression data will be loaded as a SummarizedExperiment object in an RDS file.
tcga <- readRDS("bladder_cancer_tcga_summarized_experiment.rds")
##short summary of tcga. Note the 12,264 rownames represent the gene names as Entrez IDs
tcga
## class: SummarizedExperiment
## dim: 12264 38
## metadata(3): annotation dataId dataType
## assays(1): exprs
## rownames(12264): 2 144568 ... 23140 26009
## rowData names(0):
## colnames(38): TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-BT-A20W-01A-21R-A14Y-07
## ... TCGA-GC-A6I3-11A-11R-A31N-07 TCGA-GD-A2C5-11A-11R-A180-07
## colData names(4): sample type GROUP BLOCK
print("Short summary of the RNA-seq samples")
## [1] "Short summary of the RNA-seq samples"
##quick summary of 38 samples. Note the variable GROUP refers to tumor vs normal assignment while the variable BLOCK refers to the patient. From each of the 19 patients, tumor and normal tissue are derived and assayed for gene expression
colData(tcga)
## DataFrame with 38 rows and 4 columns
## sample type GROUP
## <character> <factor> <numeric>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV-01A-11R-A22U-07 BLCA 1
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W-01A-21R-A14Y-07 BLCA 1
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI-01A-11R-A28M-07 BLCA 1
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N-01A-11R-A14Y-07 BLCA 1
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J-01A-11R-A277-07 BLCA 1
## ... ... ... ...
## TCGA-BT-A2LB-11A-11R-A18C-07 TCGA-BT-A2LB-11A-11R-A18C-07 BLCA 0
## TCGA-K4-A54R-11A-11R-A26T-07 TCGA-K4-A54R-11A-11R-A26T-07 BLCA 0
## TCGA-GC-A3WC-11A-11R-A22U-07 TCGA-GC-A3WC-11A-11R-A22U-07 BLCA 0
## TCGA-GC-A6I3-11A-11R-A31N-07 TCGA-GC-A6I3-11A-11R-A31N-07 BLCA 0
## TCGA-GD-A2C5-11A-11R-A180-07 TCGA-GD-A2C5-11A-11R-A180-07 BLCA 0
## BLOCK
## <character>
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-K4-A3WV
## TCGA-BT-A20W-01A-21R-A14Y-07 TCGA-BT-A20W
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-K4-A5RI
## TCGA-BT-A20N-01A-11R-A14Y-07 TCGA-BT-A20N
## TCGA-BL-A13J-01A-11R-A277-07 TCGA-BL-A13J
## ... ...
## TCGA-BT-A2LB-11A-11R-A18C-07 TCGA-BT-A2LB
## TCGA-K4-A54R-11A-11R-A26T-07 TCGA-K4-A54R
## TCGA-GC-A3WC-11A-11R-A22U-07 TCGA-GC-A3WC
## TCGA-GC-A6I3-11A-11R-A31N-07 TCGA-GC-A6I3
## TCGA-GD-A2C5-11A-11R-A180-07 TCGA-GD-A2C5
##turn the GROUP and BLOCK variables to categorical variables
tcga$GROUP <- as.factor(tcga$GROUP)
tcga$BLOCK <- as.factor(tcga$BLOCK)
print("Look at the read counts of 4 genes for 5 samples")
## [1] "Look at the read counts of 4 genes for 5 samples"
(assays(tcga))$exprs[1:4,1:5]
## TCGA-K4-A3WV-01A-11R-A22U-07 TCGA-BT-A20W-01A-21R-A14Y-07
## 2 2133 26508
## 144568 1124 60
## 53947 2619 769
## 8086 3621 1914
## TCGA-K4-A5RI-01A-11R-A28M-07 TCGA-BT-A20N-01A-11R-A14Y-07
## 2 18641 3828
## 144568 264 1241
## 53947 2723 424
## 8086 2910 1239
## TCGA-BL-A13J-01A-11R-A277-07
## 2 23443
## 144568 1444
## 53947 544
## 8086 1217
##create a DESeq data object
dds.bc <- DESeqDataSet(tcga, design = ~ GROUP + BLOCK)
##estimate normalization/size-factors and dispersions
dds.bc %<>% DESeq(.)
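To see what the design formula `~ GROUP + BLOCK` encodes, the following is a minimal sketch on a hypothetical toy data set (three made-up patients, not the TCGA samples). It checks that every patient contributes one sample per tissue type and prints the columns of the resulting design matrix using base R only; it does not call DESeq2.

```r
# Toy paired design: 3 hypothetical patients, each with a normal (0) and a tumor (1) sample
toy <- data.frame(
  GROUP = factor(c(1, 1, 1, 0, 0, 0)),
  BLOCK = factor(c("P1", "P2", "P3", "P1", "P2", "P3"))
)

# Every patient should contribute exactly one sample per tissue type
pairing <- table(toy$BLOCK, toy$GROUP)
is_paired <- all(pairing == 1)

# Design matrix implied by ~ GROUP + BLOCK: an intercept, one tumor-vs-normal
# column (GROUP1) and one indicator column per patient beyond the first
design <- model.matrix(~ GROUP + BLOCK, data = toy)
print(pairing)
print(colnames(design))
```

The patient indicator columns absorb patient-to-patient differences, so the GROUP1 coefficient reflects the within-patient tumor vs normal difference.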
##variance stabilizing transformation to view the normalized data
vsd.bc <- dds.bc %>%
  vst(., blind=TRUE)
##generate the PCA plot using the normalized data. Note the clustering of the samples by the tumor versus normal comparisons
vsd.bc %>%
  plotPCA(., intgroup = c("GROUP"))
##differential expression association for tumor versus normal differences controlling for patient specific differences
diff.res <- dds.bc %>%
  results(., contrast = c("GROUP", "1", "0"), pAdjustMethod = "bonferroni")
##visualize the results using a Volcano Plot
diff.res %>%
  as.data.frame() %>%
  EnhancedVolcano(.,
                  lab = rownames(.),
                  x = 'log2FoldChange',
                  y = 'padj',
                  xlim = c(-5, 8))
##output the results
diff.res %>%
  as.data.frame() %>%
  rownames_to_column('Gene') %>%
  write.csv(., "bladder_cancer_diff_exp_results.csv", row.names = FALSE)
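The differential expression step above requests Bonferroni-adjusted p-values, while the enrichment steps later use Benjamini-Hochberg ("BH"). As a rough illustration of how the two adjustments differ, here is a small sketch using base R's `p.adjust` on made-up p-values (the values are hypothetical and not taken from the bladder cancer results).

```r
# Hypothetical raw p-values for five genes (illustrative values only)
p_raw <- c(0.0001, 0.004, 0.02, 0.03, 0.5)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_bonf_manual <- pmin(1, p_raw * length(p_raw))

# Benjamini-Hochberg, for comparison
p_bh <- p.adjust(p_raw, method = "BH")

print(data.frame(p_raw, p_bonf, p_bh))
```

Bonferroni controls the family-wise error rate and is the more conservative of the two, so it will generally yield a shorter list of differentially expressed genes than BH at the same threshold.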
We will load the Gene Ontology and WikiPathways databases. Note, an additional database called PFOCR is also loaded. We will ignore this database during this workshop.
##load the pathway gene set databases
database_lists <- load("databases.RData") # has wp, pfocr, go
##WikiPathways annotation is a data frame that links genes (in terms of their Entrez IDs) to each of the WikiPathways (annotated by their names and IDs)
head(wp_annotation)
## name set_id gene
## 1 FABP4 in ovarian cancer WP4400 574413
## 2 FABP4 in ovarian cancer WP4400 2167
## 3 B Cell Receptor Signaling Pathway WP23 4690
## 4 B Cell Receptor Signaling Pathway WP23 5781
## 5 B Cell Receptor Signaling Pathway WP23 11184
## 6 B Cell Receptor Signaling Pathway WP23 6195
##WikiPathways list is a list of character vectors of Entrez IDs representing genes associated with each pathway
head(wp_list)
## $WP100
## [1] "728441" "91227" "290" "26873" "221357" "92086"
## [7] "3417" "2878" "2944" "2877" "2678" "2876"
## [13] "2953" "2687" "102724197" "2730" "2938" "2729"
## [19] "2937" "2936" "2946" "2539" "2879"
##
## $WP106
## [1] "189" "443" "2875" "445" "435" "18" "2572" "2571" "2806" "5091"
## [11] "2805" "1615"
##
## $WP107
## [1] "8893" "102466854" "8894" "8891" "101930123" "8892"
## [7] "8890" "29904" "1975" "2107" "1974" "1973"
## [13] "5610" "1938" "1937" "1936" "1979" "1978"
## [19] "1977" "1933" "8662" "8663" "8661" "8666"
## [25] "3692" "8667" "8664" "8665" "100302143" "1984"
## [31] "1983" "1981" "9669" "3646" "8672" "23708"
## [37] "9086" "27102" "7458" "1917" "8668" "8669"
## [43] "1915" "26986" "23277" "9451" "728689" "1965"
## [49] "1964" "10209" "10605" "8637" "1968" "1967"
##
## $WP111
## [1] "4694" "4695" "4696" "1340" "1337" "4728"
## [7] "4729" "27089" "513" "514" "55967" "4720"
## [13] "515" "516" "517" "4722" "4723" "518"
## [19] "4724" "4725" "4726" "1339" "1351" "1350"
## [25] "1349" "1347" "1346" "521" "1345" "522"
## [31] "29796" "4697" "4698" "4731" "93974" "9481"
## [37] "102465669" "27109" "4508" "4509" "498" "1355"
## [43] "1353" "539" "374291" "10476" "10632" "9377"
## [49] "7352" "7351" "9016" "6389" "7350" "4519"
## [55] "4512" "4513" "4514" "6391" "6390" "6392"
## [61] "10975" "9551" "4540" "4541" "10063" "6834"
## [67] "4535" "4536" "4537" "4538" "4539" "7385"
## [73] "7384" "7386" "9167" "100616403" "7388" "291"
## [79] "292" "293" "7381" "4705" "4706" "4707"
## [85] "4708" "4709" "4700" "4701" "4702" "4704"
## [91] "6341" "1327" "4716" "4717" "100500805" "4718"
## [97] "4719" "4710" "506" "4711" "4712" "4713"
## [103] "1329" "509" "4714" "4715"
##
## $WP117
## [1] "26716" "9620" "128674" "30817" "6752" "3363"
## [7] "54112" "154" "56413" "2149" "341416" "83873"
## [13] "27202" "23284" "51289" "3356" "3355" "138883"
## [19] "1815" "1814" "4923" "81050" "5737" "8387"
## [25] "5032" "57191" "53829" "9038" "1909" "26648"
## [31] "144124" "2911" "2833" "2798" "1268" "26245"
## [37] "887" "2918" "53831" "2837" "1901" "4935"
## [43] "9289" "9287" "29929" "9288" "4992" "393046"
## [49] "8390" "1880" "23266" "2692" "2492" "59340"
## [55] "8392" "254786" "6608" "26494" "401428" "1952"
## [61] "1951" "135" "3579" "2841" "1234" "3577"
## [67] "2840" "26333" "26212" "221395" "26211" "11245"
## [73] "341276" "2925" "283383" "64582" "100616112" "59352"
## [79] "29933" "1131" "27239" "9290" "140" "59350"
## [85] "118442" "1129" "10888" "84658" "79541" "84539"
## [91] "146" "2532" "84059" "4994"
##
## $WP12
## [1] "8792" "3690" "8111" "5155" "8600" "6696" "4982" "9550" "56302"
## [10] "1513" "3456" "3454" "7965" "5599" "6548" "54"
The input to ORA is a list of genes of interest (here, the genes deemed differentially expressed between the tumor and normal samples) together with the universe of genes from which that list was derived.
We will use a function in the clusterProfiler library to perform this analysis.
##Choose the set of differentially expressed genes
##pick the differentially expressed genes using an adjusted p-value threshold of 0.05
diff_genes <- diff.res %>%
  as.data.frame() %>%
  rownames_to_column('gene') %>%
  filter(padj < 0.05) %>%
  .$gene
##important to pick the universe of genes. We will use all genes for which we have gene counts
universe_genes <- diff.res %>%
  as.data.frame() %>%
  rownames_to_column('gene') %>%
  .$gene
##run the ORA analysis
res_ora <- enricher(
  gene = diff_genes,
  universe = universe_genes,
  pAdjustMethod = "BH",
  pvalueCutoff = 1, #p.adjust cutoff
  qvalueCutoff = 1,
  minGSSize = 1,
  maxGSSize = 100000,
  TERM2GENE = wp_annotation[,c("set_id","gene")],
  TERM2NAME = wp_annotation[,c("set_id","name")])
res_ora <- res_ora@result
## view the first few rows of the results
head(res_ora)
## ID Description GeneRatio BgRatio
## WP2446 WP2446 Retinoblastoma Gene in Cancer 47/1022 86/4839
## WP466 WP466 DNA Replication 26/1022 41/4839
## WP2361 WP2361 Gastric Cancer Network 1 16/1022 22/4839
## WP179 WP179 Cell Cycle 48/1022 115/4839
## WP45 WP45 G1 to S cell cycle control 30/1022 61/4839
## WP289 WP289 Myometrial Relaxation and Contraction Pathways 46/1022 120/4839
## pvalue p.adjust qvalue
## WP2446 6.082885e-12 3.011028e-09 2.817336e-09
## WP466 4.881223e-09 1.208103e-06 1.130388e-06
## WP2361 2.877882e-07 4.208137e-05 3.937438e-05
## WP179 3.400515e-07 4.208137e-05 3.937438e-05
## WP45 9.280214e-07 9.187412e-05 8.596409e-05
## WP289 9.770180e-06 8.060399e-04 7.541894e-04
## geneID
## WP2446 25/54443/890/891/9133/898/9134/993/8318/8317/983/1017/1019/81620/1111/1786/1869/1870/2189/24137/4173/4175/4176/2956/4609/4998/5111/10733/5426/5427/5557/5591/5928/5947/5983/5984/5985/6119/6241/6502/10592/3925/7027/7153/7272/7298/7465
## WP466 8318/990/8317/1017/81620/10926/55388/4171/4173/4174/4175/4176/4998/23594/5111/23649/5424/5426/5427/5557/5558/5982/5983/5984/5985/6119
## WP2361 86/6790/1063/144455/1894/56992/9585/286826/4173/4605/57122/8607/64094/7153/22974/11065
## WP179 25/699/9184/890/891/9133/894/898/9134/991/993/995/8318/990/8317/983/1017/1019/1028/1111/11200/10926/1869/1870/9700/4616/2932/3066/10459/4171/4173/4174/4175/4176/4609/4998/23594/5111/9088/5347/5591/9232/5933/6502/7027/7043/7272/7465
## WP45 891/894/898/9134/993/8318/983/1017/1019/1028/90993/1869/1870/4171/4173/4174/4175/4176/4609/4998/23594/5111/23649/5426/5427/5557/5558/6119/7027/7465
## WP289 58/59/70/108/196883/111/115/408/467/489/800/817/1264/2353/2791/55970/54331/2788/2869/3488/3489/3569/3708/1902/23764/4846/5142/5144/11142/5331/5336/5577/5579/5590/10267/10266/10268/5996/8786/10287/5997/8490/8787/6262/6263/6546
## Count
## WP2446 47
## WP466 26
## WP2361 16
## WP179 48
## WP45 30
## WP289 46
#GeneRatio: proportion of the differentially expressed genes (annotated to at least one WikiPathway) that belong to each WikiPathway
#BgRatio: proportion of all universe genes annotated to at least one WikiPathway that belong to each WikiPathway
##Estimate the odds ratio
#k: total number of differentially expressed genes annotated to at least one WikiPathway that are also part of each gene set
k <- sapply(res_ora$GeneRatio, function(x) as.numeric(strsplit(x, "/")[[1]][1]))
#n: total number of differentially expressed genes annotated to at least one WikiPathway
n <- sapply(res_ora$GeneRatio, function(x) as.numeric(strsplit(x, "/")[[1]][2]))
#M: total number of genes in each gene set
M <- sapply(res_ora$BgRatio, function(x) as.numeric(strsplit(x, "/")[[1]][1]))
#N: total number of genes assigned to at least one WikiPathway. Note, this number will be less than or equal to the total number of genes for which you have count data in the RNA-seq (gene expression) data set
N <- sapply(res_ora$BgRatio, function(x) as.numeric(strsplit(x, "/")[[1]][2]))
odds_ratio <- (k*(N-M-n+k))/((M-k)*(n-k))
res_ora %<>% mutate(odds_ratio=odds_ratio)
## view the first few rows of the results
head(res_ora)
## ID Description GeneRatio BgRatio
## 1 WP2446 Retinoblastoma Gene in Cancer 47/1022 86/4839
## 2 WP466 DNA Replication 26/1022 41/4839
## 3 WP2361 Gastric Cancer Network 1 16/1022 22/4839
## 4 WP179 Cell Cycle 48/1022 115/4839
## 5 WP45 G1 to S cell cycle control 30/1022 61/4839
## 6 WP289 Myometrial Relaxation and Contraction Pathways 46/1022 120/4839
## pvalue p.adjust qvalue
## 1 6.082885e-12 3.011028e-09 2.817336e-09
## 2 4.881223e-09 1.208103e-06 1.130388e-06
## 3 2.877882e-07 4.208137e-05 3.937438e-05
## 4 3.400515e-07 4.208137e-05 3.937438e-05
## 5 9.280214e-07 9.187412e-05 8.596409e-05
## 6 9.770180e-06 8.060399e-04 7.541894e-04
## geneID
## 1 25/54443/890/891/9133/898/9134/993/8318/8317/983/1017/1019/81620/1111/1786/1869/1870/2189/24137/4173/4175/4176/2956/4609/4998/5111/10733/5426/5427/5557/5591/5928/5947/5983/5984/5985/6119/6241/6502/10592/3925/7027/7153/7272/7298/7465
## 2 8318/990/8317/1017/81620/10926/55388/4171/4173/4174/4175/4176/4998/23594/5111/23649/5424/5426/5427/5557/5558/5982/5983/5984/5985/6119
## 3 86/6790/1063/144455/1894/56992/9585/286826/4173/4605/57122/8607/64094/7153/22974/11065
## 4 25/699/9184/890/891/9133/894/898/9134/991/993/995/8318/990/8317/983/1017/1019/1028/1111/11200/10926/1869/1870/9700/4616/2932/3066/10459/4171/4173/4174/4175/4176/4609/4998/23594/5111/9088/5347/5591/9232/5933/6502/7027/7043/7272/7465
## 5 891/894/898/9134/993/8318/983/1017/1019/1028/90993/1869/1870/4171/4173/4174/4175/4176/4609/4998/23594/5111/23649/5426/5427/5557/5558/6119/7027/7465
## 6 58/59/70/108/196883/111/115/408/467/489/800/817/1264/2353/2791/55970/54331/2788/2869/3488/3489/3569/3708/1902/23764/4846/5142/5144/11142/5331/5336/5577/5579/5590/10267/10266/10268/5996/8786/10287/5997/8490/8787/6262/6263/6546
## Count odds_ratio
## 1 47 4.669717
## 2 26 6.616600
## 3 16 10.102054
## 4 48 2.758283
## 5 30 3.693418
## 6 46 2.383944
res_ora %>%
write.csv(., "bladder_cancer_WikiPathways_ora.csv", row.names = FALSE)
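The ORA p-value and odds ratio can be checked by hand for a single pathway. The sketch below takes the counts for WP2446 from the GeneRatio and BgRatio columns above and applies a one-sided hypergeometric test with base R. It assumes that this is the test behind the `pvalue` column, so the result should be compared against that column rather than taken as given.

```r
# Counts taken from the first row of the ORA table (WP2446)
k <- 47    # differentially expressed genes in the pathway
n <- 1022  # differentially expressed genes annotated to any pathway
M <- 86    # universe genes in the pathway
N <- 4839  # universe genes annotated to any pathway

# One-sided hypergeometric p-value: probability of seeing k or more
# pathway genes among n genes drawn at random from the universe
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher exact test on the 2x2 table
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2,
              dimnames = list(InPathway = c("yes", "no"), DE = c("yes", "no")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Sample odds ratio from the 2x2 table, as computed for the odds_ratio column
odds_ratio <- (k * (N - M - n + k)) / ((M - k) * (n - k))

print(tab)
print(c(p_hyper = p_hyper, p_fisher = p_fisher, odds_ratio = odds_ratio))
```

Note that `fisher.test` also reports an odds ratio, but it is a conditional maximum likelihood estimate and will differ slightly from the sample odds ratio computed here.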
rSEA requires as input the (unadjusted) p-values associated with differential expression for each gene.
##get estimates of the overall proportion of genes associated with the tumor vs normal comparison
TDPestimate_full <- setTDP(diff.res$pvalue, universe_genes, alpha = 0.05)
TDPestimate_full
## $TDP.bound
## [1] 0.3419765
##
## $TDP.estimate
## [1] 0.5371005
##run rSEA method
res_rSEA <- SEA(diff.res$pvalue, universe_genes, pathlist = wp_list)
##add a column named set_id (a copy of Name) so that these results can be merged with the wp_annotation data frame
res_rSEA %<>% mutate(set_id=Name)
##get pathway names
wp_id_2_names <- wp_annotation %>%
  select(1,2) %>%
  unique()
res_rSEA %<>% merge(wp_id_2_names,.)
##View the first few rows of the results. Note: SC.adjP is the adjusted p-value for the self-contained null hypothesis, while Comp.adjP is the adjusted p-value for the competitive null hypothesis
res_rSEA %>%
  dplyr::slice(order(Comp.adjP)) %>%
  head()
## set_id name ID
## 1 WP1600 Nicotine Metabolism 38
## 2 WP2276 Glial Cell Differentiation 88
## 3 WP4030 SCFA and skeletal muscle substrate metabolism 332
## 4 WP334 GPCRs, Class B Secretin-like 210
## 5 WP1991 SRF and miRs in Smooth Muscle Differentiation and Proliferation 60
## 6 WP206 Fatty Acid Omega Oxidation 76
## Name Size Coverage TDP.bound TDP.estimate SC.adjP Comp.adjP
## 1 WP1600 6 0.17 1.0000000 1.0 3.395060e-24 3.395060e-24
## 2 WP2276 8 0.62 0.4000000 0.4 7.833862e-26 5.704479e-23
## 3 WP4030 6 0.33 0.5000000 0.5 7.990408e-20 7.990408e-20
## 4 WP334 24 0.17 0.5000000 0.5 3.065793e-21 1.609559e-18
## 5 WP1991 13 0.69 0.8888889 1.0 5.372582e-24 1.169009e-14
## 6 WP206 15 0.33 0.6000000 0.8 2.940210e-38 1.319281e-14
res_rSEA %>% dplyr::slice(order(Comp.adjP)) %>%
write.csv(., "bladder_cancer_WikiPathways_rSEA.csv", row.names = FALSE)
SAFE requires as input the normalized gene expression matrix covering all genes across all 38 samples. The significance of the association between a given gene set and the tumor vs normal comparison is estimated by permuting the sample (tumor or normal) labels within each subject.
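To make "permutation of the sample labels per subject" concrete, here is a minimal sketch on simulated data for a single hypothetical gene. It is not the SAFE, PADOG or GSEA implementation; it only illustrates the idea that swapping the tumor and normal labels within a patient flips the sign of that patient's difference. The effect size, seed and number of permutations are arbitrary assumptions.

```r
set.seed(1)  # reproducible toy example

# Simulated log-expression for one gene in 19 hypothetical patients
n_subj <- 19
normal <- rnorm(n_subj, mean = 8, sd = 1)
tumor  <- normal + rnorm(n_subj, mean = 1, sd = 1)  # built-in tumor effect

# Observed statistic: mean within-patient difference (tumor - normal)
d <- tumor - normal
obs <- mean(d)

# Swapping tumor/normal labels within a patient flips the sign of that
# patient's difference, so each permutation is a random vector of +1/-1 signs
n_perm <- 1000
perm_stats <- replicate(n_perm, mean(d * sample(c(-1, 1), n_subj, replace = TRUE)))

# Two-sided permutation p-value (the +1 keeps it away from exactly zero)
p_perm <- (sum(abs(perm_stats) >= abs(obs)) + 1) / (n_perm + 1)
p_perm
```

Because the labels are only shuffled within patients, the pairing of tumor and normal tissue is preserved under the null distribution.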
##We will use the GSEABenchmarkeR package to run this analyses. The function requires as input a list of SummarizedExperiment objects which includes additional rowData giving the differential expression results
tcga.de <- readRDS("bladder_cancer_tcga_summarized_experiment_w_de_results.rds")
##Note the function runEA takes the raw data, normalizes the expression data using the vst function in DESeq2 that generates the variance stabilized transformed normalized data which is then used as input to the SAFE method
##We will not run the analyses here because the 1000 permutations will take some time to complete
# res_safe_sample_perm <- runEA(tcga.de, method="safe", gs=wp_list, perm=1000)
# res_safe <- res_safe_sample_perm$safe[[1]]$ranking %>% as.data.frame()
# res_safe %<>% mutate(set_id=GENE.SET)
# res_safe %<>% merge(wp_id_2_names,.) %>% dplyr::slice(order(PVAL))
# res_safe %>%
# write.csv(., "bladder_cancer_WikiPathways_safe_sample_perm.csv", row.names = FALSE)
##let us just read-in the results
res_safe <- read.csv("bladder_cancer_WikiPathways_safe_sample_perm.csv", header = TRUE)
##View the first few rows of the results
head(res_safe)
## set_id name
## 1 WP1991 SRF and miRs in Smooth Muscle Differentiation and Proliferation
## 2 WP2023 Cell Differentiation - Index expanded
## 3 WP1602 Nicotine Activity on Dopaminergic Neurons
## 4 WP2029 Cell Differentiation - Index
## 5 WP3996 Ethanol effects on histone modifications
## 6 WP497 Urea cycle and metabolism of amino groups
## GENE.SET GLOB.STAT NGLOB.STAT PVAL
## 1 WP1991 39300 4370 0.001
## 2 WP2023 43100 3920 0.001
## 3 WP1602 40000 3630 0.002
## 4 WP2029 25100 4180 0.007
## 5 WP3996 86700 3100 0.008
## 6 WP497 52400 3280 0.008
PADOG requires as input the normalized gene expression matrix covering all genes across all 38 samples. The significance of the association between a given gene set and the tumor vs normal comparison is estimated by permuting the sample (tumor or normal) labels within each subject. This method additionally weights each gene according to how unique it is to the gene set under consideration.
tcga.de <- readRDS("bladder_cancer_tcga_summarized_experiment_w_de_results.rds")
##Note the function runEA takes the raw data, normalizes the expression data using the vst function in DESeq2 that generates the variance stabilized transformed normalized data which is then used as input to the PADOG method
##We will not run the analyses here because the 1000 permutations will take some time to complete
# res_padog_sample_perm <- runEA(tcga.de, method="padog", gs=wp_list, perm=1000)
# res_padog <- res_padog_sample_perm$padog[[1]]$ranking %>% as.data.frame()
# res_padog %<>% mutate(set_id=GENE.SET)
# res_padog %<>% merge(wp_id_2_names,.) %>% dplyr::slice(order(PVAL))
# res_padog %>%
# write.csv(., "bladder_cancer_WikiPathways_padog_sample_perm.csv", row.names = FALSE)
##let us just read-in the results
res_padog <- read.csv("bladder_cancer_WikiPathways_padog_sample_perm.csv", header = TRUE)
##View the first few rows of the results
head(res_padog)
## set_id name
## 1 WP2023 Cell Differentiation - Index expanded
## 2 WP2029 Cell Differentiation - Index
## 3 WP1991 SRF and miRs in Smooth Muscle Differentiation and Proliferation
## 4 WP2355 Corticotropin-releasing hormone signaling pathway
## 5 WP2361 Gastric Cancer Network 1
## 6 WP4300 Extracellular vesicles in the crosstalk of cardiac cells
## GENE.SET MEAN.ABS.T0 PADOG0 P.MEAN.ABS.T PVAL
## 1 WP2023 4.65 3.970 0.00200 0.00001
## 2 WP2029 4.49 3.780 0.00001 0.00001
## 3 WP1991 5.77 4.980 0.00200 0.00100
## 4 WP2355 2.65 0.539 0.01600 0.00500
## 5 WP2361 5.83 5.940 0.02000 0.01200
## 6 WP4300 2.98 1.790 0.02500 0.01300
GSEA with sample permutation requires as input the normalized gene expression matrix covering all genes across all 38 samples. The significance of the association between a given gene set and the tumor vs normal comparison is estimated by permuting the sample (tumor or normal) labels within each subject.
tcga.de <- readRDS("bladder_cancer_tcga_summarized_experiment_w_de_results.rds")
##Note the function runEA takes the raw data, normalizes the expression data using the vst function in DESeq2 that generates the variance stabilized transformed normalized data which is then used as input to the GSEA method
##We will not run the analyses here because the 1000 permutations will take some time to complete
# res_gsea_sample_perm <- runEA(tcga.de, method="gsea", gs=wp_list, perm=1000)
# res_gsea <- res_gsea_sample_perm$gsea[[1]]$ranking %>% as.data.frame()
# res_gsea %<>% mutate(set_id=GENE.SET)
# res_gsea %<>% merge(wp_id_2_names,.) %>% dplyr::slice(order(PVAL))
# res_gsea %>%
# write.csv(., "bladder_cancer_WikiPathways_gsea_sample_perm.csv", row.names = FALSE)
##let us just read-in the results
res_gsea <- read.csv("bladder_cancer_WikiPathways_gsea_sample_perm.csv", header = TRUE)
##View the first few rows of the results
head(res_gsea)
## set_id
## 1 WP289
## 2 WP3414
## 3 WP706
## 4 WP98
## 5 WP3981
## 6 WP536
## name
## 1 Myometrial Relaxation and Contraction Pathways
## 2 Initiation of transcription and translation elongation at the HIV-1 LTR
## 3 Sudden Infant Death Syndrome (SIDS) Susceptibility Pathways
## 4 Prostaglandin Synthesis and Regulation
## 5 miRNA regulation of prostate cancer signaling pathways
## 6 Calcium Regulation in the Cardiac Cell
## GENE.SET ES NES PVAL
## 1 WP289 -0.593 -1.73 0.00000
## 2 WP3414 -0.584 -1.75 0.00000
## 3 WP706 -0.521 -1.74 0.00000
## 4 WP98 -0.667 -1.65 0.00000
## 5 WP3981 -0.528 -1.76 0.00191
## 6 WP536 -0.561 -1.70 0.00192
GSEA with gene permutation requires as input a score for each gene. The larger the absolute value of a gene's score, the stronger the evidence that the gene's expression is associated with the tumor vs normal comparison. The significance of the association between a given gene set and the tumor vs normal comparison is estimated by permuting the gene labels.
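The idea of permuting gene labels can be sketched with a deliberately simplified set statistic. The example below uses simulated scores and the mean score of the set, not the running-sum enrichment score that GSEA computes. The gene names, set size, shift and seed are all hypothetical.

```r
set.seed(42)  # reproducible toy example

# Hypothetical signed scores for 200 genes; the first 10 form a gene set
# that is shifted towards negative scores (down-regulated in tumor)
scores <- rnorm(200)
names(scores) <- paste0("g", seq_along(scores))
gene_set <- names(scores)[1:10]
scores[gene_set] <- scores[gene_set] - 1.5

# Simplified set statistic: mean score of the genes in the set
obs <- mean(scores[gene_set])

# Gene permutation: compare against random gene sets of the same size
n_perm <- 1000
null_stats <- replicate(n_perm, mean(sample(scores, length(gene_set))))

# Two-sided permutation p-value
p_gene_perm <- (sum(abs(null_stats) >= abs(obs)) + 1) / (n_perm + 1)
p_gene_perm
```

Here the null distribution comes from other genes in the same data set, so the question being asked is whether the set stands out relative to the rest of the genes, not whether the result would replicate in new samples.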
##generate a score for each gene that is equal to -log10(pvalue) in absolute value and whose sign is equal to that of the log FC - positive for up-regulated genes while negative for down-regulated genes
gene_list <- diff.res %>%
  as.data.frame() %>%
  rownames_to_column('Gene') %>%
  mutate(Score = sign(as.numeric(log2FoldChange)) * -log10(as.numeric(as.character(pvalue)))) %>%
  select(c("Score","Gene")) %>%
  arrange(desc(Score))
##convert to a named numeric vector (names are Entrez IDs) sorted in decreasing order of score
gene_list <- unlist(split(gene_list[, 1], gene_list[, 2]))
gene_list <- sort(gene_list[unique(names(gene_list))], decreasing = TRUE)
head(gene_list)
## 4320 54058 55083 7516 1301 6493
## 31.86522 25.39709 25.11739 24.17312 23.57965 22.07034
tail(gene_list)
## 5348 286133 7123 1675 146556 221476
## -49.63643 -51.47116 -52.89205 -57.51413 -67.19160 -86.68870
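The construction of the ranked gene list can be traced on a few made-up genes. This sketch uses hypothetical gene names, fold changes and p-values, and base R only, to show how the signed score is formed and how the named, sorted vector is built.

```r
# Hypothetical differential expression results for four genes
toy_res <- data.frame(
  Gene = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(2.0, -1.5, 0.3, -0.2),
  pvalue = c(0.01, 0.001, 0.1, 1),
  stringsAsFactors = FALSE
)

# Signed score: magnitude from the p-value, direction from the fold change
score <- sign(toy_res$log2FoldChange) * -log10(toy_res$pvalue)
names(score) <- toy_res$Gene

# Named numeric vector sorted in decreasing order of score
ranked <- sort(score, decreasing = TRUE)
print(ranked)
```

Strongly up-regulated genes end up at the top of the list, strongly down-regulated genes at the bottom, and genes with little evidence either way sit near zero in the middle.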
##run the gene perm version of gsea
res_gsea_gene_perm <- clusterProfiler::GSEA(
  gene_list,
  pAdjustMethod = "BH",
  TERM2GENE = wp_annotation[,c("set_id","gene")],
  TERM2NAME = wp_annotation[,c("set_id","name")],
  minGSSize = 1,
  maxGSSize = 100000,
  pvalueCutoff = 1,
  verbose = FALSE)
res_gsea_gene_perm <- res_gsea_gene_perm@result
##view the first few rows of the results
head(res_gsea_gene_perm)
## ID Description setSize
## WP3888 WP3888 VEGFA-VEGFR2 Signaling Pathway 403
## WP4172 WP4172 PI3K-Akt Signaling Pathway 252
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway 243
## WP2882 WP2882 Nuclear Receptors Meta-Pathway 222
## WP382 WP382 MAPK Signaling Pathway 195
## WP306 WP306 Focal Adhesion 173
## enrichmentScore NES pvalue p.adjust qvalues rank
## WP3888 -0.4627917 -1.492921 0.001041667 0.03638752 0.02744793 2506
## WP4172 -0.5049345 -1.588244 0.001088139 0.03638752 0.02744793 2187
## WP3932 -0.5152809 -1.616262 0.001094092 0.03638752 0.02744793 2187
## WP2882 -0.5290426 -1.648267 0.001116071 0.03638752 0.02744793 1655
## WP382 -0.5817051 -1.798170 0.001129944 0.03638752 0.02744793 1531
## WP306 -0.5635541 -1.724448 0.001144165 0.03638752 0.02744793 2687
## leading_edge
## WP3888 tags=29%, list=20%, signal=24%
## WP4172 tags=31%, list=18%, signal=26%
## WP3932 tags=29%, list=18%, signal=24%
## WP2882 tags=22%, list=13%, signal=19%
## WP382 tags=23%, list=12%, signal=21%
## WP306 tags=40%, list=22%, signal=32%
## core_enrichment
## WP3888 6154/9209/2185/5867/301/468/6129/5970/2746/4673/2887/6595/1466/2022/6093/4641/23291/9444/3688/57758/9261/5784/4303/3142/9734/26058/6778/4855/5170/3791/8828/5578/6461/4773/10628/154796/4792/6648/4790/1432/25759/5803/2152/355/80031/3490/5587/8665/27289/5743/23189/4736/3725/51574/5908/152273/2549/6125/7414/5906/1397/2078/9252/4846/5563/11080/6886/6386/4772/22943/2308/56999/25/2534/6546/9365/781/3690/57326/154/1847/7220/4629/1839/9759/6401/5592/51309/1003/326624/6347/596/1465/1901/1827/32/10014/22899/114789/857/91624/84952/274/6722/4208/9079/1958/5579/7111/1960/9510/4929/57381/7148/10231/8013/81575/3164
## WP4172 5728/9180/3716/55012/5170/57521/7424/3791/5578/3672/4254/1299/10161/3566/1436/7057/4790/1435/3574/3910/2323/10681/1026/7099/6446/4170/55970/3717/1975/7450/54331/4846/5563/284/7010/1291/4609/1288/5156/10000/3570/3690/4915/3913/3678/3563/5521/3680/90993/9586/5516/3815/1292/3082/3479/5525/80310/2260/3679/1286/627/596/1902/894/2690/10319/2258/3908/1440/2252/4804/2247/5649/2788/2791/8516/3569/7148
## WP3932 5728/9180/3716/55012/5170/57521/7424/3791/4254/2034/10161/3566/8660/1436/7057/51719/23216/1435/3910/81617/1026/55970/3717/1975/7450/54331/4846/5563/2308/284/7010/1288/5156/10000/3570/3690/3913/3678/3563/5521/3680/6515/90993/9586/5516/3815/1292/3082/3479/5525/80310/2260/3679/1286/1902/2690/10319/2258/10891/3908/1440/2252/4804/64344/2247/5649/2788/6517/2791/8516/7148
## WP2882 4240/8824/4780/60482/6594/5743/3725/5465/11214/34/2040/2308/4609/8850/5552/2908/2099/80315/3726/5244/6515/4616/1028/1066/1839/5243/330/3082/10486/23764/89795/6347/2258/10891/2289/1831/2042/2949/7049/5142/1958/6649/7048/3727/6517/10252/2878/5997/5166
## WP382 355/1845/1844/4137/3725/5908/5602/55970/5906/9252/4772/775/7043/4609/10000/781/4915/6237/408/785/783/2260/1850/1326/5532/627/3306/9020/2316/8912/2258/2252/5533/6722/4208/120892/2247/10235/2353/8605/7048/2318/3727/1843/3164
## WP306 5159/2268/858/387/2889/7409/9475/7423/3915/7408/6093/3688/1793/5728/394/5170/7424/83660/3791/5578/5500/3672/2909/7057/60/4659/25759/3910/23396/54776/87/3725/5908/5602/7414/5906/7450/7791/7094/1288/2534/3611/55742/5156/10000/3690/3913/3678/3680/1292/330/3082/3479/80310/3679/1286/596/894/2316/10319/3908/857/4660/10398/4638/5649/2318/5579/8516/7148
res_gsea_gene_perm %>%
write.csv(., "bladder_cancer_WikiPathways_gsea_gene_perm.csv", row.names = FALSE)