This document shows the general steps used to run the tests true positive rates (TPR) in the PANOPLY manuscript. The steps use the covariance structure or RNA expression from breast normal tissues to simulate random multivariate normal expression data for a case and a set of matched controls, where the case had higher mean expression in a subsest of genes that are in cancer networks targeted by a specific drug. To assess the impact of having DNA events as the cancer drivers of genomic networks that are targets of cancer drugs, we include DNA copy number events in specific driver genes and at random. The results from the complete simulations of 500 datasets (nsim=500) from each setting are summarized in the PANOPLY manuscript and supplementary methods.
The first code chunk describes how we created the document using R-markdown, and loads the necessary R packages.
## to make this into a word document:
## library(rmarkdown); render(filename.Rmd", output_format="word_document")
options(stringsAsFactors=FALSE,width=140)
library(knitr) ## contains kable
library(MASS) ## contains mvrnorm
library(panoply)
Next, load the datasets from the PANOPLY package, and we create a covariance matrix from a normalized TCGA gene expression dataset that we will use to simulate random multivariate normal gene counts. The covariance matrix used in the initial manuscript for PANOPLY can be downloaded from http://kalarikrlab.org/Software/Panoply.html
data(gcPanCO)
covGC <- cov(t(gcPanCO))
## load("covSOLID.RData")
## covGC <- covSOLID
meanGC <- rep(0, ncol(covGC))
names(meanGC) <- colnames(covGC)
data(genelistPan)
data(gcinfoPan)
data(dgidbPan)
data(reactome)
data(dgiSets)
The PANOPLY workflow allows for various settings to test cancer driver genes and cancer drugs. We first give settings that were used to control our published simulations, which includes the number of matched controls, the true positive drug name and associated gene drivers and network genes, the over-expression level to induce for those network genes in the case, and the number of DNA driver genes to simulate. Below are these variables with a short description in the comment on the same line.
## TPR drug/genes
pickDrug <- "OLAPARIB" ## Drug considered the "true positive"
cnGenes <- c("ATR","BLM","BRCA1","CHEK2") ## cancer driver genes for Olaparib
npt <- 9 ## 1 case and 8 matched controls
sdExpress <- 2 ## gene-specific over-expression level in the case (sd of genes)
n.dna.events <- 30 ## in addition to RNA drivers, number of random dna drivers (of total 429)
We define a set of genes that are differentially expressed because of a variant or copy number event in the driver genes set above (cnGenes) that are involved in cancer genomic networks with multiple genes targeted by Olaparib. Below, we select a set of 27 genes to be over-expressed because of a genomic event, which is based directly on the gene set annotation set curated for PANOPLY.
brca1 <- colnames(reactome.adj)[reactome.adj["BRCA1",]>0]
chek2 <- colnames(reactome.adj)[reactome.adj["CHEK2",]>0]
atr <- colnames(reactome.adj)[reactome.adj["ATR",]>0]
blm <- colnames(reactome.adj)[reactome.adj["BLM",]>0]
alltbl <- table(c(brca1, chek2,atr,blm))
pickNetGenes <- names(alltbl)[alltbl>1]
kable(data.frame(Genes.Olaparib=pickNetGenes))
Genes.Olaparib |
---|
ATM |
BARD1 |
BRCA1 |
BRCA2 |
CHEK2 |
FANCA |
FANCC |
FANCD2 |
FANCE |
FANCF |
FANCG |
FANCI |
FANCL |
H2AFX |
KAT5 |
MRE11A |
NBN |
RAD21 |
RAD50 |
RBBP8 |
REC8 |
SMC1A |
SMC3 |
STAG1 |
STAG2 |
SYCP2 |
TP53 |
These are algorithm settings for when running the driver gene and drug tests, which are arguments used in the two main functions: panGeneSets and panDrugSets. We list them below with a brief description in the comments.
set.tailPct <- 0.2 ## pct of RNA genes in case (upper or lower tails) as "drivers" in DMT
set.tailEnd <- "upper" ## direction of tail as RNA drivers (upper, lower, or both)
set.eventOnly=TRUE ## in DMT, only use RNA+DNA driver events, if FALSE, uses all 429
set.nsim=100 ## n-simulations for DMT p-value. We suggest using 1000 or more, but fewer here for run-time.
We first ignore the DNA driver events and only analyze the RNA over-expressed genes for the simulated “case” as the driver events. We show the DNT and DMT results for a small number of case-control datasets results for Olaparib, sorted by the P.Score for each simulation. We see a moderate number of p-values significant at 0.05 for DMT, and fewer from DNT.
meanGC <- rep(0, ncol(covGC))
names(meanGC) <- colnames(covGC)
meanGCcase <- meanGC
sdDrGenes <- sqrt(diag(covGC[pickNetGenes,pickNetGenes])) * sdExpress
meanGCcase[pickNetGenes] <- 0 + sdDrGenes
set.seed(2000)
gcSim1 <- mvrnorm(1, mu=meanGCcase, Sigma=covGC)
gcSim8 <- mvrnorm(npt-1, mu=meanGC, Sigma=covGC)
gcSim <- t(rbind(gcSim1, gcSim8))
colnames(gcSim) <- paste("Null",1:npt,sep="_")
i <- colnames(gcSim)[1]
patient <- i
ptmatch <- colnames(gcSim)[!(colnames(gcSim) %in% i)]
driversRNA <- panGeneSets(caseids=patient,controlids=ptmatch,gcount=gcSim,
tailPct=set.tailPct, tailEnd=set.tailEnd, eventOnly=set.eventOnly)
kable(driversRNA[1:10,1:6], row.names=FALSE, caption="Differentially-Expressed Driver Gene Sets, RNA drivers")
Cancer.Gene | Network | Network.pval | Network.mean | Druggable | Total.Druggable |
---|---|---|---|---|---|
BARD1 | 10 | 0.0009066 | 3.119244 | 0 | 0 |
BRCA1 | 55 | 0.0123747 | 2.245293 | 1 | 5 |
KL | 17 | 0.0136933 | 2.205962 | 0 | 0 |
ATM | 37 | 0.0277795 | 1.914478 | 1 | 4 |
FGF10 | 40 | 0.0301957 | 1.877925 | 0 | 3 |
CHEK2 | 25 | 0.0321094 | 1.850658 | 1 | 4 |
FANCM | 14 | 0.0348612 | 1.813710 | 0 | 0 |
BRCA2 | 4 | 0.0407196 | 1.742396 | 1 | 3 |
RAB3A | 12 | 0.0556434 | 1.592436 | 0 | 0 |
FANCD2 | 6 | 0.0605901 | 1.549839 | 0 | 1 |
drugRNA <- panDrugSets(driversRNA,caseids=patient, controlids=ptmatch,
gcount=gcSim, gene.gs=reactome.gs,
gene.adj=reactome.adj, drug.gs=dgi.gs, drug.adj=dgi.adj,
minPathways=0, nsim = set.nsim, tailEnd = set.tailEnd)
kable(drugRNA[1:15,1:10], row.names=FALSE, digits=3, caption="Drug Test Results, RNA-only")
Drug | N.Cancer.Genes | Cancer.Genes | N.Network.Genes | Network.Genes | Network | DNT.pval | DMT.Stat | DMT.pval | PScore |
---|---|---|---|---|---|---|---|---|---|
ARN-509 | 1 | AR | 1 | AR | 1 | 0.000 | 5.136 | 0.21 | 4.884 |
DROMOSTANOLONE | 1 | AR | 1 | AR | 1 | 0.000 | 5.136 | 0.21 | 4.884 |
METHYLTESTOSTERONE | 1 | AR | 1 | AR | 1 | 0.000 | 5.136 | 0.21 | 4.884 |
BICALUTAMIDE | 1 | AR | 1 | AR | 5 | 0.000 | 3.632 | 0.21 | 4.762 |
ANASTROZOLE | 1 | ESR1 | 2 | ESR1,CYP19A1 | 6 | 0.000 | 2.793 | 0.11 | 4.670 |
KETOCONAZOLE | 1 | AR | 2 | AR,CYP19A1 | 22 | 0.000 | 3.010 | 0.21 | 4.123 |
NILUTAMIDE | 1 | AR | 1 | AR | 4 | 0.000 | 3.632 | 0.21 | 4.113 |
BMN673 | 3 | BRCA1,ATM,BRCA2 | 3 | BRCA1,ATM,BRCA2 | 4 | 0.003 | 3.314 | 0.03 | 4.088 |
RUCAPARIB | 3 | BRCA1,ATM,BRCA2 | 3 | BRCA1,ATM,BRCA2 | 4 | 0.003 | 3.314 | 0.03 | 4.088 |
VELIPARIB | 3 | BRCA1,ATM,BRCA2 | 3 | BRCA1,ATM,BRCA2 | 4 | 0.003 | 3.314 | 0.03 | 4.088 |
AMINOGLUTETHIMIDE | 0 | 1 | CYP19A1 | 2 | 0.000 | -0.055 | 0.82 | 4.069 | |
TALAMPANEL | 0 | 0 | 4 | 0.000 | 0.000 | 1.00 | 3.595 | ||
ENZALUTAMIDE | 1 | AR | 1 | AR | 10 | 0.002 | 3.632 | 0.21 | 3.499 |
OLAPARIB | 3 | BRCA1,ATM,BRCA2 | 3 | BRCA1,ATM,BRCA2 | 5 | 0.011 | 3.314 | 0.03 | 3.493 |
LETROZOLE | 1 | ESR1 | 2 | ESR1,CYP19A1 | 5 | 0.004 | 2.793 | 0.11 | 3.317 |
We now add DNA driver events to “turn on” the neighboring genes of cancer driver genes. Since internally the functions treat variant and copy number events equally as an “event”, we simplify by simulating a copy number event. We simulate 30 random genes, plus four specific genes that have the over-expressed Olaparib-specific genes in their networks, which are ATM, CHEK2, BRCA2, and BLM.
set.seed(2000)
colnames(gcSim) <- paste("Null",1:npt,sep="_")
i <- colnames(gcSim)[1]
patient <- i
ptmatch <- colnames(gcSim)[!(colnames(gcSim) %in% i)]
cnSim <- data.frame(CHROM=1,Gene.Symbol=rownames(reactome.adj),
matrix(0, nrow=nrow(reactome.adj), ncol=npt))
colnames(cnSim)[3:ncol(cnSim)] <- c(patient, ptmatch);
cnSim[sample(nrow(reactome.adj),n.dna.events),patient] <- 1
cnSim[cnSim$Gene.Symbol %in% cnGenes,patient] <- 1
drivRNADNA <- panGeneSets(caseids=patient,controlids=ptmatch,
gene.adj=reactome.adj, drug.adj=dgi.adj,
gcount=gcSim,cna=cnSim,tailPct=set.tailPct,
tailEnd=set.tailEnd, eventOnly=set.eventOnly)
kable(drivRNADNA[1:10,1:6], row.names=FALSE, caption="Differentially-Expressed Driver Gene Sets from RNA and DNA drivers")
Cancer.Gene | Network | Network.pval | Network.mean | Druggable | Total.Druggable |
---|---|---|---|---|---|
ATR | 49 | 0.0003531 | 3.387170 | 1 | 5 |
BARD1 | 10 | 0.0009066 | 3.119244 | 0 | 0 |
BLM | 21 | 0.0020327 | 2.873043 | 0 | 3 |
BRCA1 | 55 | 0.0123747 | 2.245293 | 1 | 5 |
KL | 17 | 0.0136933 | 2.205962 | 0 | 0 |
GLI1 | 31 | 0.0172074 | 2.115178 | 0 | 0 |
ATM | 37 | 0.0277795 | 1.914478 | 1 | 4 |
FGF10 | 40 | 0.0301957 | 1.877925 | 0 | 3 |
CHEK2 | 25 | 0.0321094 | 1.850658 | 1 | 4 |
FANCM | 14 | 0.0348612 | 1.813710 | 0 | 0 |
drugRNADNA <- panDrugSets(drivRNADNA,caseids=patient, controlids=ptmatch,
gcount=gcSim, gene.gs=reactome.gs, gene.adj=reactome.adj,
drug.gs=dgi.gs, drug.adj=dgi.adj,
minPathways=0, nsim=set.nsim, tailEnd=set.tailEnd)
kable(drugRNADNA[1:15,1:10], row.names=FALSE, digits=3, caption="Drug Test Results")
Drug | N.Cancer.Genes | Cancer.Genes | N.Network.Genes | Network.Genes | Network | DNT.pval | DMT.Stat | DMT.pval | PScore |
---|---|---|---|---|---|---|---|---|---|
ARN-509 | 1 | AR | 1 | AR | 1 | 0.000 | 1.372 | 0.31 | 4.715 |
DROMOSTANOLONE | 1 | AR | 1 | AR | 1 | 0.000 | 1.372 | 0.31 | 4.715 |
METHYLTESTOSTERONE | 1 | AR | 1 | AR | 1 | 0.000 | 1.372 | 0.31 | 4.715 |
AMINOGLUTETHIMIDE | 0 | 1 | CYP19A1 | 2 | 0.000 | 2.560 | 0.20 | 4.682 | |
ANASTROZOLE | 1 | ESR1 | 2 | ESR1,CYP19A1 | 6 | 0.000 | 2.974 | 0.13 | 4.598 |
BICALUTAMIDE | 1 | AR | 1 | AR | 5 | 0.000 | 0.970 | 0.31 | 4.593 |
BMN673 | 4 | ATR,BRCA1,ATM,BRCA2 | 3 | BRCA1,BRCA2,ATM | 4 | 0.003 | 4.084 | 0.01 | 4.566 |
RUCAPARIB | 4 | ATR,BRCA1,ATM,BRCA2 | 3 | BRCA1,BRCA2,ATM | 4 | 0.003 | 4.084 | 0.01 | 4.566 |
VELIPARIB | 4 | ATR,BRCA1,ATM,BRCA2 | 3 | BRCA1,BRCA2,ATM | 4 | 0.003 | 4.084 | 0.01 | 4.566 |
KETOCONAZOLE | 1 | AR | 2 | AR,CYP19A1 | 22 | 0.000 | 1.449 | 0.29 | 3.983 |
OLAPARIB | 4 | ATR,BRCA1,ATM,BRCA2 | 3 | BRCA1,BRCA2,ATM | 5 | 0.011 | 4.084 | 0.01 | 3.970 |
NILUTAMIDE | 1 | AR | 1 | AR | 4 | 0.000 | 0.970 | 0.31 | 3.944 |
ATAMESTANE | 0 | 1 | CYP19A1 | 1 | 0.001 | 2.560 | 0.20 | 3.848 | |
TESTOLACTONE | 0 | 1 | CYP19A1 | 1 | 0.001 | 2.560 | 0.20 | 3.848 | |
TALAMPANEL | 0 | 0 | 4 | 0.000 | 0.000 | 1.00 | 3.595 |
Next we provide an example for how a subset of genes, e.g., a panel set, could be used in PANOPLY. We generate a random subset of 1000 network genes, and make a new adjacency matrix from those based on the adjacenty matrix we provided. However, a user can provide their own adjacency matrix to be just like the one we created. To make sure to get sufficient mixture of driver genes, we now set eventOnly=FALSE, which means to test all remaining driver gene networks in the panGeneSets function, which are then used in panDrugSets.
set.seed(1000)
panel.genes <- sample(colnames(reactome.adj), size=1000, replace=FALSE)
panel.adj <- reactome.adj[,panel.genes]
panel.adj <- panel.adj[rowSums(panel.adj)>0,]
dim(panel.adj)
## [1] 321 1000
drivPanel <- panGeneSets(caseids=patient, controlids=ptmatch,
gcount=gcSim, cna=cnSim, tailPct=set.tailPct,
gene.adj=panel.adj, drug.adj=dgi.adj,
tailEnd=set.tailEnd, eventOnly=FALSE)
kable(drivPanel[1:10,1:6], digits=3, row.names=FALSE, caption="DE Driver Gene Sets from Panel Set, RNA + DNA")
Cancer.Gene | Network | Network.pval | Network.mean | Druggable | Total.Druggable |
---|---|---|---|---|---|
KL | 2 | 0.003 | 2.799 | 0 | 0 |
CD274 | 3 | 0.024 | 1.970 | 0 | 1 |
SMARCB1 | 1 | 0.037 | 1.792 | 0 | 1 |
FANCL | 2 | 0.040 | 1.753 | 0 | 0 |
SPTA1 | 1 | 0.050 | 1.643 | 0 | 0 |
CHEK1 | 1 | 0.053 | 1.620 | 1 | 1 |
FANCM | 1 | 0.053 | 1.620 | 0 | 0 |
ATR | 9 | 0.058 | 1.575 | 1 | 2 |
PARK2 | 1 | 0.080 | 1.404 | 0 | 0 |
BRCA1 | 11 | 0.083 | 1.387 | 1 | 3 |
drugPanel <- panDrugSets(drivPanel, caseids=patient, controlids=ptmatch,
gcount=gcSim, gene.gs=reactome.gs, gene.adj=panel.adj,
drug.gs=dgi.gs, drug.adj=dgi.adj, minPathways=0,
nsim=set.nsim, tailEnd=set.tailEnd)
kable(drugPanel[1:10,1:10], digits=3, row.names=FALSE, caption="Drug Test Results from Gene Panel")
Drug | N.Cancer.Genes | Cancer.Genes | N.Network.Genes | Network.Genes | Network | DNT.pval | DMT.Stat | DMT.pval | PScore |
---|---|---|---|---|---|---|---|---|---|
ANASTROZOLE | 1 | ESR1 | 2 | ESR1,CYP3A4 | 6 | 0.000 | 2.816 | 0.06 | 4.934 |
BICALUTAMIDE | 1 | AR | 2 | CYP3A4,AR | 5 | 0.000 | 3.717 | 0.18 | 4.829 |
ARN-509 | 1 | AR | 1 | AR | 1 | 0.000 | 0.000 | 1.00 | 4.206 |
DROMOSTANOLONE | 1 | AR | 1 | AR | 1 | 0.000 | 0.000 | 1.00 | 4.206 |
METHYLTESTOSTERONE | 1 | AR | 1 | AR | 1 | 0.000 | 0.000 | 1.00 | 4.206 |
KETOCONAZOLE | 1 | AR | 2 | CYP3A4,AR | 22 | 0.000 | 3.717 | 0.18 | 4.190 |
AMINOGLUTETHIMIDE | 0 | 0 | 2 | 0.000 | 0.000 | 1.00 | 3.983 | ||
IMATINIB | 8 | NTRK1,CSF1R,ABL1,BCR,RET,PDGFRA,KIT,PDGFRB | 2 | NTRK1,CYP3A4 | 19 | 0.004 | 1.646 | 0.03 | 3.891 |
TALAMPANEL | 0 | 0 | 4 | 0.000 | 0.000 | 1.00 | 3.595 | ||
LETROZOLE | 1 | ESR1 | 2 | ESR1,CYP3A4 | 5 | 0.004 | 2.816 | 0.06 | 3.580 |
Below we give a brief explanation of how we picked gene sets to be targeted by a single drug, Olaparib, in the initial manuscript for PANOPLY.
The genes in TPR simulation set A are the direct targets of the drug Olaparib, which is annotated and used within PANOPLY. The genes targeted by any drug can be found using the same two lines of code.
pickGenesA <- dgi.gs[[pickDrug]]
pickGenesA <- pickGenesA[pickGenesA %in% colnames(reactome.adj)]
kable(data.frame(Genes.SetA=pickGenesA))
Genes.SetA |
---|
ATR |
ATM |
BRCA1 |
BRCA2 |
PARP1 |
The genes in TPR simulation set B are the genes annotated as directly affected by BRCA2, which is a target of Olaparib. So the four genes being over-expressed is like BRCA2 being turned on, causing the over-expression, and that Olaparib would target that sub-network, including BRCA2 itself. The set defined directly from the gene adjacency matrix.
pickGenesB <- names(which(reactome.adj["BRCA2",]>0))
kable(data.frame(Genes.SetB=pickGenesB))
Genes.SetB |
---|
BRCA2 |
ESR1 |
RAD51 |
RAD51C |
The genes selected for set C is not as easily defined from annotation as sets A and B above. We aimed to have a gene set of similar size as set A (5 total), but also affecting downstream genes within multiple sub-networks that are targeted by Olaparib. We show below the genes in set C (7 total), followed by the cancer driver genes that are upstream of three or more of the set C genes.
pickGenesC <- c("BRCA1","RAD21","FANCG","FANCI","SUMO2","H2AFX","ATM")
kable(data.frame(Genes.SetC=pickGenesC))
Genes.SetC |
---|
BRCA1 |
RAD21 |
FANCG |
FANCI |
SUMO2 |
H2AFX |
ATM |
DriversC=names(which(rowSums(reactome.adj[,pickGenesC])>2))
kable(data.frame(Drivers.SetC=DriversC), caption="Cancer Driver Genes that target 3 or more genes from set C")
Drivers.SetC |
---|
ATM |
ATR |
BRCA1 |
CHEK2 |
RAD51C |
For reproducibility, we provide session information.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.9 (Final)
##
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.2.1
## LAPACK: /usr/lib64/atlas/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=C LC_MONETARY=en_US.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.20 panoply_0.981 RColorBrewer_1.1-2 randomForest_4.6-12 Rgraphviz_2.22.0 graph_1.56.0
## [7] BiocGenerics_0.24.0 circlize_0.4.2 gage_2.28.0 MASS_7.3-47
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.15 highr_0.6 formatR_1.5 pillar_1.1.0 compiler_3.4.2 XVector_0.18.0
## [7] tools_3.4.2 zlibbioc_1.24.0 digest_0.6.12 bit_1.1-14 evaluate_0.10.1 RSQLite_2.0
## [13] memoise_1.1.0 tibble_1.4.2 png_0.1-7 rlang_0.1.6 DBI_0.8 httr_1.3.1
## [19] stringr_1.3.0 Biostrings_2.46.0 S4Vectors_0.16.0 GlobalOptions_0.0.12 IRanges_2.12.0 stats4_3.4.2
## [25] bit64_0.9-7 Biobase_2.38.0 R6_2.2.2 AnnotationDbi_1.40.0 blob_1.1.0 magrittr_1.5
## [31] KEGGREST_1.18.0 mime_0.5 shape_1.4.3 colorspace_1.3-2 stringi_1.1.7 markdown_0.8