Introduction

This document shows the general steps used to run the tests true positive rates (TPR) in the PANOPLY manuscript. The steps use the covariance structure or RNA expression from breast normal tissues to simulate random multivariate normal expression data for a case and a set of matched controls, where the case had higher mean expression in a subsest of genes that are in cancer networks targeted by a specific drug. To assess the impact of having DNA events as the cancer drivers of genomic networks that are targets of cancer drugs, we include DNA copy number events in specific driver genes and at random. The results from the complete simulations of 500 datasets (nsim=500) from each setting are summarized in the PANOPLY manuscript and supplementary methods.

Setup

The first code chunk describes how we created the document using R-markdown, and loads the necessary R packages.

## to make this into a word document:
## library(rmarkdown); render(filename.Rmd", output_format="word_document")
options(stringsAsFactors=FALSE,width=140)
library(knitr) ## contains kable
library(MASS)  ## contains mvrnorm
library(panoply)

Load Data

Next, load the datasets from the PANOPLY package, and we create a covariance matrix from a normalized TCGA gene expression dataset that we will use to simulate random multivariate normal gene counts. The covariance matrix used in the initial manuscript for PANOPLY can be downloaded from http://kalarikrlab.org/Software/Panoply.html

data(gcPanCO)
covGC <- cov(t(gcPanCO))

## load("covSOLID.RData")
## covGC <- covSOLID
meanGC <- rep(0, ncol(covGC))
names(meanGC) <- colnames(covGC)

data(genelistPan)
data(gcinfoPan)
data(dgidbPan)
data(reactome)
data(dgiSets)

Simulation Settings

The PANOPLY workflow allows for various settings to test cancer driver genes and cancer drugs. We first give settings that were used to control our published simulations, which includes the number of matched controls, the true positive drug name and associated gene drivers and network genes, the over-expression level to induce for those network genes in the case, and the number of DNA driver genes to simulate. Below are these variables with a short description in the comment on the same line.

## TPR drug/genes
pickDrug <- "OLAPARIB"                    ## Drug considered the "true positive" 
cnGenes <- c("ATR","BLM","BRCA1","CHEK2") ## cancer driver genes for Olaparib

npt <- 9           ## 1 case and 8 matched controls
sdExpress <- 2     ## gene-specific over-expression level in the case (sd of genes)
n.dna.events <- 30 ## in addition to RNA drivers, number of random dna drivers (of total 429)

Olaparib Gene Set for Simulations

We define a set of genes that are differentially expressed because of a variant or copy number event in the driver genes set above (cnGenes) that are involved in cancer genomic networks with multiple genes targeted by Olaparib. Below, we select a set of 27 genes to be over-expressed because of a genomic event, which is based directly on the gene set annotation set curated for PANOPLY.

brca1 <- colnames(reactome.adj)[reactome.adj["BRCA1",]>0]
chek2 <- colnames(reactome.adj)[reactome.adj["CHEK2",]>0]
atr <- colnames(reactome.adj)[reactome.adj["ATR",]>0]
blm <- colnames(reactome.adj)[reactome.adj["BLM",]>0]
alltbl <- table(c(brca1, chek2,atr,blm))
pickNetGenes <- names(alltbl)[alltbl>1]
kable(data.frame(Genes.Olaparib=pickNetGenes))

Genes.Olaparib
ATM
BARD1
BRCA1
BRCA2
CHEK2
FANCA
FANCC
FANCD2
FANCE
FANCF
FANCG
FANCI
FANCL
H2AFX
KAT5
MRE11A
NBN
RAD21
RAD50
RBBP8
REC8
SMC1A
SMC3
STAG1
STAG2
SYCP2
TP53

Algorithm settings

These are algorithm settings for when running the driver gene and drug tests, which are arguments used in the two main functions: panGeneSets and panDrugSets. We list them below with a brief description in the comments.

set.tailPct <- 0.2     ## pct of RNA genes in case (upper or lower tails) as "drivers" in DMT
set.tailEnd <- "upper" ## direction of tail as RNA drivers (upper, lower, or both) 
set.eventOnly=TRUE     ## in DMT, only use RNA+DNA driver events, if FALSE, uses all 429
set.nsim=100           ## n-simulations for DMT p-value. We suggest using 1000 or more, but fewer here for run-time.

Examples

RNA Drivers Only

We first ignore the DNA driver events and only analyze the RNA over-expressed genes for the simulated “case” as the driver events. We show the DNT and DMT results for a small number of case-control datasets results for Olaparib, sorted by the P.Score for each simulation. We see a moderate number of p-values significant at 0.05 for DMT, and fewer from DNT.

meanGC <- rep(0, ncol(covGC))
names(meanGC) <- colnames(covGC)
meanGCcase <- meanGC
sdDrGenes <- sqrt(diag(covGC[pickNetGenes,pickNetGenes])) * sdExpress
meanGCcase[pickNetGenes] <- 0 + sdDrGenes

set.seed(2000)

gcSim1 <- mvrnorm(1, mu=meanGCcase, Sigma=covGC)
gcSim8 <- mvrnorm(npt-1, mu=meanGC, Sigma=covGC)
gcSim <- t(rbind(gcSim1, gcSim8))
colnames(gcSim) <- paste("Null",1:npt,sep="_")

i <- colnames(gcSim)[1]
patient <- i
ptmatch <- colnames(gcSim)[!(colnames(gcSim) %in% i)]

driversRNA <- panGeneSets(caseids=patient,controlids=ptmatch,gcount=gcSim,
                     tailPct=set.tailPct, tailEnd=set.tailEnd, eventOnly=set.eventOnly)

kable(driversRNA[1:10,1:6], row.names=FALSE, caption="Differentially-Expressed Driver Gene Sets, RNA drivers")

Cancer.Gene	Network	Network.pval	Network.mean	Druggable	Total.Druggable
BARD1	10	0.0009066	3.119244	0	0
BRCA1	55	0.0123747	2.245293	1	5
KL	17	0.0136933	2.205962	0	0
ATM	37	0.0277795	1.914478	1	4
FGF10	40	0.0301957	1.877925	0	3
CHEK2	25	0.0321094	1.850658	1	4
FANCM	14	0.0348612	1.813710	0	0
BRCA2	4	0.0407196	1.742396	1	3
RAB3A	12	0.0556434	1.592436	0	0
FANCD2	6	0.0605901	1.549839	0	1

drugRNA <- panDrugSets(driversRNA,caseids=patient, controlids=ptmatch,
                 gcount=gcSim, gene.gs=reactome.gs,
                 gene.adj=reactome.adj, drug.gs=dgi.gs, drug.adj=dgi.adj,
                 minPathways=0, nsim = set.nsim, tailEnd = set.tailEnd)                    


kable(drugRNA[1:15,1:10], row.names=FALSE, digits=3, caption="Drug Test Results, RNA-only")

Drug	N.Cancer.Genes	Cancer.Genes	N.Network.Genes	Network.Genes	Network	DNT.pval	DMT.Stat	DMT.pval	PScore
ARN-509	1	AR	1	AR	1	0.000	5.136	0.21	4.884
DROMOSTANOLONE	1	AR	1	AR	1	0.000	5.136	0.21	4.884
METHYLTESTOSTERONE	1	AR	1	AR	1	0.000	5.136	0.21	4.884
BICALUTAMIDE	1	AR	1	AR	5	0.000	3.632	0.21	4.762
ANASTROZOLE	1	ESR1	2	ESR1,CYP19A1	6	0.000	2.793	0.11	4.670
KETOCONAZOLE	1	AR	2	AR,CYP19A1	22	0.000	3.010	0.21	4.123
NILUTAMIDE	1	AR	1	AR	4	0.000	3.632	0.21	4.113
BMN673	3	BRCA1,ATM,BRCA2	3	BRCA1,ATM,BRCA2	4	0.003	3.314	0.03	4.088
RUCAPARIB	3	BRCA1,ATM,BRCA2	3	BRCA1,ATM,BRCA2	4	0.003	3.314	0.03	4.088
VELIPARIB	3	BRCA1,ATM,BRCA2	3	BRCA1,ATM,BRCA2	4	0.003	3.314	0.03	4.088
AMINOGLUTETHIMIDE	0		1	CYP19A1	2	0.000	-0.055	0.82	4.069
TALAMPANEL	0		0		4	0.000	0.000	1.00	3.595
ENZALUTAMIDE	1	AR	1	AR	10	0.002	3.632	0.21	3.499
OLAPARIB	3	BRCA1,ATM,BRCA2	3	BRCA1,ATM,BRCA2	5	0.011	3.314	0.03	3.493
LETROZOLE	1	ESR1	2	ESR1,CYP19A1	5	0.004	2.793	0.11	3.317

DNA Network Drivers

We now add DNA driver events to “turn on” the neighboring genes of cancer driver genes. Since internally the functions treat variant and copy number events equally as an “event”, we simplify by simulating a copy number event. We simulate 30 random genes, plus four specific genes that have the over-expressed Olaparib-specific genes in their networks, which are ATM, CHEK2, BRCA2, and BLM.

set.seed(2000)
colnames(gcSim) <- paste("Null",1:npt,sep="_")

i <- colnames(gcSim)[1]
patient <- i
ptmatch <- colnames(gcSim)[!(colnames(gcSim) %in% i)]
cnSim <- data.frame(CHROM=1,Gene.Symbol=rownames(reactome.adj), 
                 matrix(0, nrow=nrow(reactome.adj), ncol=npt))

colnames(cnSim)[3:ncol(cnSim)] <- c(patient, ptmatch); 
cnSim[sample(nrow(reactome.adj),n.dna.events),patient] <- 1
cnSim[cnSim$Gene.Symbol %in% cnGenes,patient] <- 1

drivRNADNA <- panGeneSets(caseids=patient,controlids=ptmatch,
                        gene.adj=reactome.adj, drug.adj=dgi.adj,
                        gcount=gcSim,cna=cnSim,tailPct=set.tailPct,
                        tailEnd=set.tailEnd, eventOnly=set.eventOnly)
kable(drivRNADNA[1:10,1:6], row.names=FALSE, caption="Differentially-Expressed Driver Gene Sets from RNA and DNA drivers")

Cancer.Gene	Network	Network.pval	Network.mean	Druggable	Total.Druggable
ATR	49	0.0003531	3.387170	1	5
BARD1	10	0.0009066	3.119244	0	0
BLM	21	0.0020327	2.873043	0	3
BRCA1	55	0.0123747	2.245293	1	5
KL	17	0.0136933	2.205962	0	0
GLI1	31	0.0172074	2.115178	0	0
ATM	37	0.0277795	1.914478	1	4
FGF10	40	0.0301957	1.877925	0	3
CHEK2	25	0.0321094	1.850658	1	4
FANCM	14	0.0348612	1.813710	0	0

drugRNADNA <- panDrugSets(drivRNADNA,caseids=patient, controlids=ptmatch,
                  gcount=gcSim, gene.gs=reactome.gs, gene.adj=reactome.adj,
                  drug.gs=dgi.gs, drug.adj=dgi.adj,
                  minPathways=0, nsim=set.nsim, tailEnd=set.tailEnd)  

kable(drugRNADNA[1:15,1:10], row.names=FALSE, digits=3, caption="Drug Test Results")

Drug	N.Cancer.Genes	Cancer.Genes	N.Network.Genes	Network.Genes	Network	DNT.pval	DMT.Stat	DMT.pval	PScore
ARN-509	1	AR	1	AR	1	0.000	1.372	0.31	4.715
DROMOSTANOLONE	1	AR	1	AR	1	0.000	1.372	0.31	4.715
METHYLTESTOSTERONE	1	AR	1	AR	1	0.000	1.372	0.31	4.715
AMINOGLUTETHIMIDE	0		1	CYP19A1	2	0.000	2.560	0.20	4.682
ANASTROZOLE	1	ESR1	2	ESR1,CYP19A1	6	0.000	2.974	0.13	4.598
BICALUTAMIDE	1	AR	1	AR	5	0.000	0.970	0.31	4.593
BMN673	4	ATR,BRCA1,ATM,BRCA2	3	BRCA1,BRCA2,ATM	4	0.003	4.084	0.01	4.566
RUCAPARIB	4	ATR,BRCA1,ATM,BRCA2	3	BRCA1,BRCA2,ATM	4	0.003	4.084	0.01	4.566
VELIPARIB	4	ATR,BRCA1,ATM,BRCA2	3	BRCA1,BRCA2,ATM	4	0.003	4.084	0.01	4.566
KETOCONAZOLE	1	AR	2	AR,CYP19A1	22	0.000	1.449	0.29	3.983
OLAPARIB	4	ATR,BRCA1,ATM,BRCA2	3	BRCA1,BRCA2,ATM	5	0.011	4.084	0.01	3.970
NILUTAMIDE	1	AR	1	AR	4	0.000	0.970	0.31	3.944
ATAMESTANE	0		1	CYP19A1	1	0.001	2.560	0.20	3.848
TESTOLACTONE	0		1	CYP19A1	1	0.001	2.560	0.20	3.848
TALAMPANEL	0		0		4	0.000	0.000	1.00	3.595

Using Gene Panels

Next we provide an example for how a subset of genes, e.g., a panel set, could be used in PANOPLY. We generate a random subset of 1000 network genes, and make a new adjacency matrix from those based on the adjacenty matrix we provided. However, a user can provide their own adjacency matrix to be just like the one we created. To make sure to get sufficient mixture of driver genes, we now set eventOnly=FALSE, which means to test all remaining driver gene networks in the panGeneSets function, which are then used in panDrugSets.

set.seed(1000)
panel.genes <- sample(colnames(reactome.adj), size=1000, replace=FALSE)
panel.adj <- reactome.adj[,panel.genes]
panel.adj <- panel.adj[rowSums(panel.adj)>0,]
dim(panel.adj)

## [1]  321 1000

drivPanel <- panGeneSets(caseids=patient, controlids=ptmatch,
                   gcount=gcSim, cna=cnSim, tailPct=set.tailPct,
                   gene.adj=panel.adj, drug.adj=dgi.adj,
                   tailEnd=set.tailEnd, eventOnly=FALSE)
kable(drivPanel[1:10,1:6], digits=3, row.names=FALSE, caption="DE Driver Gene Sets from Panel Set, RNA + DNA")

Cancer.Gene	Network	Network.pval	Network.mean	Druggable	Total.Druggable
KL	2	0.003	2.799	0	0
CD274	3	0.024	1.970	0	1
SMARCB1	1	0.037	1.792	0	1
FANCL	2	0.040	1.753	0	0
SPTA1	1	0.050	1.643	0	0
CHEK1	1	0.053	1.620	1	1
FANCM	1	0.053	1.620	0	0
ATR	9	0.058	1.575	1	2
PARK2	1	0.080	1.404	0	0
BRCA1	11	0.083	1.387	1	3

drugPanel <- panDrugSets(drivPanel, caseids=patient, controlids=ptmatch,
                   gcount=gcSim, gene.gs=reactome.gs, gene.adj=panel.adj,
                   drug.gs=dgi.gs, drug.adj=dgi.adj, minPathways=0,
                   nsim=set.nsim, tailEnd=set.tailEnd)  
kable(drugPanel[1:10,1:10], digits=3, row.names=FALSE, caption="Drug Test Results from Gene Panel")

Drug	N.Cancer.Genes	Cancer.Genes	N.Network.Genes	Network.Genes	Network	DNT.pval	DMT.Stat	DMT.pval	PScore
ANASTROZOLE	1	ESR1	2	ESR1,CYP3A4	6	0.000	2.816	0.06	4.934
BICALUTAMIDE	1	AR	2	CYP3A4,AR	5	0.000	3.717	0.18	4.829
ARN-509	1	AR	1	AR	1	0.000	0.000	1.00	4.206
DROMOSTANOLONE	1	AR	1	AR	1	0.000	0.000	1.00	4.206
METHYLTESTOSTERONE	1	AR	1	AR	1	0.000	0.000	1.00	4.206
KETOCONAZOLE	1	AR	2	CYP3A4,AR	22	0.000	3.717	0.18	4.190
AMINOGLUTETHIMIDE	0		0		2	0.000	0.000	1.00	3.983
IMATINIB	8	NTRK1,CSF1R,ABL1,BCR,RET,PDGFRA,KIT,PDGFRB	2	NTRK1,CYP3A4	19	0.004	1.646	0.03	3.891
TALAMPANEL	0		0		4	0.000	0.000	1.00	3.595
LETROZOLE	1	ESR1	2	ESR1,CYP3A4	5	0.004	2.816	0.06	3.580

Other Gene Sets

Below we give a brief explanation of how we picked gene sets to be targeted by a single drug, Olaparib, in the initial manuscript for PANOPLY.

Olaparib Gene Set A

The genes in TPR simulation set A are the direct targets of the drug Olaparib, which is annotated and used within PANOPLY. The genes targeted by any drug can be found using the same two lines of code.

pickGenesA <- dgi.gs[[pickDrug]]
pickGenesA <- pickGenesA[pickGenesA %in% colnames(reactome.adj)]
kable(data.frame(Genes.SetA=pickGenesA))

Genes.SetA
ATR
ATM
BRCA1
BRCA2
PARP1

Olaparib Gene Set B

The genes in TPR simulation set B are the genes annotated as directly affected by BRCA2, which is a target of Olaparib. So the four genes being over-expressed is like BRCA2 being turned on, causing the over-expression, and that Olaparib would target that sub-network, including BRCA2 itself. The set defined directly from the gene adjacency matrix.

pickGenesB <- names(which(reactome.adj["BRCA2",]>0))
kable(data.frame(Genes.SetB=pickGenesB))

Genes.SetB
BRCA2
ESR1
RAD51
RAD51C

Olaparib Gene Set C

The genes selected for set C is not as easily defined from annotation as sets A and B above. We aimed to have a gene set of similar size as set A (5 total), but also affecting downstream genes within multiple sub-networks that are targeted by Olaparib. We show below the genes in set C (7 total), followed by the cancer driver genes that are upstream of three or more of the set C genes.

pickGenesC <- c("BRCA1","RAD21","FANCG","FANCI","SUMO2","H2AFX","ATM")
kable(data.frame(Genes.SetC=pickGenesC))

Genes.SetC
BRCA1
RAD21
FANCG
FANCI
SUMO2
H2AFX
ATM

DriversC=names(which(rowSums(reactome.adj[,pickGenesC])>2))
kable(data.frame(Drivers.SetC=DriversC), caption="Cancer Driver Genes that target 3 or more genes from set C")

Drivers.SetC
ATM
ATR
BRCA1
CHEK2
RAD51C

Session Information

For reproducibility, we provide session information.

sessionInfo()

## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.9 (Final)
## 
## Matrix products: default
## BLAS: /usr/lib64/libblas.so.3.2.1
## LAPACK: /usr/lib64/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=C               LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] knitr_1.20          panoply_0.981       RColorBrewer_1.1-2  randomForest_4.6-12 Rgraphviz_2.22.0    graph_1.56.0       
##  [7] BiocGenerics_0.24.0 circlize_0.4.2      gage_2.28.0         MASS_7.3-47        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15         highr_0.6            formatR_1.5          pillar_1.1.0         compiler_3.4.2       XVector_0.18.0      
##  [7] tools_3.4.2          zlibbioc_1.24.0      digest_0.6.12        bit_1.1-14           evaluate_0.10.1      RSQLite_2.0         
## [13] memoise_1.1.0        tibble_1.4.2         png_0.1-7            rlang_0.1.6          DBI_0.8              httr_1.3.1          
## [19] stringr_1.3.0        Biostrings_2.46.0    S4Vectors_0.16.0     GlobalOptions_0.0.12 IRanges_2.12.0       stats4_3.4.2        
## [25] bit64_0.9-7          Biobase_2.38.0       R6_2.2.2             AnnotationDbi_1.40.0 blob_1.1.0           magrittr_1.5        
## [31] KEGGREST_1.18.0      mime_0.5             shape_1.4.3          colorspace_1.3-2     stringi_1.1.7        markdown_0.8