Introduction to CoxPH-SuSiE

Last updated: 2024-12-18

Checks: 7 0

Knit directory: survival-susie/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20230201)

The command set.seed(20230201) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: e6ba60e

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e6ba60e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    analysis/.DS_Store

Untracked files:
    Untracked:  survival_demo.RData

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/coxph_susie_intro.Rmd) and HTML (docs/coxph_susie_intro.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	e6ba60e	Peter Carbonetto	2024-12-18	workflowr::wflow_publish("analysis/coxph_susie_intro.Rmd", verbose = TRUE,
html	ede6168	Peter Carbonetto	2024-12-18	Added prior variance estimation and the SER BFs to the
Rmd	0789339	Peter Carbonetto	2024-12-18	workflowr::wflow_publish("analysis/coxph_susie_intro.Rmd", verbose = TRUE,
html	9189825	Peter Carbonetto	2024-12-13	In coxph_susie_intro demo, added call to ibss_from_ser() with
Rmd	e483e17	Peter Carbonetto	2024-12-13	workflowr::wflow_publish("analysis/coxph_susie_intro.Rmd", verbose = TRUE)
html	dd1fd56	Peter Carbonetto	2024-12-10	Updated the coxph_susie_intro page.
Rmd	0dec649	Peter Carbonetto	2024-12-10	Updated authors and date for coxph_susie_intro vignette.
html	183f1fb	Peter Carbonetto	2024-12-10	Added comparison of coefs to coxph_susie_intro demo.
Rmd	991bd75	Peter Carbonetto	2024-12-10	workflowr::wflow_publish("coxph_susie_intro.Rmd", verbose = TRUE,
html	de9be10	Peter Carbonetto	2024-12-09	Small fix to one of the plots in the introductory vignette.
Rmd	35b0289	Peter Carbonetto	2024-12-09	workflowr::wflow_publish("coxph_susie_intro.Rmd", verbose = TRUE)
html	7442aeb	Peter Carbonetto	2024-12-09	Added more explanatory text to the introductory vignette.
Rmd	10c35d0	Peter Carbonetto	2024-12-09	workflowr::wflow_publish("coxph_susie_intro.Rmd", verbose = TRUE)
Rmd	caed3b0	Peter Carbonetto	2024-12-09	Made a few improvements to the text of the intro vignette.
Rmd	b323562	Peter Carbonetto	2024-12-09	Added fine-mapping step and fine-mapping plots to coxph_susie_intro Rmd.
Rmd	0328388	Peter Carbonetto	2024-12-09	Added a bit of text to the coxph_susie_intro vignette.
html	b849946	Peter Carbonetto	2024-12-09	Added some material to the coxph-susie demo.
html	92e5671	Peter Carbonetto	2024-12-09	Added link to the overview page.
Rmd	32e28c5	Peter Carbonetto	2024-12-09	workflowr::wflow_publish("analysis/index.Rmd")
html	d421a83	Peter Carbonetto	2024-12-06	Build site.
Rmd	2a72a02	Peter Carbonetto	2024-12-06	workflowr::wflow_publish("analysis/coxph_susie_intro.Rmd")
html	be527ce	Peter Carbonetto	2024-12-06	First build of the coxph_susie_intro vignette.
Rmd	f62c839	Peter Carbonetto	2024-12-06	workflowr::wflow_publish("analysis/coxph_susie_intro.Rmd")

This short vignette is intended to (i) introduce CoxPH-SuSiE for fine-mapping time-to-event (TTE) phenotypes, and (ii) illustrate how to use CoxPH-SuSiE on an example genetic data set.

Initial setup

To run the code below, you will need the survival, susieR and logisticsusie R packages, which can be installed with these commands in R:

install.packages("remotes")
install.packages("survival")
remotes::install_github("stephenslab/susieR")
remotes::install_github("karltayeb/logisticsusie")

Also, we will use the ggplot2 and cowplot packages to create some helpful visualizations of the data and results:

install.packages("ggplot2")
install.packages("cowplot")

Once the packages are installed, load them.

library(survival)
library(susieR)
library(logisticsusie)
library(ggplot2)
library(cowplot)

Set the seed for reproducibility.

set.seed(1)

The data set

The example data set is found in the “data” directory of this GitHub repository, or can be downloaded by running these commands in R:

data_url <- paste0("https://raw.githubusercontent.com/",
                   "yunqiyang0215/survival-susie/",
                   "refs/heads/master/data/survival_demo.RData")
download.file(data_url,"survival_demo.RData")
load("survival_demo.RData")

In this example, the TTE data are “censored” disease diagnosis times, meaning that the observations are either: the time of diagnosis (for uncensored samples); or a lower limit on the time of diagnosis for censored samples. The censored samples are indicated by the “+” (this is called “right censoring”):

pheno <- Surv(time,status)
head(pheno,n = 10)
#  [1] 0.3085771  0.4654124  2.7448602+ 1.3071323+ 0.2252482  3.3300303+
#  [7] 1.0663338+ 7.4103627+ 0.9918607  0.0637775

Among the 574 diagnosis times in this data set, 330 are observed, and the remaining 244 are right-censored:

table(status)
# status
#   0   1 
# 244 330

The “geno” matrix contains the (imputed) genotype data for 1,000 candidate genetic variants (SNPs) located near each other on the chromosome.

nrow(geno)
ncol(geno)
# [1] 574
# [1] 1000

Association analysis using the CoxPH model

Association tests performed using the Cox proportional hazards (CoxPH) model show that many of thee genetic variants are associated with diagnosis time:

pvals <-
  apply(geno,2,
        function (x) summary(coxph(pheno ~ x))$coefficients[1,"Pr(>|z|)"])
pdat <- data.frame(pos = 1:1000,pval = pvals,causal = (b != 0))
ggplot(pdat,aes(x = pos,y = -log10(pval),color = causal,shape = causal)) +
  geom_point(size = 1.25) +
  scale_color_manual(values = c("black","darkorange")) +
  scale_shape_manual(values = c(1,17)) +
  labs(x = "chromosomal position",y = "-log10 p-value") +
  theme_cowplot(font_size = 10)

Version	Author	Date
de9be10	Peter Carbonetto	2024-12-09
7442aeb	Peter Carbonetto	2024-12-09
92e5671	Peter Carbonetto	2024-12-09

Since we simulated these data, we know that 3 of the 1,000 SNPs truly affect the TTE phenotype (drawn in orange in the plot above). All three of these “causal” SNPs show evidence for association from the CoxPH-based association tests, but the strength of support for association varies quite a bit among the three SNPs. Further, many more SNPs that do not affect the phenotype are also strongly associated with the phenotype. This happens because many of the SNPs are strongly correlated with each other (in strong linkage disequilibrium, or “LD”).

Next we will run the CoxPH-SuSiE fine-mapping method to try and narrow the association signal down to a smaller set of causal candidates.

Fine-mapping analysis using the CoxPH model

To run the fine-mapping, we first need to define a function that performs a (Bayesian) univariate association analysis (here, using the CoxPH model). More precisely, this function takes as input the regression data (X, Y, as well as an additional “offset” term), and the prior variance of the regression coefficient (v0), and outputs four quantities: (1) the posterior mean of the regression coefficient; (2) its posterior variance; (3) the log-Bayes factor comparing the model against the “null” model (when the coefficient is zero); and (4) an estimate of the prior variance parameter (here we use the maximum-likelihood estimate, which is simply the posterior second moment of the regression coefficient. (This last quantity can be a constant if you want to keep the prior variance at a fixed value.)

surv_uni_fun <- function (x, y, e, prior_variance,
                          estimate_intercept = 0, ...) {
  v0   <- prior_variance                         
  fit  <- coxph(y ~ x + offset(e))
  out  <- summary(fit)$coefficients
  bhat <- out[1,"coef"]
  s    <- out[1,"se(coef)"]
  z    <- bhat/s
  lbf  <- log(s^2/(v0 + s^2))/2 + z^2/2*v0/(v0 + s^2)
  lbf  <- lbf - z^2/2 + summary(fit)$logtest["test"]/2
  v1   <- 1/(1/v0 + 1/s^2)
  mu1  <- v1*bhat/s^2
  return(list(mu = mu1,var = v1,lbf = lbf,
              prior_variance = mu1^2 + v1,
              intercept = 0))
}

Now we can fit the CoxPH-based SuSiE model using the ibss_from_ser() function from the logisticsusie package. Note that this step may take a minute or two to run depending on the computing platform:

t0 <- proc.time()
fit <- ibss_from_ser(geno,pheno,L = 3,tol = 0.0001,maxit = 100,
                     ser_function = ser_from_univariate(surv_uni_fun))
t1 <- proc.time()
print(t1 - t0)
# 7.216 sec elapsed
#    user  system elapsed 
#   7.052   0.169   7.222

(As an aside, note that the univariate regression computations are trivially parallelizable, so you have a multicore computing platform, you can make the method run faster, e.g.,

t0 <- proc.time()
fit <- ibss_from_ser(geno,pheno,L = 3,tol = 0.0001,maxit = 100,
                     ser_function = ser_from_univariate(surv_uni_fun),
                     num_cores = 4)
t1 <- proc.time()
print(t1 - t0)
# 2.586 sec elapsed
#    user  system elapsed 
#   8.151   1.504   2.590

In this run we were able to achieve about a 3x speedup with 4 threads.)

Now we use the susie_get_cs() function from the susieR package to compute credible sets (CSs):

class(fit) <- c("susie","list")
out <- susie_get_cs(fit,geno,coverage = 0.95,min_abs_cor = 0.1)
out$cs
# $L2
# [1] 994
# 
# $L1
# [1] 550 555 561 566 582 598

Interestingly, in one credible set, CoxPH-SuSiE is able to identify a single causal SNP with very high confidence (shown in blue in the plot). Additionally, a second set of causal candidates is identified with 6 plausible candidates (shown in orange). Both of these CSs contain the true causal SNPs.

pdat <- cbind(pdat,CS = 0)
pdat[out$cs$L1,"CS"] <- 1
pdat[out$cs$L2,"CS"] <- 2
pdat <- transform(pdat,CS = factor(CS))
cs_colors <- c("black","darkorange","dodgerblue")
ggplot(pdat,aes(x = pos,y = -log10(pval),color = CS,shape = causal)) +
  geom_point(size = 1.25) +
  scale_color_manual(values = cs_colors) +
  scale_shape_manual(values = c(1,17)) +
  labs(x = "chromosomal position",y = "-log10 p-value") +
  theme_cowplot(font_size = 10)

Version	Author	Date
7442aeb	Peter Carbonetto	2024-12-09

The PIPs (“posterior inclusion probabilities”) measure the relative support for the SNPs within each CS. Plotting the PIPs for all SNPs is helpful for understanding precisely which SNPs are the top candidates:

pdat <- cbind(pdat,pip = fit$pip)
ggplot(pdat,aes(x = pos,y = pip,color = CS,shape = causal)) +
  geom_point(size = 1.25) +
  scale_color_manual(values = cs_colors) +
  scale_shape_manual(values = c(1,17)) +
  labs(x = "chromosomal position",y = "PIP") +
  theme_cowplot(font_size = 10)

Version	Author	Date
9189825	Peter Carbonetto	2024-12-13
7442aeb	Peter Carbonetto	2024-12-09

For example, in CS 1 (orange), 3 out of the 6 SNPs are much stronger candidates than the others.

It is also interesting that the third causal SNPs that was not identified does show some modest evidence for being a causal SNP in the PIP plot.

Let’s compare the coefficient estimates against the ground-truth estimates. These are contained in the “mu” output. Here we see that the first CS (L1) accurately estimates the coefficient for the second causal SNP (the true value is -2), and the second CS (L2) accurately estimates the coefficient for the third causal SNP (the true value is 2).

rownames(fit$mu) <- c("L1","L2","L3")
causal_snps <- which(b != 0)
round(rbind(b,fit$mu)[,causal_snps],digits = 3)
#     [,1]   [,2]  [,3]
# b  1.000 -2.000 2.000
# L1 1.215 -2.088 0.252
# L2 0.834 -0.076 1.927
# L3 0.988 -0.063 0.127

Finally, note that CoxPH-SuSiE is robust to overestimating the number of causal SNPs. For example, suppose we set the number of causal SNPs to 10:

fit <- ibss_from_ser(geno,pheno,L = 10,tol = 0.0001,maxit = 100,
                     ser_function = ser_from_univariate(surv_uni_fun))

The log-Bayes factors quantifying support for each CS (i.e., each causal SNP) are in the “lbf_ser” output:

fit$lbf_ser
#         L1         L2         L3         L4         L5         L6         L7 
# 113.309194   8.742860  -1.467698  -1.465683  -1.463014  -1.460718  -1.459312 
#         L8         L9        L10 
#  -1.458954  -1.459594  -1.461082

After the first two CSs, the remaining 8 CSs provide very little improvement in fit. This result tells us that we can focus on the results for the first two CSs and ignore the rest. (Note these log-Bayes factors are computed using the natural logarithm.)

sessionInfo()
# R version 4.3.3 (2024-02-29)
# Platform: aarch64-apple-darwin20 (64-bit)
# Running under: macOS Sonoma 14.7.1
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: America/Chicago
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] cowplot_1.1.3            ggplot2_3.5.0            logisticsusie_0.0.0.9007
# [4] susieR_0.12.42           survival_3.5-8           workflowr_1.7.1         
# 
# loaded via a namespace (and not attached):
#  [1] gtable_0.3.4       xfun_0.42          bslib_0.6.1        processx_3.8.3    
#  [5] lattice_0.22-5     callr_3.7.5        vctrs_0.6.5        tools_4.3.3       
#  [9] ps_1.7.6           generics_0.1.3     parallel_4.3.3     tibble_3.2.1      
# [13] fansi_1.0.6        highr_0.10         pkgconfig_2.0.3    Matrix_1.6-5      
# [17] RcppParallel_5.1.7 lifecycle_1.0.4    compiler_4.3.3     farver_2.1.1      
# [21] stringr_1.5.1      git2r_0.33.0       tictoc_1.2.1       munsell_0.5.0     
# [25] getPass_0.2-4      httpuv_1.6.14      htmltools_0.5.7    sass_0.4.8        
# [29] yaml_2.3.8         later_1.3.2        pillar_1.9.0       crayon_1.5.2      
# [33] jquerylib_0.1.4    whisker_0.4.1      cachem_1.0.8       tidyselect_1.2.1  
# [37] digest_0.6.34      stringi_1.8.3      purrr_1.0.2        dplyr_1.1.4       
# [41] labeling_0.4.3     splines_4.3.3      rprojroot_2.0.4    fastmap_1.1.1     
# [45] grid_4.3.3         colorspace_2.1-0   cli_3.6.2          magrittr_2.0.3    
# [49] Rfast_2.1.0        utf8_1.2.4         withr_3.0.0        scales_1.3.0      
# [53] promises_1.2.1     RcppZiggurat_0.1.6 rmarkdown_2.26     httr_1.4.7        
# [57] matrixStats_1.2.0  evaluate_0.23      knitr_1.45         irlba_2.3.5.1     
# [61] rlang_1.1.3        Rcpp_1.0.12        mixsqp_0.3-54      glue_1.7.0        
# [65] rstudioapi_0.15.0  reshape_0.8.9      jsonlite_1.8.8     R6_2.5.1          
# [69] plyr_1.8.9         fs_1.6.3

Introduction to CoxPH-SuSiE

Yunqi Yang, Karl Tayeb, Peter Carbonetto and Matthew Stephens

December 10, 2024

Initial setup

The data set

Association analysis using the CoxPH model

Fine-mapping analysis using the CoxPH model