1 Objective

In this practical, we will learn how to perform polygenic prediction using Bayesian methods. Part 1 introduces a Bayesian method, BayesR, using individual genotypes and phenotypes. Part 2 extends the method to SBayesR (summary-data-based BayesR), which requires only GWAS summary statistics. Part 3 further enhances the method by incorporating functional annotations (SBayesRC). The goal of this practical exercise is to understand the core algorithm of these Bayesian methods and the principle of Markov chain Monte Carlo (MCMC). Therefore, we will use a small data set and R scripts. In practice, I recommend using GCTB (https://cnsgenomics.com/software/gctb/#Overview) for real data analysis, as it can efficiently analyze large-scale genomic and trait data.

2 Library requirement

To run the provided R scripts, you need to install the following packages in R or RStudio:

install.packages("MCMCpack")
install.packages("truncnorm")

3 Data

The data set for this practical exercise includes:

  • A training population (discovery GWAS) of 325 individuals, genotyped for 10 SNPs.
  • A validation population of 31 individuals (hold-out sample).

The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.

Load the data in R.

# data for training population
data_path="./"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))

# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))

4 Part I - BayesR

BayesR is a Bayesian method for PGS prediction using individual-level genotype and phenotype data.

4.1 Understanding the algorithm

This section helps you understand the R code used in the method.

Different Bayesian methods vary only in the prior specification for SNP effects. Here, we focus on a Bayesian method called BayesR. The BayesR algorithm, with its MCMC sampling scheme, is implemented in the R script bayesr.R.

Do not run the R code below in this section.

BayesR assumes each SNP effect follows a mixture of normal distributions. The input parameters include

# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
  ...
}
  • niter is the number of iterations for MCMC sampling
  • X is the genotype matrix
  • y is the vector of phenotypes
  • gamma is a vector of scaling factors for the variance of the mixture components
  • startPi is a vector of starting values for the mixture proportions \(\pi\)
  • startH2 is the starting value of SNP-based heritability.

The number of elements in gamma and startPi determines the number of mixture components, and they must match.
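Before calling bayesr, it is worth checking these inputs yourself. A small defensive sketch (the values shown are the example settings used later in Section 4.2; the gamma[1] == 0 check reflects the convention in this script that the first component is the point mass at zero):

```r
gamma   = c(0, 0.01, 0.1, 1)       # component variance scales (example values)
startPi = c(0.5, 0.3, 0.15, 0.05)  # starting mixture proportions (example values)
stopifnot(length(gamma) == length(startPi))  # one scale per mixture component
stopifnot(abs(sum(startPi) - 1) < 1e-8)      # mixture proportions must sum to 1
stopifnot(gamma[1] == 0)                     # first component is the point mass at zero
```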

Let’s dive into more details to understand what the code does.

The first step is to declare and initialise variables. This step also adjusts y for the population mean and SNP effects (which are initially set to zero), so that ycorr represents the residuals.

  # Step 1. Declaration and initialisation of variables
  n = nrow(X)          # number of observations
  m = ncol(X)          # number of SNPs
  ndist = length(startPi)  # number of mixture distributions
  pi = startPi         # starting value for pi
  h2 = startH2         # starting value for heritability
  vary = var(y)        # phenotypic variance
  varg = vary*h2       # starting value for genetic variance
  vare = vary*(1-h2)   # starting value for residual variance
  sigmaSq = varg/(m*sum(gamma*pi))    # common factor of SNP effect variance
  nub = 4              # prior degrees of freedom for SNP effect variance
  nue = 4              # prior degrees of freedom for residual variance
  scaleb = (nub-2)/nub*sigmaSq  # prior scale parameter for SNP effect variance
  scalee = (nue-2)/nue*vare     # prior scale parameter for residual variance
  beta = array(0,m)    # vector of SNP effects
  beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
  mu = mean(y)         # overall mean
  xpx = apply(X, 2, crossprod)  ## diagonal elements of X'X
  probDelta = vector("numeric", ndist)
  logDelta = array(0,2)
  keptIter = NULL      # keep MCMC samples
  
  # adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
  ycorr = y - mu 

Then the Gibbs sampling begins. Gibbs sampling iteratively draws samples from the full conditional distribution of each parameter (i.e., posterior distribution conditional on the values of other parameters). Over time, this converges to the joint posterior distribution.
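To build intuition for Gibbs sampling before reading the BayesR code, here is a minimal self-contained example (unrelated to the trait data, and safe to run): for a bivariate standard normal with correlation rho, each full conditional is \(x \mid y \sim N(\rho y, 1-\rho^2)\) and vice versa, and alternating these draws recovers the joint distribution:

```r
set.seed(1)
rho = 0.8; niter = 5000
x = y = numeric(niter); xv = yv = 0   # starting values
for (iter in 1:niter) {
  xv = rnorm(1, rho*yv, sqrt(1 - rho^2))  # sample x from its full conditional given y
  yv = rnorm(1, rho*xv, sqrt(1 - rho^2))  # sample y from its full conditional given x
  x[iter] = xv; y[iter] = yv
}
cor(x, y)  # close to rho = 0.8
```

The BayesR sampler below follows exactly this pattern, just with many more parameters (mean, SNP effects, variances) cycled within each iteration.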

The first step within each iteration is to sample the population mean (fixed effect) from its full conditional distribution, assuming a flat prior. This distribution is centered at the BLUE (best linear unbiased estimator) solution, with variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.

  # MCMC begins
  for (iter in 1:niter){
    # Step 2. Sample the mean from a normal posterior
    ycorr = ycorr + mu   # unadjust y with the old sample
    rhs = sum(ycorr)     # right hand side of the mixed model equation
    invLhs = 1/n         # inverse of left hand side of the mixed model equation
    muHat = invLhs*rhs   # BLUE solution
    mu = rnorm(1, muHat, sqrt(invLhs*vare))   # posterior is a normal distribution
    ycorr = ycorr - mu   # adjust y with the new sample

The next step is to sample the mixture component indicator variable and effect size for each SNP, conditional on the effects of all the other SNPs. We first sample the indicator variable delta for the mixture distribution membership (multinomial posterior), and then sample the SNP effect beta from the corresponding normal distribution or a point mass at zero.

    # Step 3. Sample each SNP effect from a multi-normal mixture distribution
    logPi = log(pi)
    logPiComp = log(1-pi)
    invSigmaSq = 1/(gamma*c(sigmaSq))
    logSigmaSq = log(gamma*c(sigmaSq))
    nsnpDist = rep(0, ndist)  # number of SNPs assigned to each mixture component
    ssq = 0  # weighted sum of squares of SNP effects 
    nnz = 0  # number of nonzero effects
    ghat = array(0,n)  # individual genetic value
    # loop over SNPs
    for (j in 1:m){
      oldSample = beta[j]
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      
      # sample the mixture distribution membership
      logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi  # log likelihood + prior for nonzero effects
      logDelta[1] = logPi[1]                                              # log likelihood + prior for zero effect
      for (k in 1:ndist) {
        probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k]))   # posterior probability for each distribution membership
      }
      
      delta = sample(1:ndist, 1, prob = probDelta)   # indicator variable for the distribution membership
      nsnpDist[delta] = nsnpDist[delta] + 1
      
      if (delta > 1) {
        beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))  # given the distribution membership, the posterior is a normal distribution
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        ghat = ghat + X[,j]*beta[j]                              
        ssq = ssq + beta[j]^2 / gamma[delta]
        nnz = nnz + 1
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        beta[j] = 0
      }
    }   
    beta_mcmc[iter,] = beta

Next, we sample the other parameters in the model, including the mixing probabilities (\(\pi\)), SNP effect variance (\(\sigma^2_{\beta}\)), and residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a Dirichlet distribution. For \(\sigma^2_{\beta}\) and \(\sigma^2_e\), the full conditionals are scaled inverse chi-square distributions.

    # Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
    pi = rdirichlet(1, nsnpDist + 1)
    
    # Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
    sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
    
    # Step 6. Sample the residual variance from a scaled inverse chi-square distribution
    vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)
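As a self-contained aside (safe to run separately, with hypothetical degrees of freedom nu and scale s2), the scaled inverse chi-square draw used in Steps 5 and 6, x = (sum of squares + nu*scale)/chi-square(df), can be checked in its basic form against the theoretical mean nu*s2/(nu - 2):

```r
set.seed(1)
nu = 10; s2 = 2                 # hypothetical degrees of freedom and scale
draws = nu*s2/rchisq(1e5, nu)   # scaled inverse chi-square samples
mean(draws)                     # close to the theoretical mean nu*s2/(nu - 2) = 2.5
```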

Given the genetic values and residual variance, we can compute the SNP-based heritability. This is an appealing property of the MCMC approach: given the sampled values of SNP effects, you can easily compute any statistic of interest.

    # Step 7. Compute the SNP-based heritability
    varg = var(ghat)
    h2  = varg/(varg + vare)

The final step in each iteration is to store the MCMC samples of the parameter values:

    keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))

After completing all MCMC iterations, we compute posterior means as point estimates.

  colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
  postMean = apply(keptIter, 2, mean)  # posterior mean of MCMC samples is used as the point estimate for each parameter
  cat("\nPosterior mean:\n")
  print(postMean)
  return(list(par=keptIter, beta=beta_mcmc))

4.2 Data analysis

Let’s load the script in R.

source("bayesr.R")

BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, by assuming a mixture of multiple normal distributions, including a point mass at zero.

\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]
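To see what this prior implies, we can draw effects from the mixture directly. This is a standalone sketch (the \(\gamma\), \(\pi\), and \(\sigma^2_\beta\) values are illustrative): about half the effects are exactly zero, and the rest come from normal components of increasing variance.

```r
set.seed(1)
gamma   = c(0, 0.01, 0.1, 1)       # illustrative component scales
pi      = c(0.5, 0.3, 0.15, 0.05)  # illustrative mixture proportions
sigmaSq = 1                        # illustrative common variance factor
comp = sample(1:4, 10000, replace = TRUE, prob = pi)  # mixture membership per effect
beta = rnorm(10000, 0, sqrt(gamma[comp]*sigmaSq))     # effect drawn from its component
mean(beta == 0)                    # proportion of exact zeros, close to pi[1] = 0.5
```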

For example, we can set a 4-component mixture model:

gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
res_bayesr = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
## 
##  iter  100, nnz =    6, sigmaSq =  2.934, h2 =  0.519, vare =  4.043, varg =  4.361 
## 
##  iter  200, nnz =   10, sigmaSq =  2.360, h2 =  0.571, vare =  3.154, varg =  4.193 
## 
##  iter  300, nnz =    5, sigmaSq =  3.638, h2 =  0.446, vare =  4.391, varg =  3.529 
## 
##  iter  400, nnz =    3, sigmaSq =  8.631, h2 =  0.464, vare =  4.076, varg =  3.533 
## 
##  iter  500, nnz =    2, sigmaSq =  6.698, h2 =  0.491, vare =  4.126, varg =  3.986 
## 
##  iter  600, nnz =    8, sigmaSq =  5.785, h2 =  0.505, vare =  3.680, varg =  3.757 
## 
##  iter  700, nnz =    5, sigmaSq =  4.445, h2 =  0.548, vare =  3.550, varg =  4.310 
## 
##  iter  800, nnz =    7, sigmaSq =  5.414, h2 =  0.532, vare =  3.607, varg =  4.092 
## 
##  iter  900, nnz =    8, sigmaSq =  1.477, h2 =  0.480, vare =  4.140, varg =  3.821 
## 
##  iter 1000, nnz =    3, sigmaSq =  4.810, h2 =  0.487, vare =  3.959, varg =  3.763 
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.3028915 0.3123460 0.1926270 0.1921354 6.8590000 4.1479521 0.5002787 3.8989247 
##      Varg 
## 3.9123225

The output includes sampled values of key parameters every 100 iterations.

After MCMC, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can summarise the posterior mean and standard deviation for each parameter by

colMeans(res_bayesr$par)
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.3028915 0.3123460 0.1926270 0.1921354 6.8590000 4.1479521 0.5002787 3.8989247 
##      Varg 
## 3.9123225
apply(res_bayesr$par, 2, sd)
##        Pi1        Pi2        Pi3        Pi4        Nnz    SigmaSq         h2 
## 0.19183050 0.21272367 0.16415316 0.12186105 2.21946954 4.37814400 0.03403552 
##       Vare       Varg 
## 0.31663773 0.41436110

The posterior mean gives a point estimate, and the posterior standard deviation gives an estimate of its uncertainty.
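Because the full set of MCMC draws is available, any posterior summary is a one-liner. A sketch with a toy posterior sample (a hypothetical Beta-distributed stand-in; substitute res_bayesr$par[, "h2"] after burn-in for the real chain):

```r
set.seed(1)
h2_draws = rbeta(900, 50, 50)        # toy stand-in for post-burn-in h2 samples
mean(h2_draws > 0.45)                # posterior probability that h2 exceeds 0.45
quantile(h2_draws, c(0.025, 0.975))  # 95% credible interval for h2
```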

You can plot parameter traces over iterations to check convergence. For example:

# Trace plot for SNP 1, which is the causal variant with effect size of 2
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,1], xlab="Iteration", ylab="beta[,1]") 
abline(h = 2, col="red")

# Trace plot for SNP 2, which is a null SNP
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,2], xlab="Iteration", ylab="beta[,2]") 
abline(h = 0, col="red")

# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,5], xlab="Iteration", ylab="beta[,5]") 
abline(h = 1, col="red")

Remarks:

  • For SNP 1, the trace plot fluctuates around 2, consistent with its true effect.
  • For SNP 5, the plot fluctuates around 1, with more variability due to LD with nearby SNPs.
  • For SNP 2 (null SNP), the plot fluctuates around 0.

This indicates good mixing and convergence.
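Mixing can also be quantified, not just eyeballed, via the chain's autocorrelation. A standalone sketch using a simulated AR(1) chain as a toy stand-in for an MCMC trace (with the real output, apply acf to res_bayesr$beta[,1]; low autocorrelation, i.e. high effective sample size, indicates good mixing):

```r
set.seed(1)
phi = 0.5                                                 # hypothetical autocorrelation
chain = as.numeric(arima.sim(list(ar = phi), n = 10000))  # toy stand-in for an MCMC trace
rho1 = acf(chain, plot = FALSE)$acf[2]                    # lag-1 autocorrelation
ess = length(chain)*(1 - rho1)/(1 + rho1)                 # crude effective sample size
```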


We can also plot the posterior distribution for each SNP effect. We discard the first 100 iterations of the program as “burn in”:

# Posterior distribution of SNP 1 effect
plot(density(res_bayesr$beta[100:1000,1]))

# Posterior distribution of SNP 2 effect
plot(density(res_bayesr$beta[100:1000,2]))

Remarks:

  • For causal SNPs, the posterior distribution of effects is approximately normal but may be slightly skewed or truncated due to the mixture prior.
  • For null SNPs, the posterior is often a spike at zero, reflecting the point-mass component of the prior.
  • For other parameters like heritability and variance components, the posteriors are not necessarily symmetric, especially with small sample sizes.


We can measure SNP-trait association by computing the posterior inclusion probability (PIP), the probability of the SNP being fitted with a nonzero effect in the model across MCMC samples. It is widely used in fine-mapping analyses.

For example, PIP for SNP 1 can be calculated as

mean(res_bayesr$beta[100:1000,1] != 0)
## [1] 1

This could be generalised for all SNPs:

pip = colMeans(res_bayesr$beta[100:1000, ] != 0)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")

Question: Which SNPs are likely to be the causal variants based on PIP?




Answer: SNPs with high PIP (close to 1) are likely to be causal (note that SNP 1 and SNP 5 are causal variants in this simulation). A plot of PIP across SNPs will highlight those with strong support. SNPs with PIP near 0 are unlikely to be associated.


To get the SNP effect estimates, we calculate the posterior means of SNP effects:

betaMean = colMeans(res_bayesr$beta)


Question: How do we predict PGS using the BayesR effect estimates? Does the prediction accuracy make sense?




Answer:

ghat_bayesr = Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5267547

Unlike the conventional clumping + P-value thresholding (C+PT) method, BayesR provides joint SNP effect estimates which can be directly used to construct PGS without LD pruning or SNP selection. Due to the use of a mixture prior, BayesR allows effect sizes to be exactly zero, which helps avoid overfitting noise. The theoretical upper bound for prediction accuracy is the simulated heritability of 0.5. The observed prediction accuracy is slightly over 0.5 because of the large sampling variance due to the small validation sample size.

5 Part II - SBayesR

5.1 Understanding the algorithm

This section helps you understand the R code used in the method.

Do not run the R code below in this section.

The algorithm for SBayesR (Summary-data-based BayesR) is implemented in sbayesr.R.

Here we focus on the code that is different from individual-data-based BayesR implemented in bayesr.R.

Compared to BayesR, SBayesR includes an additional step at the beginning to scale all GWAS marginal effects so that they are in per-genotype-SD units with phenotypic variance equal to 1. GWAS is usually performed on the 0/1/2 genotypes, whereas the algorithm assumes standardised genotypes, as shown in the lecture.

  # Step 1. scale the GWAS SNP effects
  scale = sqrt(1/(n*se^2 + b^2))  # scale factors for marginal SNP effects
  b_scaled  = b*scale             # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
  vary  = 1                       # phenotypic variance = 1 after the scaling

Once MCMC begins, the next step is to sample SNP effects. Unlike BayesR, SBayesR does not include fixed effects, because they have already been adjusted in the GWAS. For each SNP, if its effect is not zero, we sample it from a normal distribution where the mean is the BLUP solution to the “per-SNP mixed model equation”. The only difference in this part between SBayesR and BayesR is the use of summary data equivalents:

  • \(X_j'y_{corr}\) is replaced by \(n b_{corr}\)
  • \(X_j'X_j\) is replaced by \(n\)
      rhs = n*bcorr[i] + n*oldSample          # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(n/c(vare) + invSigmaSq)   # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                 # BLUP solution
      
      # if the effect is not zero, we sample it from a normal distribution 
      beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta]))   # given the distribution membership, the posterior is a normal distribution
      
      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample  # right hand side of the mixed model equation
      rhs = rhs/vare
      invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq)        # inverse of left hand side of the mixed model equation
      betaHat = invLhs*c(rhs)                           # BLUP solution
      ###################################################################

After sampling the SNP effect \(\beta_j\), instead of adjusting \(y\) (as in BayesR), we update \(b_{corr}\) in SBayesR. This strategy is known as “right-hand side updating”, because it updates the vector of \(X'y\) (\(=nb\)), the right-hand side of the mixed model equations.

      if (delta > 1) {
        # ...
        bcorr = bcorr + R[,i]*(oldSample - beta[i])         # update bhatcorr with the new sample
        # ...
      } else {
        if (oldSample) bcorr = bcorr + R[,i]*oldSample      # update bhatcorr with the new sample which is zero
        # ...
      }

      ###################################################################
      # In contrast, this is what we have in individual-data-based BayesR
      if (delta > 1) {
        # ...
        ycorr = ycorr + X[,j]*(oldSample - beta[j])              # update ycorr with the new sample
        # ...
      } else {
        if (oldSample) ycorr = ycorr + X[,j]*oldSample           # update ycorr with the new sample which is zero
        # ...
      }
      ###################################################################

This approach is efficient because \(b_{corr}\) is only updated for SNPs in LD with SNP \(i\), which is typically a smaller subset within an LD block.
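The equivalence between the incremental right-hand side update and a full recomputation of \(b_{corr} = b - R\beta\) can be verified with a small standalone sketch (toy R, beta, and b; safe to run). The last line also checks the identity \(\beta'(b - b_{corr}) = \beta'R\beta\) used later for the genetic variance:

```r
set.seed(1)
m = 5
A = matrix(rnorm(m*m), m)
R = cov2cor(crossprod(A))                       # a toy LD correlation matrix
beta = rnorm(m)                                 # current joint SNP effects
b = as.numeric(R %*% beta + rnorm(m, 0, 0.1))   # toy scaled marginal effects
bcorr = b - as.numeric(R %*% beta)              # residualised right-hand side
i = 3; oldSample = beta[i]; beta[i] = rnorm(1)  # new sample for SNP i
bcorr = bcorr + R[, i]*(oldSample - beta[i])    # incremental right-hand side update
max(abs(bcorr - (b - as.numeric(R %*% beta))))  # matches full recomputation (~0)
crossprod(beta, b - bcorr) - t(beta) %*% R %*% beta  # beta'(b - bcorr) = beta' R beta (~0)
```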

The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute genetic variance, we use

\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since phenotypic variance is set to 1, SNP-based heritability equals the genetic variance.

    # Step 5. Compute the SNP-based heritability
    bRb = crossprod(beta, (b-bcorr))  # compute genetic variance = beta' R beta
    varg = bRb
    h2  = varg

We can also estimate the residual variance. However, unlike BayesR, the residual variance is not guaranteed to be positive. Issues like LD mismatches between GWAS and reference samples, variation in per-SNP sample size, or errors in the GWAS summary statistics can result in a negative residual variance estimate. When this happens, the algorithm cannot proceed, so we force the residual variance to equal 1 (phenotypic variance).

    sse = (vary - varg)*n
    vare = (sse + nue*scalee)/rchisq(1, n+nue)
    if (vare <= 0) vare = vary  # sometimes sse can be negative and would cause a problem

5.2 Data analysis

The SBayesR method is implemented in sbayesr.R.

source("sbayesr.R")

First, we need to obtain GWAS effect estimates (using 0/1/2 genotypes):

# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,] 

Compute LD correlation matrix:

R = cor(X)

Then we scale the GWAS effects to be in per-genotype-standard-deviation unit:

nind = nrow(X)  # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2))  # calculate the scale factor for each SNP
b_scaled = b*scale  # scale the marginal effects

Now we are ready to run SBayesR:

res_sbayesr = sbayesr(b, se, nind, R)
## 
##  iter  100, nnz =    2, sigmaSq =  1.151, h2 =  0.559, vare =  0.422
## 
##  iter  200, nnz =    2, sigmaSq =  1.569, h2 =  0.390, vare =  0.588
## 
##  iter  300, nnz =    2, sigmaSq =  0.998, h2 =  0.476, vare =  0.573
## 
##  iter  400, nnz =    3, sigmaSq =  1.061, h2 =  0.543, vare =  0.467
## 
##  iter  500, nnz =    7, sigmaSq =  1.458, h2 =  0.454, vare =  0.517
## 
##  iter  600, nnz =    2, sigmaSq = 10.218, h2 =  0.455, vare =  0.538
## 
##  iter  700, nnz =    6, sigmaSq =  0.757, h2 =  0.493, vare =  0.503
## 
##  iter  800, nnz =    3, sigmaSq =  3.030, h2 =  0.442, vare =  0.521
## 
##  iter  900, nnz =    4, sigmaSq =  5.699, h2 =  0.579, vare =  0.448
## 
##  iter 1000, nnz =    3, sigmaSq =  5.274, h2 =  0.521, vare =  0.463
## 
## Posterior mean:
##       Pi1       Pi2       Pi3       Pi4       Nnz   SigmaSq        h2      Vare 
## 0.4582901 0.2420419 0.1867185 0.1129495 4.5690000 2.2151993 0.5039532 0.4953590
beta_sbayesr = colMeans(res_sbayesr$beta)

Run BayesR as benchmark:

beta_bayesr = colMeans(res_bayesr$beta)

Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?

cor(beta_bayesr, beta_sbayesr)
## [1] 0.9933062
plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)




Answer: The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.


The posterior inclusion probabilities (PIP) are also expected to be similar between SBayesR and BayesR:

delta_bayesr = (res_bayesr$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr)    # frequency of the indicator variable being one across MCMC cycles

delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)

plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")

plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")

6 Part III - SBayesRC

6.1 Understanding the algorithm

This section helps you understand the R code used in the method.

Do not run the R code below in this section.

The algorithm for SBayesRC (an extension of SBayesR to incorporate functional annotations) is implemented in sbayesrc.R.

Here are the key differences between SBayesR and SBayesRC.

First, we need to initiate additional variables related to the SNP annotations.

  # annotation related variables
  snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist)  # per-SNP pi given the SNP annotations
  alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1))         # MCMC samples of annotation effects
  p = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional probability p
  z = matrix(nrow = m, ncol = ndist-1)                        # per-SNP conditional distribution membership indicator 
  alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1)        # vector of annotation effects
  sigmaSqAlpha = rep(1, ndist-1)                              # variance of annotation effects
  sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1)               # MCMC samples of the variance of annotation effects
  nAnno = ncol(anno)                                          # number of annotations
  annoDiag = apply(anno, 2, crossprod)                        # diagonal values of A'A where A is the annotation matrix

Second, when sampling SNP effects, we record the conditional distribution membership for the SNP:

      if (delta > 1) {
        # ...
        for(j in 1:(delta-1)){  # set one to the "step up" indicator variable
          z[i,j] = 1
        }
      } else {
        # ...
      }

Next, there is an extra step to sample the annotation effects. It may look complicated, but the sampling scheme is similar to how we sample the SNP effects with individual-level data. The main difference is that instead of using the genotype matrix as X and the phenotypes as y, here we use the annotation matrix as X and, as y, a latent variable (sampled from a truncated normal distribution) whose mean is a linear combination of the SNP annotations. The effect of an annotation in this model indicates how much it changes the prior probability that a SNP has a nonzero effect.

    # Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
    for (j in 1:(ndist-1)) {  # have ndist-1 number of conditional distributions
      if (j==1) {
        idx = 1:m             # for the first conditional distribution, data are all SNPs
        annoDiagj = annoDiag  # diagonal values of A'A
      } else {
        idx = which(z[,j-1] > 0)  # for the subsequent conditional distributions, data are the SNPs in the previous distribution
        if (length(idx) > 1) {
          annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod)  # recompute A'A depending on the SNP memberships
        }
      }
      
      if (length(idx)) {
        zj = z[idx,j]
        mu = anno[idx,] %*% matrix(alpha[,j])  # linear combination of annotations, which will be the mean of truncated normal distribution
        lj = array(0, length(zj))   # latent variable
        if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0)   # sample latent variable
        if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b =  Inf)   # sample latent variable
        # sample annotation effects using Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
        lj = lj - c(mu)  # adjust the latent variable by all annotation effects
        # sample intercept with a flat prior
        oldSample = alpha[1,j]
        rhs = sum(lj) + m*oldSample
        invLhs = 1.0/m
        ahat = invLhs*rhs
        alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
        lj = lj + (oldSample - alpha[1,j])
        # sample each annotation effect with a normal prior
        if (nAnno > 1) {
          for (k in 2:nAnno) {
            oldSample = alpha[k,j]
            rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample;
            invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
            ahat = invLhs*rhs
            alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
            lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
          }
          # sample annotation effect variance from a scaled inverse chi-square distribution
          sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
        }
      }
    }
    alpha_mcmc[iter,] = c(alpha)  # store MCMC samples of annotation effects 
    sigmaSqAlpha_mcmc[iter,] = sigmaSqAlpha

Given the sampled annotation effects, we can then compute the conditional probabilities for each SNP which determine how likely the SNP is to belong to higher-effect mixture components.

    # Step 5. Compute per-SNP conditional probabilities from the annotation effects
    p = apply(alpha, 2, function(x){pnorm(anno %*% x)})

Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) of each SNP belonging to the mixture distribution components:

    # Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
    for (k in 1:ndist) {
      if (k < ndist) {
        snpPi[,k] = 1.0 - p[,k]
      } else {
        snpPi[,k] = 1
      }
      if (k > 1) {
        for (kk in 1:(k-1)) {
          snpPi[,k] = snpPi[,k]*p[,kk]
        }
      }
    }
    #for example, we want
    #snpPi[,1] = 1 - p[,1]
    #snpPi[,2] = (1-p[,2])*p[,1]
    #snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
    #snpPi[,4] = p[,1]*p[,2]*p[,3]

These \(\pi\) values are then used in the next iteration of SNP effect sampling, just like in SBayesR.
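A standalone numeric check of this mapping, using made-up conditional probabilities for one SNP with ndist = 4 (safe to run): the resulting joint probabilities match the commented example above and sum to 1.

```r
p = c(0.10, 0.99, 0)  # hypothetical "step-up" conditional probabilities p[1:(ndist-1)]
snpPi = numeric(4)
for (k in 1:4) {
  snpPi[k] = if (k < 4) 1 - p[k] else 1   # last component has no step-up term of its own
  if (k > 1) for (kk in 1:(k-1)) snpPi[k] = snpPi[k]*p[kk]
}
snpPi       # 0.900 0.001 0.099 0.000
sum(snpPi)  # the joint probabilities sum to 1
```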

6.2 Data analysis

Due to extensive LD among SNPs, it’s often difficult to identify causal variants, especially those with smaller effects. For example, SNP 5 only has a PIP of about 0.5. Functional annotations can provide orthogonal information to LD, helping to better identify causal variants. Here, we demonstrate how this works in principle.

A simplified version of the SBayesRC algorithm is implemented in sbayesrc.R. It uses the LD correlation matrix directly, rather than the eigen-decomposition data described in the original paper.

source("sbayesrc.R")

SBayesRC requires a table of SNP annotations. Suppose SNPs 1, 3, and 5 (where SNPs 1 and 5 are causal variants) are non-synonymous variants. We can add a binary annotation for all SNPs indicating whether they are non-synonymous. For illustration purposes, this annotation is designed to be informative, as it covers the two causal variants, but it also includes a null SNP, making the scenario slightly more realistic.

The annotation table has dimensions of number of SNPs (rows) by number of annotations (there can be multiple) plus one, with a column of ones as the first column (an intercept for the generalised linear model that links annotations to SNP effects).

int = rep(1,10)   # a vector of one as intercept
nonsynonymous = c(1,0,1,0,1,0,0,0,0,0)  # whether the SNP is non-synonymous
anno = cbind(int, nonsynonymous)
print(anno)
##       int nonsynonymous
##  [1,]   1             1
##  [2,]   1             0
##  [3,]   1             1
##  [4,]   1             0
##  [5,]   1             1
##  [6,]   1             0
##  [7,]   1             0
##  [8,]   1             0
##  [9,]   1             0
## [10,]   1             0

We are now ready to run SBayesRC using this annotation matrix. Although we provide annotation information, no prior weights are assigned. The method learns the impact of annotations jointly with SNP effects from the data. This is a unified Bayesian approach, unlike stepwise approaches that estimate annotation enrichment before fitting the model.

res_sbayesrc = sbayesrc(b, se, nind, R, anno)
## 
##  iter  100, nnz =    3, sigmaSq =  3.505, h2 =  0.430, vare =  0.609 
## 
##  iter  200, nnz =    2, sigmaSq =  8.816, h2 =  0.424, vare =  0.556 
## 
##  iter  300, nnz =    4, sigmaSq =  1.370, h2 =  0.577, vare =  0.399 
## 
##  iter  400, nnz =    5, sigmaSq =  1.531, h2 =  0.467, vare =  0.474 
## 
##  iter  500, nnz =    1, sigmaSq =  8.006, h2 =  0.471, vare =  0.537 
## 
##  iter  600, nnz =    2, sigmaSq =  8.698, h2 =  0.459, vare =  0.521 
## 
##  iter  700, nnz =    2, sigmaSq =  3.711, h2 =  0.467, vare =  0.527 
## 
##  iter  800, nnz =    3, sigmaSq =  1.367, h2 =  0.511, vare =  0.538 
## 
##  iter  900, nnz =    2, sigmaSq =  1.345, h2 =  0.407, vare =  0.623 
## 
##  iter 1000, nnz =    3, sigmaSq =  1.832, h2 =  0.428, vare =  0.557 
## 
## Posterior mean of model parameters:
##         Pi1         Pi2         Pi3         Pi4         Nnz     SigmaSq 
## 0.697659448 0.089256972 0.213083580 0.003803306 2.901000000 2.912358373 
##          h2        Vare 
## 0.496914173 0.500841968 
## 
## Annotation effects:
##                      p2        p3         p4
## int           -1.247819 2.4375043 -8.2470279
## nonsynonymous  1.471441 0.1569215 -0.3325109
## 
## Conditional probabilities:
##                   p2     p3 p4
## int           0.1060 0.9926  0
## nonsynonymous 0.5885 0.9953  0
## 
## Joint probabilities:
##                  Pi1    Pi2    Pi3 Pi4
## int           0.8940 0.0008 0.1053   0
## nonsynonymous 0.4115 0.0028 0.5857   0

Let’s have a look at the output. First, you may have noticed that Nnz (the number of nonzero effects) has substantially decreased compared to (S)BayesR above, and is now close to the number of simulated causal variants. Second, the largest annotation effect is in the cell for p2 and nonsynonymous. This means the annotation has greatly increased the prior probability of a nonzero effect for the annotated SNPs. The same can be seen in the Conditional probabilities result, where the cell for p2 and nonsynonymous is much larger than that for the intercept row (0.59 vs 0.11), indicating that an annotated SNP is far more likely to move from the zero-effect distribution to a nonzero-effect distribution. From the Joint probabilities result, we can see that for annotated SNPs most of the prior mass falls in the medium effect size distribution (Pi3).
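The Conditional probabilities in the output follow directly from the Annotation effects through the probit link used in the sampler, and the p2 column can be reproduced by hand:

```r
# reproduce the p2 column from the reported annotation effects via the probit link
p2_null = pnorm(-1.247819)             # non-annotated SNP: intercept only
p2_anno = pnorm(-1.247819 + 1.471441)  # annotated SNP: intercept + annotation effect
round(c(p2_null, p2_anno), 4)          # 0.1060 and 0.5885, matching the output above
```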


Remark: These results suggest that the annotation is highly informative. It plays a crucial role in distinguishing causal variants from non-causal ones, significantly enhancing the model’s ability to identify true signals.

To evaluate this, let’s check the PIP of SNPs from SBayesRC.

delta_sbayesrc = (res_sbayesrc$beta != 0)  # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc)    # frequency of the indicator variable being one across MCMC cycles

plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")

Question: How would you interpret the result?




Answer: Both SNP 1 and 5 (the true causal variants) stand out with high PIPs, indicating strong posterior probability of causality, partly due to their functional annotations. Notably, SNP 5 did not show strong association evidence in the (S)BayesR analysis above, which does not incorporate annotations. Although SNP 3 shares the same annotation, it does not have a high PIP. This is because posterior probability reflects a combination of likelihood from GWAS and the prior information from functional annotation.

Let’s also check how incorporating annotations affects polygenic prediction.

beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc = Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)
## [1] 0.5204186