In this practical, we will learn how to perform polygenic prediction using Bayesian methods. Part 1 introduces a Bayesian method, BayesR, using individual-level genotypes and phenotypes. Part 2 extends the method to SBayesR (summary-data-based BayesR), which requires only GWAS summary statistics. Part 3 further enhances the method by incorporating functional annotations (SBayesRC). The goal of this practical exercise is to understand the core algorithm of these Bayesian methods and the principle of Markov chain Monte Carlo (MCMC). Therefore, we will use a small data set and R scripts. In practice, I recommend using GCTB (https://cnsgenomics.com/software/gctb/#Overview) for real data analysis, as it can efficiently handle large-scale genomic and trait data.
To run the provided R scripts, you need to install the following packages in your R or Rstudio:
install.packages("MCMCpack")
install.packages("truncnorm")
The data set for this practical exercise includes genotype and phenotype data for a training population and a validation population.
The trait was simulated such that SNP 1 has an effect size of 2 and SNP 5 has an effect size of 1. The trait heritability is set at 0.5.
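For intuition, such a trait can be simulated along the following lines (a hypothetical sketch: the genotypes, sample size, and seed here are made up, and the actual data are provided in the files below):

```r
set.seed(1)
n = 50; m = 10
X = matrix(rbinom(n*m, 2, 0.5), n, m)       # hypothetical 0/1/2 genotypes
beta = rep(0, m); beta[1] = 2; beta[5] = 1  # SNP 1 and SNP 5 are the causal variants
g = X %*% beta                              # true genetic values
vare = c(var(g))*(1 - 0.5)/0.5              # residual variance chosen so that h2 = 0.5
y = g + rnorm(n, 0, sqrt(vare))             # phenotype = genetic value + residual
```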
Load the data in R.
# data for training population
data_path="./"
X <- as.matrix(read.table(paste0(data_path,"xmat_trn.txt")))
y <- as.matrix(read.table(paste0(data_path,"yvec_trn.txt")))
# data for validation population
Xval <- as.matrix(read.table(paste0(data_path,"xmat_val.txt")))
yval <- as.matrix(read.table(paste0(data_path,"yvec_val.txt")))
BayesR method for PGS prediction using individual-level genotype and phenotype data.
This section helps you understand the R code used in the method.
Different Bayesian methods vary only in the prior specification for
SNP effects. Here, we focus on a Bayesian method called BayesR. The
algorithm for BayesR with MCMC sampling scheme is implemented in the R
script bayesr.R.
Do not run the R code below in this section.
BayesR assumes each SNP effect follows a mixture of normal distributions. The input parameters include
# Input parameters for bayesr. Do not run this.
bayesr = function(X, y, niter, gamma, startPi, startH2){
...
}
- niter is the number of iterations for MCMC sampling
- X is the genotype matrix
- y is the vector of phenotypes
- gamma is a vector of scaling factors for the variance of the mixture components
- startPi is a vector of starting values for the mixture proportions \(\pi\)
- startH2 is the starting value of SNP-based heritability

The number of elements in gamma and startPi determines the number of mixture components, and they must match.
Let’s dive into more details to understand what the code does.
The first step is to declare and initialise variables. This step also
adjusts y for the population mean and SNP effects (which are initially
set to zero), so that ycorr represents the residuals.
# Step 1. Declaration and initialisation of variables
n = nrow(X) # number of observations
m = ncol(X) # number of SNPs
ndist = length(startPi) # number of mixture distributions
pi = startPi # starting value for pi
h2 = startH2 # starting value for heritability
vary = var(y) # phenotypic variance
varg = vary*h2 # starting value for genetic variance
vare = vary*(1-h2) # starting value for residual variance
sigmaSq = varg/(m*sum(gamma*pi)) # common factor of SNP effect variance
nub = 4 # prior degrees of freedom for SNP effect variance
nue = 4 # prior degrees of freedom for residual variance
scaleb = (nub-2)/nub*sigmaSq # prior scale parameter for SNP effect variance
scalee = (nue-2)/nue*vare # prior scale parameter for residual variance
beta = array(0,m) # vector of SNP effects
beta_mcmc = matrix(0,niter,m) # MCMC samples of SNP effects
mu = mean(y) # overall mean
xpx = apply(X, 2, crossprod) ## diagonal elements of X'X
probDelta = vector("numeric", ndist)
logDelta = array(0,2)
keptIter = NULL # keep MCMC samples
# adjust/correct y for population mean and all SNP effects (which all have zero as initial value)
ycorr = y - mu
Then the Gibbs sampling begins. Gibbs sampling iteratively draws samples from the full conditional distribution of each parameter (i.e., posterior distribution conditional on the values of other parameters). Over time, this converges to the joint posterior distribution.
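To see the principle in isolation before diving into BayesR, here is a minimal standalone Gibbs sampler (a toy example, not part of the practical scripts) for the mean and variance of a simple normal model, assuming a flat prior on the mean and a Jeffreys prior on the variance:

```r
set.seed(1)
y = rnorm(200, mean = 3, sd = 2)  # toy data
n = length(y)
sigmaSq = 1                       # starting value for the variance
keep = matrix(0, 1000, 2)
for (iter in 1:1000) {
  mu = rnorm(1, mean(y), sqrt(sigmaSq/n))   # full conditional of the mean
  sigmaSq = sum((y - mu)^2)/rchisq(1, n)    # full conditional of the variance
  keep[iter, ] = c(mu, sigmaSq)
}
colMeans(keep)  # posterior means, close to the simulated values 3 and 4
```

Each draw conditions on the current value of the other parameter; the pairs (mu, sigmaSq) form a Markov chain whose stationary distribution is the joint posterior.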
The first step within each iteration is to sample the population mean (fixed effect) from its full conditional distribution, assuming a flat prior. This distribution is centered at the BLUE (best linear unbiased estimator) solution, with variance equal to the inverse of the coefficient matrix in the “mini” mixed model equation for the mean.
# MCMC begins
for (iter in 1:niter){
# Step 2. Sample the mean from a normal posterior
ycorr = ycorr + mu # unadjust y with the old sample
rhs = sum(ycorr) # right hand side of the mixed model equation
invLhs = 1/n # inverse of left hand side of the mixed model equation
muHat = invLhs*rhs # BLUE solution
mu = rnorm(1, muHat, sqrt(invLhs*vare)) # posterior is a normal distribution
ycorr = ycorr - mu # adjust y with the new sample
The next step is to sample the mixture component indicator variable
and effect size for each SNP, conditional on the effects of all the
other SNPs. We first sample the indicator variable delta
for the mixture distribution membership (multinomial posterior), and
then sample the SNP effect beta from the corresponding
normal distribution or a point mass at zero.
# Step 3. Sample each SNP effect from a multi-normal mixture distribution
logPi = log(pi)
logPiComp = log(1-pi)
invSigmaSq = 1/(gamma*c(sigmaSq))
logSigmaSq = log(gamma*c(sigmaSq))
nsnpDist = rep(0, ndist) # number of SNPs assigned to each mixture component
ssq = 0 # weighted sum of squares of SNP effects
nnz = 0 # number of nonzero effects
ghat = array(0,n) # individual genetic value
# loop over SNPs
for (j in 1:m){
oldSample = beta[j]
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# sample the mixture distribution membership
logDelta = 0.5*(log(invLhs) - logSigmaSq + betaHat*c(rhs)) + logPi # log likelihood + prior for nonzero effects
logDelta[1] = logPi[1] # log likelihood + prior for zero effect
for (k in 1:ndist) {
probDelta[k] = 1.0/sum(exp(logDelta - logDelta[k])) # posterior probability for each distribution membership
}
delta = sample(1:ndist, 1, prob = probDelta) # indicator variable for the distribution membership
nsnpDist[delta] = nsnpDist[delta] + 1
if (delta > 1) {
beta[j] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
ghat = ghat + X[,j]*beta[j]
ssq = ssq + beta[j]^2 / gamma[delta]
nnz = nnz + 1
} else {
if (oldSample) ycorr = ycorr + X[,j]*oldSample # update ycorr with the new sample which is zero
beta[j] = 0
}
}
beta_mcmc[iter,] = beta
Next, we sample the other parameters in the model, including the mixing probabilities (\(\pi\)), SNP effect variance (\(\sigma^2_{\beta}\)), and residual variance (\(\sigma^2_e\)). The full conditional distribution for \(\pi\) is a Dirichlet distribution (which reduces to a Beta distribution when there are only two mixture components). The full conditionals for \(\sigma^2_{\beta}\) and \(\sigma^2_e\) are scaled inverse chi-squared distributions.
# Step 4. Sample the distribution membership probabilities from a Dirichlet distribution
pi = rdirichlet(1, nsnpDist + 1)
# Step 5. Sample the SNP effect variance from a scaled inverse chi-square distribution
sigmaSq = (ssq + nub*scaleb)/rchisq(1, nnz+nub)
# Step 6. Sample the residual variance from a scaled inverse chi-square distribution
vare = (crossprod(ycorr) + nue*scalee)/rchisq(1, n+nue)
Given the genetic values and residual variance, we can compute the SNP-based heritability. This is an appealing property of the MCMC approach – given the sampled values of SNP effects, you can easily compute any statistics of interest.
# Step 7. Compute the SNP-based heritability
varg = var(ghat)
h2 = varg/(varg + vare)
The final step in each iteration is to store the MCMC samples of the parameter values:
keptIter <- rbind(keptIter,c(pi, nnz, sigmaSq, h2, vare, varg))
After completing all MCMC iterations, we compute posterior means as point estimates.
colnames(keptIter) <- c(paste0("Pi", 1:length(pi)),"Nnz","SigmaSq","h2","Vare","Varg")
postMean = apply(keptIter, 2, mean) # posterior mean of MCMC samples is used as the point estimate for each parameter
cat("\nPosterior mean:\n")
print(postMean)
return(list(par=keptIter, beta=beta_mcmc))
Let’s load the script in R.
source("bayesr.R")
BayesR prior assumes that some SNPs have zero effect, some have small effects, and some have large effects, by assuming a mixture of multiple normal distributions, including a point mass at zero.
\[ \beta_j \sim \sum_k \pi_k N(0, \gamma_k \sigma^2_\beta) \]
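To build intuition for this prior, we can draw effects directly from the mixture (a standalone sketch assuming \(\sigma^2_\beta = 1\)):

```r
set.seed(1)
gamma = c(0, 0.01, 0.1, 1)                          # variance scaling factors
pi = c(0.5, 0.3, 0.15, 0.05)                        # mixture proportions
k = sample(1:4, 10000, replace = TRUE, prob = pi)   # component membership
beta = rnorm(10000, 0, sqrt(gamma[k]))              # component 1 is a point mass at zero
mean(beta == 0)  # roughly 0.5: about half the effects are exactly zero
```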
For example, we can set a 4-component mixture model:
gamma = c(0, 0.01, 0.1, 1)
startPi = c(0.5, 0.3, 0.15, 0.05)
res_bayesr = bayesr(X = X, y = y, gamma = gamma, startPi = startPi)
##
## iter 100, nnz = 6, sigmaSq = 2.934, h2 = 0.519, vare = 4.043, varg = 4.361
##
## iter 200, nnz = 10, sigmaSq = 2.360, h2 = 0.571, vare = 3.154, varg = 4.193
##
## iter 300, nnz = 5, sigmaSq = 3.638, h2 = 0.446, vare = 4.391, varg = 3.529
##
## iter 400, nnz = 3, sigmaSq = 8.631, h2 = 0.464, vare = 4.076, varg = 3.533
##
## iter 500, nnz = 2, sigmaSq = 6.698, h2 = 0.491, vare = 4.126, varg = 3.986
##
## iter 600, nnz = 8, sigmaSq = 5.785, h2 = 0.505, vare = 3.680, varg = 3.757
##
## iter 700, nnz = 5, sigmaSq = 4.445, h2 = 0.548, vare = 3.550, varg = 4.310
##
## iter 800, nnz = 7, sigmaSq = 5.414, h2 = 0.532, vare = 3.607, varg = 4.092
##
## iter 900, nnz = 8, sigmaSq = 1.477, h2 = 0.480, vare = 4.140, varg = 3.821
##
## iter 1000, nnz = 3, sigmaSq = 4.810, h2 = 0.487, vare = 3.959, varg = 3.763
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.3028915 0.3123460 0.1926270 0.1921354 6.8590000 4.1479521 0.5002787 3.8989247
## Varg
## 3.9123225
The output includes sampled values of key parameters every 100 iterations.
After MCMC, you can find the sampled values for the model parameters and SNP effects for each iteration in the result list. For example, you can summarise the posterior mean and standard deviation for each parameter by
colMeans(res_bayesr$par)
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.3028915 0.3123460 0.1926270 0.1921354 6.8590000 4.1479521 0.5002787 3.8989247
## Varg
## 3.9123225
apply(res_bayesr$par, 2, sd)
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2
## 0.19183050 0.21272367 0.16415316 0.12186105 2.21946954 4.37814400 0.03403552
## Vare Varg
## 0.31663773 0.41436110
The posterior mean gives a point estimate, and the standard deviation gives an estimate of its uncertainty (the posterior standard error).
You can plot parameter traces over iterations to check convergence. For example:
# Trace plot for SNP 1, which is the causal variant with effect size of 2
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,1], xlab="Iteration", ylab="beta[,1]")
abline(h = 2, col="red")
# Trace plot for SNP 2, which is a null SNP
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,2], xlab="Iteration", ylab="beta[,2]")
abline(h = 0, col="red")
# Trace plot for SNP 5, which is a causal variant with a smaller effect of 1 and in LD with other SNPs
plot(1:nrow(res_bayesr$beta), res_bayesr$beta[,5], xlab="Iteration", ylab="beta[,5]")
abline(h = 1, col="red")
Remarks: The sampled effects fluctuate around the true simulated values without drifting, which indicates good mixing and convergence.
We can also plot the posterior distribution for each SNP effect, discarding the first 100 iterations as burn-in:
# Posterior distribution of SNP 1 effect
plot(density(res_bayesr$beta[100:1000,1]))
# Posterior distribution of SNP 2 effect
plot(density(res_bayesr$beta[100:1000,2]))
We can measure SNP-trait association by computing the posterior inclusion probability (PIP), i.e., the probability that a SNP is fitted with a nonzero effect in the model, estimated as the proportion of MCMC samples in which its effect is nonzero. PIP is widely used in fine-mapping analyses.
For example, PIP for SNP 1 can be calculated as
mean(res_bayesr$beta[100:1000,1] != 0)
## [1] 1
This could be generalised for all SNPs:
pip = colMeans(res_bayesr$beta[100:1000, ] != 0)
plot(pip, type="h", xlab="SNP", ylab="Posterior inclusion probability")
Question: Which SNPs are likely to be the causal variants based on PIP?
Answer: SNPs with high PIP (close to 1) are likely to be causal (note that SNP 1 and SNP 5 are causal variants in this simulation). A plot of PIP across SNPs will highlight those with strong support. SNPs with PIP near 0 are unlikely to be associated.
To get the SNP effect estimates, we calculate the posterior means of SNP effects:
betaMean = colMeans(res_bayesr$beta)
Question: How do we predict the PGS using the BayesR effect estimates? Does the prediction accuracy make sense?
Answer:
ghat_bayesr = Xval %*% betaMean
summary(lm(yval ~ ghat_bayesr))$r.squared
## [1] 0.5267547
Unlike the conventional clumping + P-value thresholding (C+PT) approach, BayesR provides joint SNP effect estimates that can be used directly to construct a PGS, without LD pruning or SNP selection. Due to the use of a mixture prior, BayesR allows effect sizes to be exactly zero, which helps avoid overfitting noise. The theoretical upper bound for prediction accuracy is the simulated heritability of 0.5. The observed prediction accuracy is slightly above 0.5 because of the large sampling variance due to the small validation sample size.
SBayesR method for PGS prediction using GWAS summary statistics.
This section helps you understand the R code used in the method.
Do not run the R code below in this section.
The algorithm for SBayesR (Summary-data-based BayesR) is implemented
in sbayesr.R.
Here we focus on the code that is different from
individual-data-based BayesR implemented in bayesr.R.
Compared to BayesR, SBayesR includes an additional step at the beginning to scale all GWAS marginal effects so that they are in the per-genotype-SD unit with phenotypic variance equal to 1. GWAS is usually performed using the 0/1/2 genotypes, whereas the algorithm assumes standardised genotypes, as shown in the lecture.
# Step 1. scale the GWAS SNP effects
scale = sqrt(1/(n*se^2 + b^2)) # scale factors for marginal SNP effects
b_scaled = b*scale # scaled marginal SNP effects (in units of genotype sd / phenotype sd)
vary = 1 # phenotypic variance = 1 after the scaling
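As a sanity check on this scaling formula (a standalone sketch with simulated data, not part of the practical scripts), the scaled marginal effect should closely match the regression coefficient obtained after standardising both the genotype and the phenotype:

```r
set.seed(1)
n = 1000
x = rbinom(n, 2, 0.3)                              # hypothetical 0/1/2 genotypes
y = 0.3*x + rnorm(n)                               # hypothetical phenotype
fit = summary(lm(y ~ x))$coefficients
b = fit[2, 1]; se = fit[2, 2]                      # GWAS marginal effect and its s.e.
b_scaled = b*sqrt(1/(n*se^2 + b^2))                # the scaling used in SBayesR
b_std = coef(lm(c(scale(y)) ~ c(scale(x))))[2]     # effect in per-SD units
# b_scaled and b_std agree up to terms of order 1/n
```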
Once MCMC begins, the next step is to sample SNP effects. Unlike BayesR, SBayesR does not include fixed effects, because they have already been adjusted in the GWAS. For each SNP, if its effect is not zero, we sample it from a normal distribution where the mean is the BLUP solution to the “per-SNP mixed model equation”. The only difference in this part between SBayesR and BayesR is the use of summary data equivalents:
rhs = n*bcorr[i] + n*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(n/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
# if the effect is not zero, we sample it from a normal distribution
beta[i] = rnorm(1, betaHat[delta], sqrt(invLhs[delta])) # given the distribution membership, the posterior is a normal distribution
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
rhs = crossprod(X[,j], ycorr) + xpx[j]*oldSample # right hand side of the mixed model equation
rhs = rhs/vare
invLhs = 1.0/(xpx[j]/c(vare) + invSigmaSq) # inverse of left hand side of the mixed model equation
betaHat = invLhs*c(rhs) # BLUP solution
###################################################################
After sampling the SNP effect \(\beta_j\), instead of adjusting \(y\) (as in BayesR), we update \(b_{corr}\) in SBayesR. This strategy is known as “right-hand side updating”, because it updates the vector of \(X'y\) (\(=nb\)), the right-hand side of the mixed model equations.
if (delta > 1) {
# ...
bcorr = bcorr + R[,i]*(oldSample - beta[i]) # update bhatcorr with the new sample
# ...
} else {
if (oldSample) bcorr = bcorr + R[,i]*oldSample # update bhatcorr with the new sample which is zero
# ...
}
###################################################################
# In contrast, this is what we have in individual-data-based BayesR
if (delta > 1) {
# ...
ycorr = ycorr + X[,j]*(oldSample - beta[j]) # update ycorr with the new sample
# ...
} else {
if (oldSample) ycorr = ycorr + X[,j]*oldSample # update ycorr with the new sample which is zero
# ...
}
###################################################################
This approach is efficient because \(b_{corr}\) is only updated for SNPs in LD with SNP \(i\), which is typically a smaller subset within an LD block.
The Gibbs samplers for \(\pi\) and \(\sigma^2_{\beta}\) are the same as in BayesR. To compute genetic variance, we use
\[\sigma^2_g = \beta'R\beta = \beta'(b - b_{corr})\] because \(b_{corr} = b - R\beta\). Since phenotypic variance is set to 1, SNP-based heritability equals the genetic variance.
# Step 5. Compute the SNP-based heritability
bRb = crossprod(beta, (b-bcorr)) # compute genetic variance = beta' R beta
varg = bRb
h2 = varg
We can also estimate the residual variance. However, unlike BayesR, the residual variance is not guaranteed to be positive. Issues like LD mismatches between GWAS and reference samples, variation in per-SNP sample size, or errors in the GWAS summary statistics can result in a negative residual variance estimate. When this happens, the algorithm cannot proceed, so we force the residual variance to equal 1 (phenotypic variance).
sse = (vary - varg)*n
vare = (sse + nue*scalee)/rchisq(1, n+nue)
if (vare <= 0) vare = vary # sometimes sse can be negative and would cause a problem
SBayesR method is implemented in sbayesr.R.
source("sbayesr.R")
First, we need to obtain GWAS effect estimates (using 0/1/2 genotypes):
# run GWAS on the 0/1/2 genotypes
fit = apply(X, 2, function(x){summary(lm(y~x))$coefficients[2,1:2]})
b = fit[1,]
se = fit[2,]
Compute LD correlation matrix:
R = cor(X)
Then we scale the GWAS effects to be in per-genotype-standard-deviation unit:
nind = nrow(X) # GWAS sample size
scale = sqrt(1/(nind*se^2 + b^2)) # calculate the scale factor for each SNP
b_scaled = b*scale # scale the marginal effects
Now we are ready to run SBayesR:
res_sbayesr = sbayesr(b, se, nind, R)
##
## iter 100, nnz = 2, sigmaSq = 1.151, h2 = 0.559, vare = 0.422
##
## iter 200, nnz = 2, sigmaSq = 1.569, h2 = 0.390, vare = 0.588
##
## iter 300, nnz = 2, sigmaSq = 0.998, h2 = 0.476, vare = 0.573
##
## iter 400, nnz = 3, sigmaSq = 1.061, h2 = 0.543, vare = 0.467
##
## iter 500, nnz = 7, sigmaSq = 1.458, h2 = 0.454, vare = 0.517
##
## iter 600, nnz = 2, sigmaSq = 10.218, h2 = 0.455, vare = 0.538
##
## iter 700, nnz = 6, sigmaSq = 0.757, h2 = 0.493, vare = 0.503
##
## iter 800, nnz = 3, sigmaSq = 3.030, h2 = 0.442, vare = 0.521
##
## iter 900, nnz = 4, sigmaSq = 5.699, h2 = 0.579, vare = 0.448
##
## iter 1000, nnz = 3, sigmaSq = 5.274, h2 = 0.521, vare = 0.463
##
## Posterior mean:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq h2 Vare
## 0.4582901 0.2420419 0.1867185 0.1129495 4.5690000 2.2151993 0.5039532 0.4953590
beta_sbayesr = colMeans(res_sbayesr$beta)
Run BayesR as benchmark:
beta_bayesr = colMeans(res_bayesr$beta)
Question: Are BayesR and SBayesR SNP effect estimates the same? What could possibly cause the difference?
cor(beta_bayesr, beta_sbayesr)
## [1] 0.9933062
plot(beta_bayesr, beta_sbayesr)
abline(a=0, b=1)
Answer: The small differences are mostly due to variation in MCMC sampling, known as Monte Carlo variance.
The posterior inclusion probabilities (PIP) are also expected to be similar between SBayesR and BayesR:
delta_bayesr = (res_bayesr$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_bayesr = colMeans(delta_bayesr) # frequency of the indicator variable being one across MCMC cycles
delta_sbayesr = (res_sbayesr$beta != 0)
pip_sbayesr = colMeans(delta_sbayesr)
plot(pip_bayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="BayesR")
plot(pip_sbayesr, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesR")
SBayesRC method for PGS prediction incorporating functional annotations.
This section helps you understand the R code used in the method.
Do not run the R code below in this section.
The algorithm for SBayesRC (an extension of SBayesR that incorporates functional annotations) is implemented in sbayesrc.R.
Here are the key differences between SBayesR and SBayesRC.
First, we need to initiate additional variables related to the SNP annotations.
# annotation related variables
snpPi = matrix(rep(pi, m), byrow=TRUE, nrow=m, ncol=ndist) # per-SNP pi given the SNP annotations
alpha_mcmc = matrix(0, niter, ncol(anno)*(ndist-1)) # MCMC samples of annotation effects
p = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional probability p
z = matrix(nrow = m, ncol = ndist-1) # per-SNP conditional distribution membership indicator
alpha = matrix(0, nrow = ncol(anno), ncol = ndist-1) # vector of annotation effects
sigmaSqAlpha = rep(1, ndist-1) # variance of annotation effects
sigmaSqAlpha_mcmc = matrix(0, niter, ndist-1) # MCMC samples of the variance of annotation effects
nAnno = ncol(anno) # number of annotations
annoDiag = apply(anno, 2, crossprod) # diagonal values of A'A where A is the annotation matrix
Second, when sampling SNP effects, we record the conditional distribution membership for the SNP:
if (delta > 1) {
# ...
for(j in 1:(delta-1)){ # set the "step-up" indicator variables to one
z[i,j] = 1
}
} else {
# ...
}
Next, there is an extra step to sample the annotation effects. It may look complicated, but the sampling scheme is similar to how we sample the SNP effects with individual-level data. The main difference is that instead of using the genotype matrix as X and the phenotypes as y, here we use the annotation matrix as X and, as y, a latent variable (sampled from a truncated normal distribution) whose mean is a linear combination of the SNP annotations. The effect of an annotation in this model reflects how much it changes the prior probability that a SNP has a nonzero effect.
# Step 4. Sample the annotation effects on the SNP distribution mixing probabilities
for (j in 1:(ndist-1)) { # have ndist-1 number of conditional distributions
if (j==1) {
idx = 1:m # for the first conditional distribution, data are all SNPs
annoDiagj = annoDiag # diagonal values of A'A
} else {
idx = which(z[,j-1] > 0) # for the subsequent conditional distributions, data are the SNPs in the previous distribution
if (length(idx) > 1) {
annoDiagj = apply(as.matrix(anno[idx,]), 2, crossprod) # recompute A'A depending on the SNP memberships
}
}
if (length(idx)) {
zj = z[idx,j]
mu = anno[idx,] %*% matrix(alpha[,j]) # linear combination of annotations, which will be the mean of truncated normal distribution
lj = array(0, length(zj)) # latent variable
if (sum(zj==0)) lj[zj==0] = rtruncnorm(sum(zj==0), mean = mu[zj==0], sd = 1, a = -Inf, b = 0) # sample latent variable
if (sum(zj==1)) lj[zj==1] = rtruncnorm(sum(zj==1), mean = mu[zj==1], sd = 1, a = 0, b = Inf) # sample latent variable
# sample annotation effects using Gibbs sampler (similar to the SNP effect sampler in the individual-level model)
lj = lj - c(mu) # adjust the latent variable by all annotation effects
# sample intercept with a flat prior
oldSample = alpha[1,j]
rhs = sum(lj) + m*oldSample
invLhs = 1.0/m
ahat = invLhs*rhs
alpha[1,j] = rnorm(1, ahat, sqrt(invLhs))
lj = lj + (oldSample - alpha[1,j])
# sample each annotation effect with a normal prior
if (nAnno > 1) {
for (k in 2:nAnno) {
oldSample = alpha[k,j]
rhs = crossprod(anno[idx,k], lj) + annoDiagj[k]*oldSample;
invLhs = 1.0/(annoDiagj[k] + 1.0/sigmaSqAlpha[j])
ahat = invLhs*rhs
alpha[k,j] = rnorm(1, ahat, sqrt(invLhs))
lj = lj + anno[idx,k]*(oldSample - alpha[k,j])
}
# sample annotation effect variance from a scaled inverse chi-square distribution
sigmaSqAlpha[j] = (sum(alpha[-1,j]^2) + 2)/rchisq(1, nAnno-1+2)
}
}
}
alpha_mcmc[iter,] = c(alpha) # store MCMC samples of annotation effects
sigmaSqAlpha_mcmc[iter,] = sigmaSqAlpha
Given the sampled annotation effects, we can then compute the conditional probabilities for each SNP, which determine how likely the SNP is to belong to the larger-effect mixture components.
# Step 5. Compute per-SNP conditional probabilities from the annotation effects
p = apply(alpha, 2, function(x){pnorm(anno %*% x)})
Given the conditional probabilities, we can compute the joint probabilities (\(\pi\)) that each SNP belongs to each mixture component:
# Step 6. Compute the joint probabilities (pi) from the conditional probabilities (p)
for (k in 1:ndist) {
if (k < ndist) {
snpPi[,k] = 1.0 - p[,k]
} else {
snpPi[,k] = 1
}
if (k > 1) {
for (kk in 1:(k-1)) {
snpPi[,k] = snpPi[,k]*p[,kk]
}
}
}
#for example, we want
#snpPi[,1] = 1 - p[,1]
#snpPi[,2] = (1-p[,2])*p[,1]
#snpPi[,3] = (1-p[,3])*p[,1]*p[,2]
#snpPi[,4] = p[,1]*p[,2]*p[,3]
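A quick standalone check (with made-up conditional probabilities) confirms that joint probabilities constructed this way always sum to one across the mixture components for every SNP:

```r
set.seed(1)
m = 10; ndist = 4
p = matrix(runif(m*(ndist - 1)), m, ndist - 1)  # hypothetical conditional probabilities
snpPi = cbind(1 - p[, 1],                       # zero-effect component
              (1 - p[, 2])*p[, 1],
              (1 - p[, 3])*p[, 1]*p[, 2],
              p[, 1]*p[, 2]*p[, 3])             # largest-effect component
rowSums(snpPi)  # each row sums to exactly 1 (up to floating point)
```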
These \(\pi\) values are then used in the next iteration of SNP effect sampling, just like in SBayesR.
Due to extensive LD among SNPs, it’s often difficult to identify causal variants, especially those with smaller effects. For example, SNP 5 only has a PIP of about 0.5. Functional annotations can provide orthogonal information to LD, helping to better identify causal variants. Here, we demonstrate how this works in principle.
A simplified version of SBayesRC algorithm is implemented in
sbayesrc.R. It uses LD correlation matrix rather than
eigen-decomposition data as described in the original paper.
source("sbayesrc.R")
SBayesRC requires a table of SNP annotations. Suppose SNPs 1, 3, and 5 (where SNPs 1 and 5 are the causal variants) are non-synonymous variants. We can add a binary annotation for all SNPs indicating whether they are non-synonymous. For illustration purposes, this annotation is designed to be informative: it covers the two causal variants but also includes a null SNP, making the scenario slightly more realistic.
The annotation table has one row per SNP and one column per annotation (multiple annotations are allowed), plus a leading column of ones serving as the intercept of the generalised linear model that links annotations to SNP effects.
int = rep(1,10) # a vector of one as intercept
nonsynonymous = c(1,0,1,0,1,0,0,0,0,0) # whether the SNP is non-synonymous
anno = cbind(int, nonsynonymous)
print(anno)
## int nonsynonymous
## [1,] 1 1
## [2,] 1 0
## [3,] 1 1
## [4,] 1 0
## [5,] 1 1
## [6,] 1 0
## [7,] 1 0
## [8,] 1 0
## [9,] 1 0
## [10,] 1 0
We are now ready to run SBayesRC using this annotation matrix. Although we provide annotation information, no prior weights are assigned. The method learns the impact of annotations jointly with SNP effects from the data. This is a unified Bayesian approach, unlike stepwise approaches that estimate annotation enrichment before fitting the model.
res_sbayesrc = sbayesrc(b, se, nind, R, anno)
##
## iter 100, nnz = 3, sigmaSq = 3.505, h2 = 0.430, vare = 0.609
##
## iter 200, nnz = 2, sigmaSq = 8.816, h2 = 0.424, vare = 0.556
##
## iter 300, nnz = 4, sigmaSq = 1.370, h2 = 0.577, vare = 0.399
##
## iter 400, nnz = 5, sigmaSq = 1.531, h2 = 0.467, vare = 0.474
##
## iter 500, nnz = 1, sigmaSq = 8.006, h2 = 0.471, vare = 0.537
##
## iter 600, nnz = 2, sigmaSq = 8.698, h2 = 0.459, vare = 0.521
##
## iter 700, nnz = 2, sigmaSq = 3.711, h2 = 0.467, vare = 0.527
##
## iter 800, nnz = 3, sigmaSq = 1.367, h2 = 0.511, vare = 0.538
##
## iter 900, nnz = 2, sigmaSq = 1.345, h2 = 0.407, vare = 0.623
##
## iter 1000, nnz = 3, sigmaSq = 1.832, h2 = 0.428, vare = 0.557
##
## Posterior mean of model parameters:
## Pi1 Pi2 Pi3 Pi4 Nnz SigmaSq
## 0.697659448 0.089256972 0.213083580 0.003803306 2.901000000 2.912358373
## h2 Vare
## 0.496914173 0.500841968
##
## Annotation effects:
## p2 p3 p4
## int -1.247819 2.4375043 -8.2470279
## nonsynonymous 1.471441 0.1569215 -0.3325109
##
## Conditional probabilities:
## p2 p3 p4
## int 0.1060 0.9926 0
## nonsynonymous 0.5885 0.9953 0
##
## Joint probabilities:
## Pi1 Pi2 Pi3 Pi4
## int 0.8940 0.0008 0.1053 0
## nonsynonymous 0.4115 0.0028 0.5857 0
Let's have a look at the output. First, you may have noticed that Nnz (the number of nonzero effects) has substantially decreased compared to (S)BayesR, and is now close to the number of simulated causal variants. Second, the largest annotation effect is observed in the cell for p2 and nonsynonymous. This means the annotation has greatly increased the prior probability of a nonzero effect for the annotated SNPs. This can also be seen from the Conditional probabilities result, where the cell for p2 and nonsynonymous (0.59) is much larger than that for the intercept alone (0.11), indicating that an annotated SNP is far more likely to move out of the zero-effect distribution. From the Joint probabilities result, we can see that annotated SNPs have a much higher probability of being in the medium effect size distribution (Pi3).
Remark: These results suggest that the annotation is highly informative. It plays a crucial role in distinguishing causal variants from non-causal ones, significantly enhancing the model’s ability to identify true signals.
To evaluate this, let’s check the PIP of SNPs from SBayesRC.
delta_sbayesrc = (res_sbayesrc$beta != 0) # indicator variable for each SNP in each MCMC cycle
pip_sbayesrc = colMeans(delta_sbayesrc) # frequency of the indicator variable being one across MCMC cycles
plot(pip_sbayesrc, type="h", xlab="SNP", ylab="Posterior inclusion probability", main="SBayesRC")
Question: How would you interpret the result?
Answer: Both SNP 1 and 5 (the true causal variants) stand out with high PIPs, indicating strong posterior probability of causality, partly due to their functional annotations. Notably, SNP 5 did not show strong association evidence in the (S)BayesR analysis above, which does not incorporate annotations. Although SNP 3 shares the same annotation, it does not have a high PIP. This is because posterior probability reflects a combination of likelihood from GWAS and the prior information from functional annotation.
Let’s also check how incorporating annotations affects polygenic prediction.
beta_sbayesrc = colMeans(res_sbayesrc$beta)
ghat_sbayesrc = Xval %*% beta_sbayesrc
# prediction accuracy
print(summary(lm(yval ~ ghat_sbayesrc))$r.squared)
## [1] 0.5204186