PCA of NPD summary statistics

Author

Saikat Banerjee

Published

May 16, 2023

Abstract
We apply PCA (and weighted PCA) to our input data and plot the principal components, labeled by broad phenotype category.

About

For multi-trait analysis of GWAS summary statistics, we look at one of the simplest models, principal component analysis (PCA). The goal is to characterize the latent components of genetic associations. We apply PCA to the matrix of summary statistics derived from GWAS across 80 NPD phenotypes from various sources, namely GTEx, OpenGWAS and PGC. For a similar comprehensive study with the UK Biobank data, see Tanigawa et al., Nat. Comm. 2019.

The matrix $\mathbf{X}$ of size $N \times P$ for the PCA is characterized by $N$ phenotypes (samples) and $P$ variants (features).
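Concretely, after centering the columns of $\mathbf{X}$, the principal component scores plotted below are obtained from the singular value decomposition

$$\mathbf{X}_{\mathrm{cent}} = \mathbf{U}\,\mathbf{S}\,\mathbf{V}^{\mathsf{T}}, \qquad \text{scores} = \mathbf{U}\,\mathbf{S} = \mathbf{X}_{\mathrm{cent}}\,\mathbf{V},$$

where each row of the score matrix corresponds to one phenotype and the fraction of variance explained by component $k$ is $s_k^2 / \sum_j s_j^2$.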

Code
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from pymir import mpl_stylesheet
from pymir import mpl_utils

from sklearn.decomposition import PCA

mpl_stylesheet.banskt_presentation(splinecolor = 'black', dpi = 120, colors = 'kelly')

Read summary statistics

Code
data_dir = "../data"
beta_df_filename   = f"{data_dir}/beta_df.pkl"
prec_df_filename   = f"{data_dir}/prec_df.pkl"
se_df_filename     = f"{data_dir}/se_df.pkl"
zscore_df_filename = f"{data_dir}/zscore_df.pkl"

beta_df   = pd.read_pickle(beta_df_filename)
prec_df   = pd.read_pickle(prec_df_filename)
se_df     = pd.read_pickle(se_df_filename)
zscore_df = pd.read_pickle(zscore_df_filename)

trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()

After filtering, there are 10068 variants for 69 phenotypes, summarized below.

beta_df
[beta_df preview: effect-size (beta) estimates for 10068 variants (rows indexed by rsID, rs1000031 … rs999494) across 69 phenotype studies (columns such as AD_sumstats_Jansenetal_2019sept.txt.gz, CNCR_Insomnia_all, IGAP_Alzheimer, …, pgc-bip2021-all.vcf.txt.gz, pgc.scz2, pts_all_freeze2_overall.txt.gz); variants missing in a study are NaN.]

10068 rows × 69 columns

select_ids = beta_df.columns

PCA

We use the z-score as the observation statistic for each phenotype and variant. For PCA, we center the columns of the $\mathbf{X}$ matrix.

Code
trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()
#phenotype_dict
Code
X = np.array(zscore_df.replace(np.nan, 0)[select_ids]).T
colmeans = np.mean(X, axis = 0, keepdims = True)
Xcent = X - colmeans

labels = [phenotype_dict[x] for x in select_ids]
unique_labels = list(set(labels))

print (f"We have {Xcent.shape[0]} samples (phenotypes) and {Xcent.shape[1]} features (variants)")
We have 69 samples (phenotypes) and 10068 features (variants)
Code
ncomp = min(Xcent.shape)
pca = PCA(n_components = ncomp)
beta_pcs = pca.fit_transform(Xcent)   # PC scores, one row per phenotype
beta_eig = pca.components_            # loadings, one row per component (over variants)

The variance explained by each principal component is shown in Figure 1.

Code
fig = plt.figure(figsize = (14, 6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

#ax1.scatter(np.arange(ncomp), pca.explained_variance_ratio_, s = 100, alpha = 0.7)
ax1.plot(np.arange(ncomp), np.cumsum(pca.explained_variance_ratio_), marker = 'o')
ax1.set_ylabel("Fraction of variance explained")
ax1.set_xlabel("Principal Components")

ax2.plot(np.arange(ncomp), pca.explained_variance_ratio_, marker = 'o')
ax2.set_ylabel("Variance explained by each component")
ax2.set_xlabel("Principal Components")

plt.tight_layout(w_pad = 3)
plt.show()
Figure 1: Variance explained by the Principal Components

We plot the first 6 principal components against each other in Figure 2. Each dot is a sample (phenotype), colored by its broad NPD category.

Code
plot_ncomp = 6
subplot_h = 2.0

nrow = plot_ncomp - 1
ncol = plot_ncomp - 1
figw = ncol * subplot_h + (ncol - 1) * 0.3 + 1.2
figh = nrow * subplot_h + (nrow - 1) * 0.3 + 1.5
bgcolor = '#F0F0F0'


def make_plot_principal_components(ax, i, j, comp):
    pc1 = comp[:, j]
    pc2 = comp[:, i]
    for label in unique_labels:
        idx = np.array([k for k, x in enumerate(labels) if x == label])
        ax.scatter(pc1[idx], pc2[idx], s = 30, alpha = 0.8, label = label)
    return

fig = plt.figure(figsize = (figw, figh))
axmain = fig.add_subplot(111)

for i in range(1, nrow + 1):
    for j in range(ncol):
        ax = fig.add_subplot(nrow, ncol, ((i - 1) * ncol) + j + 1)
        
        ax.tick_params(bottom = False, top = False, left = False, right = False,
                       labelbottom = False, labeltop = False, labelleft = False, labelright = False)
        if j == 0: ax.set_ylabel(f"PC{i + 1}")
        if i == nrow: ax.set_xlabel(f"PC{j + 1}")
        if i > j:
            ax.patch.set_facecolor(bgcolor)
            ax.patch.set_alpha(0.3)
            make_plot_principal_components(ax, i, j, beta_pcs)
            for side, border in ax.spines.items():
                border.set_color(bgcolor)
        else:
            ax.patch.set_alpha(0.)
            for side, border in ax.spines.items():
                border.set_visible(False)
                
        if i == 1 and j == 0:
            mhandles, mlabels = ax.get_legend_handles_labels()

axmain.tick_params(bottom = False, top = False, left = False, right = False,
                   labelbottom = False, labeltop = False, labelleft = False, labelright = False)
for side, border in axmain.spines.items():
    border.set_visible(False)
axmain.legend(handles = mhandles, labels = mlabels, loc = 'upper right', bbox_to_anchor = (0.9, 0.9))
          
        
plt.tight_layout()
plt.show()
Figure 2: First 6 principal components compared against each other.

PCA using SVD

Just to make sure that I understand things correctly, I redo the PCA using SVD. It should yield the same results as sklearn's PCA, up to sign flips of individual components. Indeed, it does, as shown in Figure 3.

Code
U, S, Vt = np.linalg.svd(Xcent, full_matrices=False)
svd_pcs = U @ np.diag(S)
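As a minimal sanity check (a sketch, assuming `beta_pcs` from the sklearn fit above is still in scope), the SVD scores should match the sklearn scores component-wise, up to a sign flip of individual components:

Code
# Both decompositions act on the same centered matrix, so the scores
# should agree component-wise up to sign.
for k in range(5):
    r = np.corrcoef(svd_pcs[:, k], beta_pcs[:, k])[0, 1]
    print(f"PC{k + 1}: |correlation with sklearn PCA| = {abs(r):.6f}")   # expect ~1.0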
Code
idx1 = 0
idx2 = 1
svd_pc1 = svd_pcs[:, idx1]
svd_pc2 = svd_pcs[:, idx2]

fig = plt.figure()
ax1 = fig.add_subplot(111)

for label in unique_labels:
    idx = np.array([i for i, x in enumerate(labels) if x == label])
    ax1.scatter(svd_pc1[idx], svd_pc2[idx], s = 100, alpha = 0.7, label = label)
    
# idxs = np.array([i for i, x in enumerate(labels) if x == 'Sleep'])
# for idx in idxs:
#     pid = select_ids[idx]
#     ax1.annotate(pid, (svd_pc1[idx], svd_pc2[idx]))

    
ax1.legend(bbox_to_anchor=(1.04, 1), loc="upper left")
ax1.set_xlabel(f"Component {idx1}")
ax1.set_ylabel(f"Component {idx2}")
plt.show()
Figure 3: Check PCA using SVD.

Weighted PCA

Here, I implement the weighted covariance eigendecomposition approach proposed by L. Delchambre (2014). I use three different versions of the weight matrix:

  • the identity matrix (to check that the implementation is correct; see the sanity check after the function definition below),
  • the precision matrix (as suggested by David in the project proposal),
  • the square root of the precision matrix (an ad hoc choice to soften the weight on each observation).
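For reference, the quantity eigendecomposed by the function below is the weighted covariance of the features (variants) in Delchambre's formulation: with weights $w_{ni}$ and centered observations $x_{ni}$ (samples indexed by $n$),

$$C_{ij} = \frac{\sum_n w_{ni}\, w_{nj}\, x_{ni}\, x_{nj}}{\sum_n w_{ni}\, w_{nj}},$$

and the score of each sample is then obtained by solving a small weighted least-squares problem against the top eigenvectors.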
Code
# Three choices of the weight matrix W; uncomment the one to use.
# W = np.ones(X.shape)                                               # identity weights (sanity check)
# W = np.array(prec_df.replace(np.nan, 0)[select_ids]).T             # precision
W = np.sqrt(np.array(prec_df.replace(np.nan, 0)[select_ids]).T)      # square root of precision
Code
def weighted_mean(x, w=None, axis=None):
    """Compute the weighted mean along the given axis

    The result is equivalent to (x * w).sum(axis) / w.sum(axis),
    but large temporary arrays are not created.

    Parameters
    ----------
    x : array_like
        data for which mean is computed
    w : array_like (optional)
        weights corresponding to each data point. If supplied, it must be the
        same shape as x
    axis : int or None (optional)
        axis along which mean should be computed

    Returns
    -------
    mean : np.ndarray
        array representing the weighted mean along the given axis
    """
    if w is None:
        return np.mean(x, axis)

    x = np.asarray(x)
    w = np.asarray(w)

    if x.shape != w.shape:
        raise NotImplementedError("Broadcasting is not implemented: "
                                  "x and w must be the same shape.")

    if axis is None:
        wx_sum = np.einsum('i,i', np.ravel(x), np.ravel(w))
    else:
        try:
            axis = tuple(axis)
        except TypeError:
            axis = (axis,)

        if len(axis) != len(set(axis)):
            raise ValueError("duplicate value in 'axis'")

        trans = sorted(set(range(x.ndim)).difference(axis)) + list(axis)
        operand = "...{0},...{0}".format(''.join(chr(ord('i') + i)
                                                 for i in range(len(axis))))
        wx_sum = np.einsum(operand,
                           np.transpose(x, trans),
                           np.transpose(w, trans))

    return wx_sum / np.sum(w, axis)

def weighted_pca_delchambre(X, W, n_components = None, regularization = None):

    import scipy as sp

    if n_components is None:
        n_components = min(X.shape)

    # Center each feature (column) at its weighted mean
    xmean = weighted_mean(X, W, axis = 0).reshape(1, -1)
    _X = X - xmean

    # Weighted covariance of the features (Delchambre 2014)
    _XW = _X * W
    _covar = np.dot(_XW.T, _XW)
    _covar /= np.dot(W.T, W)
    _covar[np.isnan(_covar)] = 0

    # Top n_components eigenpairs of the weighted covariance
    eigvals = (_X.shape[1] - n_components, _X.shape[1] - 1)
    evals, evecs = sp.linalg.eigh(_covar, subset_by_index = eigvals)

    components = evecs[:, ::-1].T          # sort by decreasing eigenvalue
    explained_variance = evals[::-1]

    # Scores: solve a small weighted least-squares problem for each sample
    Y = np.zeros((_X.shape[0], components.shape[0]))
    for i in range(_X.shape[0]):
        cW = components * W[i]
        cWX = np.dot(cW, _X[i])
        cWc = np.dot(cW, cW.T)
        if regularization is not None:
            cWc += np.diag(regularization / explained_variance)
        Y[i] = np.linalg.solve(cWc, cWX)
    return Y, explained_variance
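As a minimal sanity check of the implementation (the identity-weight case in the list above, written as a sketch rather than part of the main analysis), running the routine with an all-ones weight matrix should reproduce the ordinary PCA scores up to sign flips:

Code
# With W = 1 everywhere, the weighted covariance reduces to the ordinary
# covariance, so the scores should match the (unweighted) PCA scores up to sign.
ones_pcs, _ = weighted_pca_delchambre(Xcent, np.ones(Xcent.shape), n_components = 20)
for k in range(5):
    r = np.corrcoef(ones_pcs[:, k], beta_pcs[:, k])[0, 1]
    print(f"PC{k + 1}: |correlation with PCA| = {abs(r):.6f}")   # expect ~1.0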
Code
weighted_pcs, explained_variance = weighted_pca_delchambre(Xcent, W, n_components = 20, regularization = None)

As before, the variance explained by each principal component is shown in Figure 4.

Code
# Note: the ratios below are relative to the top 20 components only,
# since the remaining eigenvalues of the weighted covariance are not computed.
wpca_tot_explained_variance = np.sum(explained_variance)
wpca_explained_variance_ratio = explained_variance / wpca_tot_explained_variance

fig = plt.figure(figsize = (14, 6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.plot(np.arange(20), np.cumsum(wpca_explained_variance_ratio), marker = 'o')
ax1.set_ylabel("Fraction of variance explained")
ax1.set_xlabel("Principal Components")

ax2.plot(np.arange(20), wpca_explained_variance_ratio, marker = 'o')
ax2.set_ylabel("Variance explained by each component")
ax2.set_xlabel("Principal Components")

plt.tight_layout(w_pad = 3)
plt.show()
Figure 4: Variance explained by the weighted Principal Components

We plot the first 6 weighted principal components against each other in Figure 5. Each dot is a sample (phenotype), colored by its broad NPD category.

Code
plot_ncomp = 6
subplot_h = 2.0

nrow = plot_ncomp - 1
ncol = plot_ncomp - 1
figw = ncol * subplot_h + (ncol - 1) * 0.3 + 1.2
figh = nrow * subplot_h + (nrow - 1) * 0.3 + 1.5
bgcolor = '#F0F0F0'


def make_plot_principal_components(ax, i, j, comp):
    pc1 = comp[:, j]
    pc2 = comp[:, i]
    for label in unique_labels:
        idx = np.array([k for k, x in enumerate(labels) if x == label])
        ax.scatter(pc1[idx], pc2[idx], s = 30, alpha = 0.8, label = label)
    return

fig = plt.figure(figsize = (figw, figh))
axmain = fig.add_subplot(111)

for i in range(1, nrow + 1):
    for j in range(ncol):
        ax = fig.add_subplot(nrow, ncol, ((i - 1) * ncol) + j + 1)
        
        ax.tick_params(bottom = False, top = False, left = False, right = False,
                       labelbottom = False, labeltop = False, labelleft = False, labelright = False)
        if j == 0: ax.set_ylabel(f"PC{i + 1}")
        if i == nrow: ax.set_xlabel(f"PC{j + 1}")
        if i > j:
            ax.patch.set_facecolor(bgcolor)
            ax.patch.set_alpha(0.3)
            make_plot_principal_components(ax, i, j, weighted_pcs)
            for side, border in ax.spines.items():
                border.set_color(bgcolor)
        else:
            ax.patch.set_alpha(0.)
            for side, border in ax.spines.items():
                border.set_visible(False)
                
        if i == 1 and j == 0:
            mhandles, mlabels = ax.get_legend_handles_labels()

axmain.tick_params(bottom = False, top = False, left = False, right = False,
                   labelbottom = False, labeltop = False, labelleft = False, labelright = False)
for side, border in axmain.spines.items():
    border.set_visible(False)
axmain.legend(handles = mhandles, labels = mlabels, loc = 'upper right', bbox_to_anchor = (0.9, 0.9))
          
        
plt.tight_layout()
plt.show()
Figure 5: First 6 weighted principal components compared against each other.

Further Reading