Our goal is to create a structure plot from the PCA. We obtain low-rank approximations of the GWAS summary statistics using the three methods we have developed to date. We manually label each GWAS with a broad phenotype, perform PCA on the approximated low-rank matrix, and analyze the clustering of the data by comparing it with the curated phenotype labels.
Getting Set Up
Code
import numpy as np
import pandas as pd
import pickle
from scipy.stats import pearsonr

import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

from pymir import mpl_stylesheet
from pymir import mpl_utils
mpl_stylesheet.banskt_presentation(splinecolor = 'black', dpi = 120, colors = 'kelly')

from nnwmf.optimize import IALM
from nnwmf.optimize import FrankWolfe, FrankWolfe_CV
from nnwmf.utils import model_errors as merr

import sys
sys.path.append("../utils/")
import histogram as mpy_histogram
import simulate as mpy_simulate
import plot_functions as mpy_plotfn
Loading data
Code
data_dir = "../data"
beta_df_filename = f"{data_dir}/beta_df.pkl"
prec_df_filename = f"{data_dir}/prec_df.pkl"
se_df_filename = f"{data_dir}/se_df.pkl"
zscore_df_filename = f"{data_dir}/zscore_df.pkl"
snp_info_filename = f"{data_dir}/snp_info.pkl"

'''
Data frames for beta, precision, standard error and z-score.
'''
beta_df = pd.read_pickle(beta_df_filename)
prec_df = pd.read_pickle(prec_df_filename)
se_df = pd.read_pickle(se_df_filename)
zscore_df = pd.read_pickle(zscore_df_filename)
snp_info = pd.read_pickle(snp_info_filename)

trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()
Code
zscore_df
            AD_sumstats_Jansenetal_2019sept.txt.gz  CNCR_Insomnia_all  ...  pgcAN2.2019-07.vcf.txt.gz  pts_all_freeze2_overall.txt.gz
rs1000031                                 -0.999531          -0.327477  ...                  -1.573766                       -1.674269
rs1000269                                 -1.212805          -1.046310  ...                   0.681645                       -0.701776
rs10003281                                -0.813444           2.034345  ...                        NaN                       -1.379710
rs10004866                                 0.011252           1.327108  ...                  -0.964630                        0.851608
rs10005235                                 0.612540          -0.410609  ...                   0.996305                       -0.333356
...                                             ...                ...  ...                        ...                             ...
rs9989571                                  0.028306          -0.208891  ...                   1.469194                       -1.293122
rs9991694                                 -0.679790          -1.005571  ...                        NaN                        1.048454
rs9992763                                  0.691405          -0.010299  ...                  -0.503370                       -1.435277
rs9993607                                 -1.625392          -0.391585  ...                   0.240834                       -0.199740
rs999494                                  -0.303642           0.872613  ...                  -1.151316                        0.660120

[10068 rows × 69 columns]
X_nan = np.array(zscore_df).T
X_nan_cent = X_nan - np.nanmean(X_nan, axis = 0, keepdims = True)
X_nan_mask = np.isnan(X_nan)
X_cent = np.nan_to_num(X_nan_cent, copy = True, nan = 0.0)

print (f"We have {X_cent.shape[0]} samples (phenotypes) and {X_cent.shape[1]} features (variants)")
print (f"Fraction of NaN entries: {np.sum(X_nan_mask) / np.prod(X_cent.shape):.3f}")
We have 69 samples (phenotypes) and 10068 features (variants)
Fraction of NaN entries: 0.193
select_ids = zscore_df.columns
labels = [phenotype_dict[x] for x in select_ids]
unique_labels = list(set(labels))

nsample = X_cent.shape[0]
ntrait = len(unique_labels)
trait_indices = [np.array([i for i, x in enumerate(labels) if x == label]) for label in unique_labels]
trait_colors = {trait: color for trait, color in zip(unique_labels, (mpl_stylesheet.kelly_colors())[:ntrait])}
We perform PCA (using SVD) on the mean-centered raw input data and look at the proportion of variance explained by each principal component.
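As a minimal sketch of that computation (assuming X_cent from above; the actual plotting code is not shown in this section), the proportion of variance explained follows directly from the singular values:

# Sketch: proportion of variance explained (PVE) from the singular values
# of the mean-centered input matrix X_cent defined above.
U, S, Vt = np.linalg.svd(X_cent, full_matrices = False)
pve = np.square(S) / np.sum(np.square(S))

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(np.arange(1, pve.shape[0] + 1), pve, 'o-')
ax1.set_xlabel("Principal component")
ax1.set_ylabel("Proportion of variance explained")
plt.show()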
Low-rank matrices
We ran three different methods to obtain low-rank matrices, namely IALM, NNM, and NNM-Sparse. Here, we load the results from those methods.
Code
mf_methods = ['ialm', 'nnm', 'nnm_sparse']
lowrank_X = dict()
for method in mf_methods:
    with open (f"{data_dir}/lowrank_X_{method}.pkl", 'rb') as handle:
        lowrank_X[method] = pickle.load(handle)
Principal components
Suppose we decompose \mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^{\intercal}. The columns of \mathbf{V} are the principal axes (aka principal directions, aka eigenvectors). The principal components are the columns of \mathbf{U}\mathbf{S} – the projections of the data onto the principal axes (note \mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{S}).
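As a quick check (a sketch, not part of the original notebook), numpy's SVD reproduces this identity on the centered data:

# Sketch: the projections X V equal U S (up to floating-point error).
U, S, Vt = np.linalg.svd(X_cent, full_matrices = False)
pcomps = U @ np.diag(S)  # principal components
assert np.allclose(X_cent @ Vt.T, pcomps)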
In Figure 1, we look at the eigenvalues obtained from PCA of the low rank matrices. The first plot on the left shows the PCA on the raw data, without noise removal. It is evident that a truncated SVD (as shown below) will not be able to capture the variance in the data.
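For reference, a truncated SVD keeps only the top k components; a minimal sketch (with k chosen arbitrarily for illustration):

# Sketch: rank-k truncated SVD of the centered input matrix.
k = 10  # hypothetical target rank, for illustration only
U, S, Vt = np.linalg.svd(X_cent, full_matrices = False)
X_trunc = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(f"Variance captured by the rank-{k} approximation: {np.sum(np.square(S[:k])) / np.sum(np.square(S)):.3f}")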
Contribution of principal components to phenotypes
We look at the contribution of the loadings to the phenotypes using a structure plot. The importance of a principal component is reflected by its inertia, that is, by the proportion of the total inertia "explained" by that component. We can calculate either the contribution of an observation to a component or the contribution of a component to an observation.
Suppose the principal components are \mathbf{F} = \mathbf{U}\mathbf{S}. Then the squared cosine shows the importance of a component for a given observation:

\cos_{i,l}^2 = \frac{f_{i,l}^2}{\sum_{l} f_{i,l}^2}

Intuitively, this factor is the square of the cosine of the angle in the right triangle formed by the origin, the observation, and its projection onto the component.
def get_cos2_scores(pcomps):
    ntrait, npcomp = pcomps.shape
    x = np.zeros((ntrait, npcomp))
    for i in range(ntrait):
        cos2_trait = np.array([np.square(pcomps[i, pcidx]) for pcidx in range(npcomp)])
        x[i, :] = cos2_trait / np.sum(cos2_trait)
    return x


def stacked_barplot(ax, data, xlabels, colors, bar_width = 1.0, alpha = 1.0, showxlabels = False):
    '''
    Parameters
    ----------
    data : dict() of scores.
        - <key>   : items for the stacked bars (e.g. traits or components)
        - <value> : list of scores for the items.
                    All dict entries must have the same length of <value>.
    xlabels : label for each entry in the data <value> list.
        Must be of the same length as data <value>.
    colors : dict(<key>, <color>) corresponding to each data <key>.
    '''
    indices = np.arange(len(xlabels))
    bottom = np.zeros(len(xlabels))
    for item, weights in data.items():
        ax.bar(indices, weights, bar_width, label = item, bottom = bottom, color = colors[item], alpha = alpha)
        bottom += weights
    if showxlabels:
        ax.set_xticks(indices)
        ax.set_xticklabels(xlabels, rotation = 90, ha = 'center')
        ax.tick_params(bottom = True, top = False, left = False, right = False,
                       labelbottom = True, labeltop = False, labelleft = False, labelright = False)
    else:
        ax.tick_params(bottom = False, top = False, left = False, right = False,
                       labelbottom = False, labeltop = False, labelleft = False, labelright = False)
    for side, border in ax.spines.items():
        border.set_visible(False)
    return


def structure_plot(ax, pcomps, trait_labels, comp_colors, npcomp, showxlabels = False):
    cos2_scores = get_cos2_scores(pcomps)[:, :npcomp]
    cos2_plot_data = {f"{i+1}" : cos2_scores[:, i] for i in range(npcomp)}
    stacked_barplot(ax, cos2_plot_data, trait_labels, comp_colors, alpha = 0.8, showxlabels = showxlabels)
    return
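As a sanity check (a sketch using the functions above), the squared cosines of each trait should sum to one across all components:

# Sketch: each row of the cos2 matrix sums to one.
U, S, Vt = np.linalg.svd(X_cent, full_matrices = False)
cos2 = get_cos2_scores(U @ np.diag(S))
assert np.allclose(np.sum(cos2, axis = 1), 1.0)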
In Figure 2, we look at the contribution of the top 10 principal components to the disease phenotypes.
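A sketch of how such a figure could be produced with the helpers above, assuming the principal components of each low-rank matrix come from its SVD (whether the low-rank matrices need re-centering is not addressed here), with illustrative component colors:

# Sketch: structure plots of the top 10 PCs for each low-rank method.
npcomp = 10
comp_colors = {f"{i+1}": color for i, color in enumerate(mpl_stylesheet.kelly_colors()[:npcomp])}

fig = plt.figure(figsize = (16, 12))
for i, method in enumerate(mf_methods):
    ax = fig.add_subplot(len(mf_methods), 1, i + 1)
    U, S, Vt = np.linalg.svd(lowrank_X[method], full_matrices = False)
    structure_plot(ax, U @ np.diag(S), labels, comp_colors, npcomp,
                   showxlabels = (i == len(mf_methods) - 1))
    ax.set_title(method)
plt.tight_layout()
plt.show()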
'''
Calculate the correlations and pvals, if required.
This takes time, so the pre-computed results can also be loaded from saved files.
'''
corrs = dict()
pvals = dict()

# NOTE: plot_methods is not defined in this section; it is presumably
# set elsewhere in the notebook (e.g. a subset of mf_methods).
# for m in plot_methods:
#     corrs[m], pvals[m] = get_corrmat(pcomps[m], X_cent, ncomp = 20)
#     with open (f"{data_dir}/loading_corr_{m}.pkl", 'wb') as handle:
#         pickle.dump(corrs[m], handle, protocol = pickle.HIGHEST_PROTOCOL)
#     with open (f"{data_dir}/loading_corr_pval_{m}.pkl", 'wb') as handle:
#         pickle.dump(pvals[m], handle, protocol = pickle.HIGHEST_PROTOCOL)

for m in plot_methods:
    with open (f"{data_dir}/loading_corr_{m}.pkl", 'rb') as handle:
        corrs[m] = pickle.load(handle)
    with open (f"{data_dir}/loading_corr_pval_{m}.pkl", 'rb') as handle:
        pvals[m] = pickle.load(handle)
def corr_to_manhattan_data(corr_data, pcidx):
    data = {i+1: dict() for i in range(22)}
    for i, val in enumerate(corr_data[:, pcidx]):
        rsid = rsid_list[i]
        chrm = int(snp_info_dict[rsid]['CHR'])
        bppos = snp_info_dict[rsid]['BP']
        data[chrm][bppos] = np.square(val)
    return data


def pval_to_manhattan_data(pval_data, pcidx):
    data = {i+1: dict() for i in range(22)}
    for i, val in enumerate(pval_data[:, pcidx]):
        rsid = rsid_list[i]
        chrm = int(snp_info_dict[rsid]['CHR'])
        bppos = snp_info_dict[rsid]['BP']
        data[chrm][bppos] = -np.log10(val)
    return data


def plot_manhattan(ax, data, ylabel, showx = True):
    start = 0
    end = 0
    xtickposlist = list()
    offcolors = ['darkorange', 'firebrick']
    for chrm, cvals in data.items():
        end = start + snp_tot[chrm]
        xtickposlist.append(int((start + end) / 2))
        x = [start + bp for bp in list(cvals.keys())]
        y = list(cvals.values())
        ax.scatter(x, y, color = offcolors[chrm % 2], s = 3, alpha = 0.8)
        start = end
    ax.set_xlim(0, end)
    #ax.set_ylim(0, 1)
    #ax.plot([0, lastbp], [log10cutoff, log10cutoff], ls = 'dashed', color = 'gainsboro', lw = 2)
    #ax.text(0.05, 1.0, tname, transform = ax.transAxes, ha = 'left', va = 'top')
    for side, border in ax.spines.items():
        if side == 'top':
            border.set_visible(False)
    ax.set_xticks(xtickposlist)
    ax.tick_params(axis = 'x', labelsize = 16)
    ax.set_ylabel(ylabel)
    if showx:
        ax.set_xticklabels(["{:d}".format(x) for x in data.keys()], rotation = 90)
    else:
        ax.set_xticklabels(["" for x in data.keys()])
    return
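The helpers above depend on rsid_list, snp_info_dict, and snp_tot, which are defined elsewhere in the notebook. A minimal sketch of how they could be built, assuming snp_info carries 'ID', 'CHR', and 'BP' columns (hypothetical names):

# Sketch: look-up structures used by the Manhattan helpers above.
rsid_list = list(zscore_df.index)
snp_info_dict = snp_info.set_index('ID')[['CHR', 'BP']].to_dict(orient = 'index')
# Per-chromosome offsets: the largest base-pair position observed.
chrm_of_snp = snp_info['CHR'].astype(int)
snp_tot = {chrm: int(snp_info['BP'][chrm_of_snp == chrm].max()) for chrm in range(1, 23)}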
In Figure 3, we show the Manhattan plot for the association of each variant with the principal components. We calculate Pearson's correlation coefficient between each variant's z-scores and each principal component, and use the -\log_{10}(p) value of the correlation as its significance.
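The helper get_corrmat, referenced in the commented block above, is not shown in this section; here is a minimal sketch consistent with that description (the name and signature are taken from the commented call, the body is an assumption):

# Sketch: Pearson correlation (and p-value) of each variant's z-scores
# with each of the top `ncomp` principal components.
def get_corrmat(pcomps, X, ncomp = 20):
    nsnp = X.shape[1]
    corrmat = np.zeros((nsnp, ncomp))
    pvalmat = np.zeros((nsnp, ncomp))
    for pcidx in range(ncomp):
        for snpidx in range(nsnp):
            r, p = pearsonr(X[:, snpidx], pcomps[:, pcidx])
            corrmat[snpidx, pcidx] = r
            pvalmat[snpidx, pcidx] = p
    return corrmat, pvalmat

Looping over ~10,000 variants this way is slow, which is why the notebook caches the results to pickle files instead of recomputing them.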