We explore different external validation metrics to measure the quality of clustering results. These external indices measure the agreement between the predicted partition and the known partition.
About
There are many aspects of “rightness” for clustering. Broadly, there are two kinds of validity indices for measuring the quality of clustering results: external indices and internal indices. An external index measures the agreement between two partitions, where the first partition is the a priori known clustering structure and the second results from the clustering procedure. Internal indices measure the goodness of a clustering structure without any external information. In other words, an external index evaluates the result of a clustering algorithm against a known cluster structure (class labels) of the data set. Here, we look at several possible external validation metrics.
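To make the distinction concrete, here is a minimal sketch (toy data only, not part of the analysis below) computing one external index, which requires the true labels, and one internal index, which uses only the data and the predicted assignment. The variable names are hypothetical.

Code
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics as skmetrics

# Toy data with a known partition (y_true)
X_toy, y_true = make_blobs(n_samples = 300, centers = 3, random_state = 42)
y_pred = KMeans(n_clusters = 3, n_init = 10, random_state = 42).fit_predict(X_toy)

# External index: compares the predicted partition against the known partition
print ("Adjusted Rand index:", skmetrics.adjusted_rand_score(y_true, y_pred))

# Internal index: uses only the data and the predicted partition
print ("Silhouette score:", skmetrics.silhouette_score(X_toy, y_pred))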
Getting set up
Code
import numpy as np
import pandas as pd
from sklearn import model_selection as skmodel
from sklearn import metrics as skmetrics
import matplotlib.pyplot as plt
from pymir import mpl_stylesheet
from pymir import mpl_utils

mpl_stylesheet.banskt_presentation(splinecolor = 'black', dpi = 120, colors = 'kelly')
Code
data_dir = "../data"
beta_df_filename = f"{data_dir}/beta_df.pkl"
prec_df_filename = f"{data_dir}/prec_df.pkl"
se_df_filename = f"{data_dir}/se_df.pkl"
zscore_df_filename = f"{data_dir}/zscore_df.pkl"

'''
Data Frames for beta, precision, standard error and zscore.
'''
beta_df = pd.read_pickle(beta_df_filename)
prec_df = pd.read_pickle(prec_df_filename)
se_df = pd.read_pickle(se_df_filename)
zscore_df = pd.read_pickle(zscore_df_filename)

trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()
'''
X matrix (n_samples x n_features) -- obtained from Z-scores
'''
select_ids = beta_df.columns
X = np.array(zscore_df[select_ids]).T # contains NaN values
colmeans = np.nanmean(X, axis = 0, keepdims = True)
Xcent = X - colmeans # contains NaN values
Xcent = np.nan_to_num(Xcent, nan = 0) # replace NaN values with 0

'''
Y vector (n_samples) -- contains class labels
'''
labels = [phenotype_dict[x] for x in select_ids]
unique_labels = list(set(labels))
encoding = {x: i for i, x in enumerate(unique_labels)}
Ylabels = np.array([encoding[x] for x in labels])

print (f"We have {Xcent.shape[0]} samples (phenotypes) and {Xcent.shape[1]} features (variants)")
We have 69 samples (phenotypes) and 10068 features (variants)
Sample counts of input data
Code
sample_counts = {label: (Ylabels == idx).sum() for label, idx in encoding.items()}

print (f"Count Phenotype")
print (f"----- ---------")
for phenotype, count in sample_counts.items():
    print (f"{count}\t{phenotype}")
Code
from sklearn import model_selection as skmodel

'''
One-liner to split:
#   X_train, X_test, y_train, y_test = skmodel.train_test_split(X, Ylabels, test_size = 0.33)
but it does not return the index for the training and test data,
so I use a little more verbose solution
'''
itrain, itest = skmodel.train_test_split(np.arange(Ylabels.shape[0]), test_size = 0.33)
X_train = X[itrain, :]
X_test  = X[itest, :]
y_train = Ylabels[itrain]
y_test  = Ylabels[itest]

print (f"Train Test Phenotype")
print (f"----- ---- ---------")
for phenotype, idx in encoding.items():
    train_count = np.sum(y_train == idx)
    test_count  = np.sum(y_test == idx)
    print (f"{train_count}\t{test_count}\t{phenotype}")
We want to cluster the samples based on the Euclidean distance between them, obtained from the feature matrix. There are hundreds of algorithms to choose from, for example:

- Hierarchical clustering in its myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters
- PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center)
- Spectral clustering
- DBSCAN
- OPTICS
- HDBSCAN*
- Affinity Propagation
Available Software in Python:

- pyclustering for fast Python implementations of different algorithms. They have nice documentation and examples.
- sklearn.cluster
- HDBSCAN* provides very nice documentation for comparing different algorithms (albeit a bit biased, highlighting its own strengths).
- scipy.cluster provides the hierarchy module, which has functions for hierarchical and agglomerative clustering.
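As one concrete route among the options above, the sketch below uses the scipy.cluster.hierarchy module on the Euclidean distance matrix of the samples and cuts the dendrogram into a fixed number of flat clusters. The linkage method and cluster count are arbitrary choices for illustration, and this is not necessarily the procedure used to obtain the predictions evaluated later; a flat assignment like cluster_labels could play the role of the predicted partition in the external indices below.

Code
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Pairwise Euclidean distances between samples (rows of the centered feature matrix)
dist_condensed = pdist(Xcent, metric = 'euclidean')

# Agglomerative clustering with average linkage on the condensed distance matrix
Z = hierarchy.linkage(dist_condensed, method = 'average')

# Cut the dendrogram into k flat clusters (k set to the number of known classes, an arbitrary choice here)
k = len(unique_labels)
cluster_labels = hierarchy.fcluster(Z, t = k, criterion = 'maxclust')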
Since we have manually assigned class labels for the samples, we can use several external validation techniques to assess the quality or “correctness” of the clusters, for example the adjusted Rand index, adjusted mutual information, homogeneity / completeness / V-measure, and the Fowlkes-Mallows score.
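For reference, these scores are all available in sklearn.metrics and share the same calling convention. The snippet below is a minimal sketch, assuming the true labels Ylabels and a predicted assignment Y_pred produced by whichever clustering procedure is chosen.

Code
ari = skmetrics.adjusted_rand_score(Ylabels, Y_pred)
ami = skmetrics.adjusted_mutual_info_score(Ylabels, Y_pred)
fmi = skmetrics.fowlkes_mallows_score(Ylabels, Y_pred)

print (f"Adjusted Rand index:         {ari:.5f}")
print (f"Adjusted mutual information: {ami:.5f}")
print (f"Fowlkes-Mallows score:       {fmi:.5f}")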
Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:

- homogeneity: each cluster contains only members of a single class.
- completeness: all members of a given class are assigned to the same cluster.
These concepts are turned into the scores homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better). Their harmonic mean, called the V-measure, is computed by v_measure_score.
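As a toy illustration (hypothetical labels, not part of the analysis data), splitting every class into singleton clusters is perfectly homogeneous but not complete, and the V-measure equals the harmonic mean of the two scores:

Code
y_true = [0, 0, 1, 1]
y_over = [0, 1, 2, 3]   # each cluster is pure (homogeneous) but classes are split across clusters (not complete)

h = skmetrics.homogeneity_score(y_true, y_over)   # 1.0
c = skmetrics.completeness_score(y_true, y_over)  # < 1.0
v = skmetrics.v_measure_score(y_true, y_over)
print (h, c, v, 2 * h * c / (h + c))              # v equals the harmonic mean of h and c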
Note. v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset. This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:
homogeneity_score(a, b) == completeness_score(b, a)
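A quick check of this identity with arbitrary toy labels (hypothetical values, purely for illustration):

Code
a = [0, 0, 1, 1, 2]
b = [0, 1, 1, 2, 2]

# Swapping the arguments exchanges the roles of classes and clusters
print (skmetrics.homogeneity_score(a, b), skmetrics.completeness_score(b, a))  # identical values
# The V-measure is symmetric in its arguments
print (skmetrics.v_measure_score(a, b), skmetrics.v_measure_score(b, a))       # identical values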
Code
print (f" Homogeneity \tCompleteness \tV-Measure")hcv_random = skmetrics.homogeneity_completeness_v_measure(Ylabels, Y_random)hcv_pred = skmetrics.homogeneity_completeness_v_measure(Ylabels, Y_pred)print ("Random: "+' \t'.join([f"{x:.5f}"for x in hcv_random]))print ("Predicted: "+' \t'.join([f"{x:.5f}"for x in hcv_pred]))