Metrices for evaluating clusters given true labels

Author

Saikat Banerjee

Published

June 23, 2023

Abstract

We explore different external validation metrics to measure the quality of clustering results. These external indices measure the agreement between the predicted partition and the known partition.

About

There are many aspects of “rightness” for clustering. Broadly, there are two kinds of validity indices to measure the quality of clustering results: external indices and internal indices. An external index is a measure of agreement between two partitions where the first partition is the a priori known clustering structure, and the second results from the clustering procedure. Internal indices are used to measure the goodness of a clustering structure without external information. For external indices, we evaluate the results of a clustering algorithm based on a known cluster structure of a data set (or cluster labels). Here, we look at several possible external validation metrics.

Getting set up

Code

import numpy as np
import pandas as pd

from sklearn import model_selection as skmodel
from sklearn import metrics as skmetrics

import matplotlib.pyplot as plt
from pymir import mpl_stylesheet
from pymir import mpl_utils

mpl_stylesheet.banskt_presentation(splinecolor = 'black', dpi = 120, colors = 'kelly')

Code

data_dir = "../data"
beta_df_filename   = f"{data_dir}/beta_df.pkl"
prec_df_filename   = f"{data_dir}/prec_df.pkl"
se_df_filename     = f"{data_dir}/se_df.pkl"
zscore_df_filename = f"{data_dir}/zscore_df.pkl"

'''
Data Frames for beta, precision, standard error and zscore.
'''

beta_df   = pd.read_pickle(beta_df_filename)
prec_df   = pd.read_pickle(prec_df_filename)
se_df     = pd.read_pickle(se_df_filename)
zscore_df = pd.read_pickle(zscore_df_filename)

trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()

'''
X matrix (n_samples x n_features) -- obtain from Z-scores
'''
select_ids = beta_df.columns
X = np.array(zscore_df[select_ids]).T # contain NaN values
colmeans = np.nanmean(X, axis = 0, keepdims = True)
Xcent = X - colmeans # contain NaN values
Xcent = np.nan_to_num(X, nan=0) # remove NaN values

'''
Y vector (n_samples) -- contain class labels
'''

labels = [phenotype_dict[x] for x in select_ids]
unique_labels = list(set(labels))
encoding = {x:i for i, x in enumerate(unique_labels)}
Ylabels = np.array([encoding[x] for x in labels])

print (f"We have {Xcent.shape[0]} samples (phenotypes) and {Xcent.shape[1]} features (variants)")

We have 69 samples (phenotypes) and 10068 features (variants)

Sample counts of input data

Code

sample_counts = {label : (Ylabels == idx).sum() for label, idx in encoding.items()}
print (f"Count   Phenotype")
print (f"-----   ---------")
for phenotype, count in sample_counts.items():
    print (f"{count}\t{phenotype}")

Count   Phenotype
-----   ---------
7   Sleep
3   SZ
2   ASD
2   Migraine
2   ADHD
1   OCD
8   Depression
2   Intel/education
11  Epilepsy
10  Other psych
7   Cognition
6   BD
8   Neurodegenerative

Split into training and test data

Code

from sklearn import model_selection as skmodel

'''
One-liner to split:
# X_train, X_test, y_train, y_test = skmodel.train_test_split(X, Ylabels, test_size = 0.33)
but it does not return the index for the training and test data,
so I use a little more verbose solution
'''
itrain, itest = skmodel.train_test_split(np.arange(Ylabels.shape[0]), test_size = 0.33)
X_train = X[itrain, :]
X_test  = X[itest, :]
y_train = Ylabels[itrain]
y_test  = Ylabels[itest]

print (f"Train   Test    Phenotype")
print (f"-----   ----    ---------")
for phenotype, idx in encoding.items():
    train_count = np.sum(y_train == idx)
    test_count  = np.sum(y_test  == idx)
    print (f"{train_count}\t{test_count}\t{phenotype}")

Train   Test    Phenotype
-----   ----    ---------
7   0   Sleep
2   1   SZ
0   2   ASD
1   1   Migraine
1   1   ADHD
0   1   OCD
5   3   Depression
1   1   Intel/education
9   2   Epilepsy
5   5   Other psych
4   3   Cognition
6   0   BD
5   3   Neurodegenerative

Clustering from distance matrix

We want to cluster the samples based on the Euclidean distance between them, obtained from the feature matrix. There are hundreds of algorithms to choose from, for example: - Hierarchical clustering in it’s myriad of variants. Cut the dendrogram as desired, e.g., to get k clusters - PAM, the closest match to k-means on a distance matrix (minimizes the average distance from the cluster center) - Spectral clustering - DBSCAN - OPTICS - HDBSCAN* - Affinity Propagation

Available Software in Python: - pyclustering for fast Python implementation of different algorithms. They have nice documentation and examples. - sklearn.cluster - HDBSCAN* provides a very nice documentation for comparing different algorithms (albeit a bit biased, highlighting their own strength). - scipy.cluster provides the hierarchy module which has functions for hierarchical and agglomerative clustering.

Code

distance_matrix = skmetrics.pairwise.pairwise_distances(Xcent, metric='euclidean')

Code

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters = len(unique_labels), linkage = 'average', metric = 'precomputed')
Y_pred = model.fit_predict(distance_matrix)
#km = KMeans(n_clusters = len(unique_labels), random_state = 0, n_init="auto")
#km.fit(Xcent)
#Y_pred = km.labels_

Code

Y_random = np.random.choice(len(unique_labels), size=Ylabels.shape[0], replace=True)

Comparison Metrics

We can use several external validation techniques to assess the quality or “correctness” of the clusters since we have manually assigned the cluster labels. For example, we can use adjusted rand index, adjusted mutual information, homogeneity/completeness/v-measure, Fowlkes-Mallows score.

Adjusted Rand Index

Code

print (f"Random: {skmetrics.adjusted_rand_score(Ylabels, Y_random):.5f}")
print (f"Predicted: {skmetrics.adjusted_rand_score(Ylabels, Y_pred):.5f}")

Random: -0.01573
Predicted: 0.15229

Adjusted Mutual Information

Code

print (f"Random: {skmetrics.adjusted_mutual_info_score(Ylabels, Y_random):.5f}")
print (f"Predicted: {skmetrics.adjusted_mutual_info_score(Ylabels, Y_pred):.5f}")

Random: -0.02511
Predicted: 0.36902

Homogeneity and V-measure

Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment: - homogeneity: each cluster contains only members of a single class. - completeness: all members of a given class are assigned to the same cluster.

We turn those concept as scores homogeneity_score and completeness_score. Both are bounded below by 0.0 and above by 1.0 (higher is better). Their harmonic mean called V-measure is computed by v_measure_score.

Note. v_measure_score is symmetric: it can be used to evaluate the agreement of two independent assignments on the same dataset. This is not the case for completeness_score and homogeneity_score: both are bound by the relationship:

homogeneity_score(a, b) == completeness_score(b, a)

Code

print (f"        Homogeneity \tCompleteness \tV-Measure")

hcv_random = skmetrics.homogeneity_completeness_v_measure(Ylabels, Y_random)
hcv_pred   = skmetrics.homogeneity_completeness_v_measure(Ylabels, Y_pred)

print ("Random:    " + ' \t'.join([f"{x:.5f}" for x in hcv_random]))
print ("Predicted: " + ' \t'.join([f"{x:.5f}" for x in hcv_pred]))

        Homogeneity     Completeness    V-Measure
Random:    0.36793  0.35036     0.35893
Predicted: 0.47942  0.70042     0.56922

Fowlkes-Mallows scores

FMI is defined as the geometric mean of the pairwise precision and recall.

Code

print (f"Random: {skmetrics.fowlkes_mallows_score(Ylabels, Y_random):.5f}")
print (f"Random: {skmetrics.fowlkes_mallows_score(Ylabels, Y_pred):.5f}")

Random: 0.07035
Random: 0.34056

Does distance matrix from truncated SVD improve score?

Code

K = 20

U, S, Vt = np.linalg.svd(Xcent, full_matrices=False)
pcomp_tsvd = U[:, :K] @ np.diag(S[:K])

distance_matrix_tsvd = skmetrics.pairwise.pairwise_distances(pcomp_tsvd, metric='euclidean')

model = AgglomerativeClustering(n_clusters = len(unique_labels), linkage = 'average', metric = 'precomputed')
Y_pred_tsvd = model.fit_predict(distance_matrix_tsvd)

skmetrics.adjusted_mutual_info_score(Ylabels, Y_pred_tsvd)

0.3878603292380521

About

Getting set up

Sample counts of input data

Split into training and test data

Clustering from distance matrix

Comparison Metrics

Adjusted Rand Index

Adjusted Mutual Information

Homogeneity and V-measure

Fowlkes-Mallows scores

Does distance matrix from truncated SVD improve score?

Further Reading