We implement the Frank-Wolfe algorithm for nuclear norm regularization and compare it with the algorithm proposed by Duchi et al. for projection onto the nuclear norm ball.
About
David proposed a convex optimization algorithm for nuclear norm minimization in weighted matrix factorization. This is a pilot implementation of the algorithm, following his implementation (see here). Since I was not sure whether the weighted matrix factorization was working, I implemented matrix completion to confirm that the algorithm can recover missing data.
In this proof-of-concept, we show that NNWMF (we need a better acronym) can indeed recover the missing data, despite the noise in real data, using a unit weight matrix.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils.extmath import randomized_svd
import matplotlib.pyplot as plt
from pymir import mpl_stylesheet
from pymir import mpl_utils

mpl_stylesheet.banskt_presentation(splinecolor = 'black', dpi = 120, colors = 'kelly')
Data
Summary statistics for NPD were collected from PGC, OpenGWAS and GTEx. See previous work for data cleaning and filtering. Our input is the Z-score matrix for N diseases and P variants.
Code
data_dir = "../data"
beta_df_filename = f"{data_dir}/beta_df.pkl"
prec_df_filename = f"{data_dir}/prec_df.pkl"
se_df_filename = f"{data_dir}/se_df.pkl"
zscore_df_filename = f"{data_dir}/zscore_df.pkl"

'''
Data Frames for beta, precision, standard error and zscore.
'''
beta_df = pd.read_pickle(beta_df_filename)
prec_df = pd.read_pickle(prec_df_filename)
se_df = pd.read_pickle(se_df_filename)
zscore_df = pd.read_pickle(zscore_df_filename)

trait_df = pd.read_csv(f"{data_dir}/trait_meta.csv")
phenotype_dict = trait_df.set_index('ID')['Broad'].to_dict()
select_ids = beta_df.columns
X = np.array(zscore_df.replace(np.nan, 0)[select_ids]).T
colmeans = np.mean(X, axis = 0, keepdims = True)
Xcent = X - colmeans
labels = [phenotype_dict[x] for x in select_ids]
unique_labels = list(set(labels))
print (f"We have {Xcent.shape[0]} samples (phenotypes) and {Xcent.shape[1]} features (variants)")
We have 69 samples (phenotypes) and 10068 features (variants)
Code
U, S, Vt = np.linalg.svd(Xcent, full_matrices = False)
print (f"Nuclear Norm of input matrix: {np.sum(S)}")
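For reference, the nuclear norm of a matrix is the sum of its singular values,
$$\|X\|_* = \sum_i \sigma_i(X),$$
which is what `np.sum(S)` computes above.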
The idea is to use the precision matrix for the weights of the Z-scores. However, this is still under development. For now, we use equal weights for all Z-scores to compare NNWMF against classical SVD.
In the following, we implement the Frank-Wolfe optimization. The implementation currently uses a predetermined step-size schedule that decreases with the iteration number.
To-do: Use line search for the step size.
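For reference, each Frank-Wolfe iteration solves a linear subproblem over the nuclear norm ball of radius $r$ and then takes a convex combination step,
$$S_k = \underset{\|S\|_* \le r}{\arg\min}\ \langle S, \nabla f(X_k) \rangle = -r\, u_1 v_1^\top, \qquad X_{k+1} = (1 - \gamma_k)\, X_k + \gamma_k\, S_k, \qquad \gamma_k = \frac{2}{k + 2},$$
where $u_1, v_1$ are the top left and right singular vectors of the gradient, approximated by the power method in the code below.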
def f_objective(X, Y, mask = None):
    '''
    Objective function
    Y is observed, X is estimated
    '''
    Xmask = X if mask is None else X * mask
    return 0.5 * np.linalg.norm(Xmask - Y, 'fro')**2

def f_gradient(X, Y, mask = None):
    '''
    Gradient of the objective function.
    '''
    Xmask = X if mask is None else X * mask
    return Xmask - Y

def f_rmse(X, Y, mask = None):
    '''
    RMSE for masked CV
    '''
    Xmask = X if mask is None else X * mask
    return np.sqrt(np.mean(np.square(Xmask - Y)))

def linopt_oracle(grad, r = 1.0):
    '''
    Linear optimization oracle,
    where the feasible region is a nuclear norm ball for some r
    '''
    U1, V1_T = singular_vectors_power_method(grad, max_iter = 5)
    S = - r * U1 @ V1_T
    return S

def singular_vectors_randomized_method(X, max_iter = 10):
    u, s, vh = randomized_svd(X, n_components = 1, n_iter = max_iter,
                              power_iteration_normalizer = 'none', random_state = 0)
    return u, vh

def singular_vectors_power_method(X, max_iter = 10):
    '''
    Power method.
    Computes approximate top left and right singular vectors.

    Parameters:
    -----------
    X : array {m, n}, input matrix
    max_iter : integer, optional
        number of steps

    Returns:
    --------
    u, v : (n, 1), (p, 1)
        two arrays representing approximate top left and right singular vectors.
    '''
    n, p = X.shape
    u = np.random.normal(0, 1, n)
    u /= np.linalg.norm(u)
    v = X.T.dot(u)
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        u = X.dot(v)
        u /= np.linalg.norm(u)
        v = X.T.dot(u)
        v /= np.linalg.norm(v)
    return u.reshape(-1, 1), v.reshape(1, -1)

def frank_wolfe_minimize_step(X, Y, r, istep, mask = None):
    G = f_gradient(X, Y, mask = mask)
    S = linopt_oracle(G, r)
    dg = np.trace((X - S).T @ G)
    step = 2. / (2 + istep)
    #err = old_X + S
    #step = np.sum((G * err) / np.square(err))
    #step = min(step, 1)
    Xnew = X + step * (S - X)
    return Xnew, G, dg, step

def frank_wolfe_minimize(Y, weight, r, X0 = None, mask = None,
                         max_iter = 1000, tol = 1e-8, return_all = True, debug = False):
    # Step 0
    old_X = np.zeros_like(Y) if X0 is None else X0.copy()
    dg = np.inf
    if return_all:
        dg_list = [dg]
        fx_list = [f_objective(old_X, Y, mask)]
    # Steps 1, ..., max_iter
    for istep in range(max_iter):
        X, G, dg, step = \
            frank_wolfe_minimize_step(old_X, Y, r, istep, mask)
        if return_all:
            dg_list.append(dg)
            fx_list.append(f_objective(X, Y))
        if debug:
            if (istep % 10 == 0):
                print (f"Iteration {istep}. Step size {step:.3f}. Duality Gap {dg:g}")
        if np.abs(dg) <= tol:
            break
        old_X = X.copy()
    if return_all:
        return X, dg_list, fx_list
    else:
        return X

def is_monotonically_increasing(x):
    return all(x[i] >= x[i - 1] for i in range(1, x.shape[0]))

def frank_wolfe_cv_minimize(Y, weight, X0 = None, r_seq = None,
                            max_iter = 1000, tol = 1e-8, test_size = 0.33,
                            return_all = False, debug_fw = False, debug = False):
    # Prepare CV masks
    n, p = Y.shape
    train_mask = np.ones(n * p)
    train_mask[:int(test_size * n * p)] = 0
    np.random.shuffle(train_mask)
    train_mask = train_mask.reshape(n, p)
    test_mask = 1 - train_mask
    Ytrain = Y * train_mask
    Ytest = Y * test_mask
    # Prepare rseq
    if r_seq is None:
        r_min = 1
        r_max = 4.0 * nuclear_norm(Y)
        nseq = int(np.floor(np.log2(r_max)) + 1) + 1
        #r_seq = np.logspace(-ndec, nseq - 1, num = nseq + ndec, base = 2.0)
        r_seq = np.logspace(0, nseq - 1, num = nseq, base = 2.0)
    nseq = len(r_seq)
    train_err_dict = dict()
    test_err_dict = dict()
    old_X = None
    if debug:
        print (f"Perform CV at {nseq} positions.")
        print (r_seq)
    for iseq, r in enumerate(r_seq):
        X = frank_wolfe_minimize(Ytrain, weight, r,
                                 max_iter = max_iter, X0 = old_X, tol = tol,
                                 mask = train_mask, return_all = False, debug = debug_fw)
        train_err = f_rmse(X, Ytrain, train_mask)
        test_err = f_rmse(X, Ytest, test_mask)
        if debug:
            print(f"CV sequence {iseq + 1}, r = {r}, training error = {train_err}, test_error = {test_err}")
        train_err_dict[r] = train_err
        test_err_dict[r] = test_err
        old_X = X.copy()
    return r_seq, train_err_dict, test_err_dict

def nuclear_norm(X):
    '''
    Nuclear norm of input matrix
    '''
    return np.sum(np.linalg.svd(X)[1])

# Code taken from https://gist.github.com/daien/1272551
def simplex_projection(s, alpha = 1.0):
    '''
    Compute the Euclidean projection on a positive simplex
    Solves the optimisation problem (using the algorithm from [1]):
        min_w 0.5 * || w - v ||_2^2 , s.t. \sum_i w_i = s, w_i >= 0

    Parameters
    ----------
    s: (n,) numpy array,
        n-dimensional vector to project
    alpha: int, optional, default: 1,
        radius of the simplex

    Returns
    -------
    w: (n,) numpy array,
        Euclidean projection of v on the simplex

    Notes
    -----
    The complexity of this algorithm is in O(n log(n)) as it involves sorting v.
    Better alternatives exist for high-dimensional sparse vectors (cf. [1])
    However, this implementation still easily scales to millions of dimensions.

    References
    ----------
    [1] Efficient Projections onto the l1-Ball for Learning in High Dimensions
        John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra.
        International Conference on Machine Learning (ICML 2008)
        http://www.cs.berkeley.edu/~jduchi/projects/DuchiSiShCh08.pdf
    '''
    n, = s.shape
    # check if we are already on the simplex
    if np.sum(s) == alpha and np.alltrue(s >= 0):
        # best projection: itself!
        return s
    # get the array of cumulative sums of a sorted (decreasing) copy of v
    u = np.sort(s)[::-1]
    cssv = np.cumsum(u)
    # get the number of > 0 components of the optimal solution
    rho = np.nonzero(u * np.arange(1, n + 1) > (cssv - alpha))[0][-1]
    # compute the Lagrange multiplier associated to the simplex constraint
    theta = float(cssv[rho] - alpha) / (rho + 1.0)
    # compute the projection by thresholding v using theta
    w = (s - theta).clip(min = 0)
    return w

def nuclear_projection(X, r = 1.0):
    '''
    Projection onto nuclear norm ball.
    '''
    U, s, Vt = np.linalg.svd(X, full_matrices = False)
    sproj = simplex_projection(s, alpha = r)
    return U @ np.diag(sproj) @ Vt
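As a quick sanity check (not part of the original notebook), the power-method oracle can be compared against the exact top singular vectors from a full SVD on a small random matrix; the absolute inner products should be close to 1.

# Hypothetical sanity check of singular_vectors_power_method (illustrative only)
rng = np.random.default_rng(0)
A = rng.normal(size = (50, 80))
u_pm, vt_pm = singular_vectors_power_method(A, max_iter = 20)
U_ex, _, Vt_ex = np.linalg.svd(A, full_matrices = False)
print(np.abs(u_pm.ravel() @ U_ex[:, 0]), np.abs(vt_pm.ravel() @ Vt_ex[0]))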
Numerical Experiment 1: Projection of input data on a nuclear norm ball of radius 1000
In our first experiment, we project the input data onto a nuclear norm ball of radius $r = 1000$, that is, we constrain the nuclear norm of the projection to be at most 1000.
Simplex projection
Project the input matrix onto the nuclear norm ball using the algorithm proposed by Duchi et al. in "Efficient projections onto the l1-ball for learning in high dimensions", Proc. 25th ICML, pages 272–279, ACM, 2008. This is computationally expensive, but it provides a reference against which to compare the results of the Frank-Wolfe algorithm.
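Concretely, because the nuclear norm of $X = U\,\mathrm{diag}(\sigma)\,V^\top$ is the $\ell_1$ norm of its singular values, the projection reduces to projecting the singular values:
$$P_{\{\|\cdot\|_* \le r\}}(X) = U\, \mathrm{diag}\big(\pi_r(\sigma)\big)\, V^\top,$$
where $\pi_r$ is the Euclidean projection of $\sigma$ onto the simplex $\{w : \sum_i w_i = r,\ w_i \ge 0\}$, as implemented in `simplex_projection` and `nuclear_projection` above. Note that the simplex projection enforces $\sum_i w_i = r$ exactly, which is why the projected matrix below has nuclear norm equal to $r$.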
Code
Xproj = nuclear_projection(Xcent, r = 1000)
Xproj = Xproj - np.mean(Xproj, axis = 0, keepdims = True)
U, S, Vt = np.linalg.svd(Xproj)
print(f"Nuclear norm of projected input matrix is {np.sum(S)}")
Nuclear norm of projected input matrix is 1000.0000000000019
In Figure 2, we show the progress of the optimization: the duality gap on the left and the objective function on the right.
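The traces in Figure 2 can be produced along the following lines; this is a minimal sketch, assuming unit weights and the same radius $r = 1000$ (the variable names `W` and `X_fw` are illustrative, not from the original notebook).

W = np.ones_like(Xcent)    # unit weight matrix (assumption; the weights are not yet used)
X_fw, dg_list, fx_list = frank_wolfe_minimize(Xcent, W, r = 1000,
                                              max_iter = 1000, tol = 1e-8, return_all = True)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 5))
ax1.plot(dg_list[1:])          # skip the initial infinite duality gap
ax1.set_xlabel("Iteration")
ax1.set_ylabel("Duality gap")
ax2.plot(fx_list)
ax2.set_xlabel("Iteration")
ax2.set_ylabel("Objective function")
plt.show()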
In Figure 3, we compare the singular values of the output signal from the nuclear norm projection and from the Frank-Wolfe algorithm. This gives us confidence in the implementation.
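A hypothetical sketch of the comparison shown in Figure 3, assuming `X_fw` from the sketch above:

S_proj = np.linalg.svd(Xproj, compute_uv = False)   # simplex projection (Duchi et al.)
S_fw = np.linalg.svd(X_fw, compute_uv = False)      # Frank-Wolfe solution
plt.plot(S_proj, label = "Simplex projection")
plt.plot(S_fw, label = "Frank-Wolfe")
plt.xlabel("Component")
plt.ylabel("Singular value")
plt.legend()
plt.show()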
In this numerical experiment, we plant missing entries in the input matrix and then obtain the nuclear norm projection, which can approximate the missing data. We compare the singular values obtained from the simplex projection and the Frank-Wolfe method in Figure 5.
Code
# Generate missing data
n, p = Xcent.shape
test_size = 0.33
O = np.ones(n * p)
ntest = int(test_size * n * p)
O[:ntest] = 0
np.random.shuffle(O)
O = O.reshape(n, p)
Xmiss = X * O
Code
Xmiss_proj = nuclear_projection(Xmiss, r = 1000)
Um, Sm, Vtm = np.linalg.svd(Xmiss_proj)
print(f"Nuclear norm of projected input matrix is {np.sum(Sm)}")

proj_err = np.sqrt(np.mean(np.square( (X * (1 - O)) - (Xmiss_proj * (1 - O)) )))
print (f"RMSE on test data = {proj_err}")
Nuclear norm of projected input matrix is 1000.0000000000011
RMSE on test data = 0.6567133713958933
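The cross-validation log below was presumably produced by `frank_wolfe_cv_minimize` with `debug = True` and a custom radius sequence; a minimal sketch (the input matrix, weights, and exact radii are assumptions inferred from the printed output):

r_seq = np.concatenate([np.logspace(3, 12, num = 10, base = 2.0),   # 8, 16, ..., 4096 (assumption)
                        [5000., 6000., 7000., 8000.]])
W = np.ones_like(Xcent)                                              # unit weights (assumption)
r_seq, train_err_dict, test_err_dict = frank_wolfe_cv_minimize(
    Xcent, W, r_seq = r_seq, max_iter = 1000, tol = 1e-8,
    test_size = 0.33, debug = True)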
Perform CV at 14 positions.
[ 8. 16. 32. 64. 128. 256. 512. 1024. 2048. 4096. 5000. 6000.
7000. 8000.]
CV sequence 1, r = 8.0, training error = 1.0445314519131719, test_error = 0.7379283863527464
CV sequence 2, r = 16.0, training error = 1.041037978310444, test_error = 0.7360553648577052
CV sequence 3, r = 32.0, training error = 1.0342448061985816, test_error = 0.7323992229046496
CV sequence 4, r = 64.0, training error = 1.0212031961118844, test_error = 0.7252923894633748
CV sequence 5, r = 128.0, training error = 0.9980512728715728, test_error = 0.7126499267407608
CV sequence 6, r = 256.0, training error = 0.9592552598273123, test_error = 0.6929325616806638
CV sequence 7, r = 512.0, training error = 0.8958869142145245, test_error = 0.6650417778754745
CV sequence 8, r = 1024.0, training error = 0.7931029154998555, test_error = 0.6294795062102471
CV sequence 9, r = 2048.0, training error = 0.6284386790184624, test_error = 0.5984243261174017
CV sequence 10, r = 4096.0, training error = 0.3992465647184987, test_error = 0.5813444546875715
CV sequence 11, r = 5000.0, training error = 0.36700702863471724, test_error = 0.5978681673912616
CV sequence 12, r = 6000.0, training error = 0.37273215971181584, test_error = 0.6282368213911593
CV sequence 13, r = 7000.0, training error = 0.39743960536609935, test_error = 0.6693131025636734
CV sequence 14, r = 8000.0, training error = 0.46060258805130466, test_error = 0.6992722149861493
In Figure 6, we show the training error for different values of the regularization parameter. Finally, we run the Frank-Wolfe optimization on the full data at the optimum value of the regularization parameter.
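A minimal sketch of that final step, assuming we take the radius with the smallest test error from the CV run above (here $r = 4096$); the variable names are illustrative:

r_opt = min(test_err_dict, key = test_err_dict.get)   # radius with the smallest test error
X_opt = frank_wolfe_minimize(Xcent, np.ones_like(Xcent), r_opt,
                             max_iter = 1000, tol = 1e-8, return_all = False)
print(f"Nuclear norm of the final estimate: {nuclear_norm(X_opt)}")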