2023, May

2023-05-24

Group meeting: we listened to talks from “The Biology of Genomes”, a conference held at CSHL from May 9 to May 13, 2023. The two talks were:

“Discovering stimulatory state specific T2D GWAS mechanisms with single cell multi-omics on iPSC-derived FAP villages” by Christa Ventresca from the Stephen Parker lab,
“DragoNNFruit—Learning cis- and trans-regulation of chromatin accessibility at single base and single cell resolution” by Jacob Schreiber from the Kundaje lab.

Start a LaTeX document to write up the Frank-Wolfe algorithm for nuclear norm matrix factorization.

Python implementation for convex optimization using Frank-Wolfe algorithm.
I observed that the linear optimization problem using \(K = 40\) principal components in the first step of the FW algorithm retains the structure of the matrix, but this is wrong!
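For reference, a minimal sketch of Frank-Wolfe over a nuclear norm ball. The objective \(\frac{1}{2}\|X - Y\|_F^2\), the radius `r`, the step size \(2/(t+2)\) and the function name are assumptions made for illustration, not the setup used in these notes. In this standard formulation the linear minimization step needs only the leading singular pair of the gradient.

```python
import numpy as np


def frank_wolfe_nuclear(Y, r, max_iter=200):
    """Sketch: minimize 0.5 * ||X - Y||_F^2 subject to ||X||_* <= r.

    The objective and step-size rule are illustrative assumptions.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.zeros_like(Y)  # X = 0 is feasible for any r >= 0
    for t in range(max_iter):
        G = X - Y  # gradient of the assumed objective
        # Linear minimization oracle over the nuclear norm ball:
        # argmin_{||S||_* <= r} <G, S> = -r * u1 v1^T, where (u1, v1)
        # is the leading singular pair of G. A full SVD is used here for
        # simplicity; a truncated solver would be cheaper on large matrices.
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        S = -r * np.outer(U[:, 0], Vt[0])
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        X = (1.0 - gamma) * X + gamma * S  # convex combination stays feasible
    return X
```

Every iterate is a convex combination of points inside the ball, so feasibility holds by construction; the rank of the iterate can grow by at most one per iteration.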

Implement robust PCA using ADMM (Candès’ algorithm) described here. I followed (copied) the implementation by N. Dorukhan Sergin.
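A minimal sketch of robust PCA by principal component pursuit, solved with an augmented Lagrangian / ADMM iteration. This is not the copied implementation; the function names are hypothetical, and the defaults for `lam` and `mu` follow the commonly used choices \(\lambda = 1/\sqrt{\max(n_1, n_2)}\) and \(\mu = n_1 n_2 / (4\|M\|_1)\), which are assumptions here.

```python
import numpy as np


def shrink(X, tau):
    """Elementwise soft-thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Sketch: split M into low-rank L plus sparse S.

    Solves min ||L||_* + lam * ||S||_1 subject to L + S = M.
    """
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))
    if mu is None:
        mu = n1 * n2 / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # dual variable for the constraint L + S = M
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)  # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)         # sparse update
        R = M - L - S                                # primal residual
        Y = Y + mu * R                               # dual ascent step
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S
```

The stopping rule checks only the primal residual relative to \(\|M\|_F\); a stricter implementation might also monitor the dual residual.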

Implement Delchambre’s weighted PCA.
I observed hidden structures for the summary statistics in conventional PCA, but I did not observe structures in weighted PCA. Why?
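One sanity check that may help when comparing the two: with uniform weights, a weighted PCA should reproduce conventional PCA. Below is a simplified sketch in the spirit of Delchambre's weighted covariance, not the full published algorithm. The function name, the samples-by-features layout, and the use of squared weights in the per-row least squares are assumptions.

```python
import numpy as np


def weighted_pca(X, W, k):
    """Simplified weighted PCA sketch (rows = samples, columns = features).

    W holds one non-negative weight per entry of X. Every feature is assumed
    to have some non-zero weight, otherwise the divisions below fail.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    n = X.shape[0]
    mean = (W * X).sum(axis=0) / W.sum(axis=0)  # weighted feature means
    Xc = X - mean
    Xw = W * Xc
    cov = (Xw.T @ Xw) / (W.T @ W)               # weighted covariance
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1][:k]         # top-k eigenvalues first
    comps = evecs[:, order]                     # shape (features, k)
    # Coefficients for each sample by weighted least squares.
    coefs = np.empty((n, k))
    for i in range(n):
        A = comps.T * W[i] ** 2                 # shape (k, features)
        coefs[i] = np.linalg.solve(A @ comps, A @ Xc[i])
    return comps, coefs, mean
```

If the uniform-weight result matches conventional PCA but the real weights remove the structure, the difference comes from the weights themselves rather than from the implementation.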

Implement preprocessing of the \(\beta\) and SE matrices in Python. Required for playing with PCA and understanding the structure of the data.

Meeting with David. We discussed data cleaning and its effect on weighted PCA. See Slack for details.
Probgen reading club, presented by Scott Adamson. Chapter 8 from Population Genetics book by Graham Coop.

NPD Summary Statistics. Look at weighted PCA after data cleaning. The principal components do not appear to show much distinguishable structure.

Make changes and create pull request on DSC for supporting collections.abc and rpy2 v3.5.11.
Numerical experiments for dense linear regression for correlated variables.
Plot results (ELBO, RMSE, runtime) for dense linear regression for correlated variables.
Run more replicates for the above simulation.
Seminar by Julia Domingo

The GradVI trend filtering runtime experiments failed. Troubleshooting: 40G of memory is not enough for the jobs. I removed the jobs with \(n = (10^5, 10^6)\) and submitted new jobs with 100G of memory on an interactive node. The idea is to run the large jobs separately for GradVI.
Simulate first order trend filtering data with less memory.
How to run GradVI without generating the \(\mathbf{H}\) matrix? I derived equations for obtaining \(d_j\) without generating \(\mathbf{H}\) and made the required changes in the GradVI software. This introduced a bug that prevents running trend filtering for higher orders.
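A minimal sketch of the matrix-free idea for the simplest (piecewise-constant) case only. It assumes that \(\mathbf{H}\) is the \(n \times n\) lower-triangular matrix of ones and that \(d_j\) denotes the squared norm of column \(j\) of \(\mathbf{H}\); both are assumptions, and the function names are hypothetical rather than part of the GradVI software. Higher orders use a different basis and are not covered here.

```python
import numpy as np


def H_dot(b):
    """H @ b without forming H: (H b)_i = sum_{j <= i} b_j."""
    return np.cumsum(b)


def Ht_dot(r):
    """H.T @ r without forming H: (H^T r)_j = sum_{i >= j} r_i."""
    return np.cumsum(r[::-1])[::-1]


def H_col_sqnorm(n):
    """d_j = sum_i H_ij^2. Column j (0-based) has n - j ones."""
    return np.arange(n, 0, -1, dtype=float)
```

Each operation costs \(O(n)\) in time and memory instead of the \(O(n^2)\) needed to store \(\mathbf{H}\), which matters at \(n = 10^5\) or \(10^6\). Comparing against an explicit `np.tril(np.ones((n, n)))` for small \(n\) is a cheap way to catch indexing errors.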