About

Here, I compare the prediction accuracy of several linear regression methods using the simulation examples of Kim, Wang, Carbonetto and Stephens (KWCS 2020). This analysis reproduces four simulation scenarios (see below) from Figures 2 and 3 of their manuscript and adds four more. I compare all combinations of signal strength (PVE $= 0.5$ and $0.95$), correlation of the predictors ($\rho = 0$ and $0.95$), and sample size ($n = 100 < p = 200$ and $n = 500 > p$), which gives the 8 scenarios presented here.

The four simulation scenarios reproduced from KWCS (2020) are:

  1. Low-dimension ($n = 500$, $\rho = 0$, PVE $= 0.5$)
  2. High-dimension ($n = 100$, $\rho = 0$, PVE $= 0.5$)
  3. Strong-signal ($n = 500$, $\rho = 0$, PVE $= 0.95$)
  4. EquiCorrGauss ($n = 100$, $\rho = 0.95$, PVE $= 0.5$)

The setting in which Fabio Morgante found the Lasso to perform better than Mr.ASH would correspond to ($n = 100$, $\rho = 0$, PVE $= 0.95$).

The simulation pipeline is implemented using Dynamic Statistical Comparisons (DSC).
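
The exact data-generating DSC modules are not shown in this notebook. As a minimal sketch of what one scenario could look like (the function below is hypothetical, not the actual DSC code): predictors are equicorrelated Gaussians with pairwise correlation $\rho$, a fraction (sfrac in the settings below) of the $p$ coefficients is non-zero, and the residual noise is scaled so that the proportion of variance explained matches the target PVE.

import numpy as np

def simulate_scenario(n, p, sfrac, rho, pve, seed=0):
    # Hypothetical sketch of one simulation scenario, not the actual DSC module.
    rng = np.random.default_rng(seed)
    # Equicorrelated Gaussian predictors: cov(X_j, X_k) = rho for j != k
    z0 = rng.standard_normal((n, 1))
    X  = np.sqrt(1 - rho) * rng.standard_normal((n, p)) + np.sqrt(rho) * z0
    # Sparse effects: p_causal = sfrac * p non-zero coefficients
    b  = np.zeros(p)
    causal = rng.choice(p, size=max(1, int(sfrac * p)), replace=False)
    b[causal] = rng.standard_normal(causal.size)
    # Choose the noise s.d. so that PVE = var(Xb) / (var(Xb) + se^2)
    signal_var = np.var(X @ b)
    se = np.sqrt(signal_var * (1 - pve) / pve)
    y  = X @ b + se * rng.standard_normal(n)
    return X, y, b, se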

Importing packages

Non-standard packages include DSC and PyMir. The simulation repository needs to be on the Python path so that some of its utilities can be imported.

import pandas as pd
import numpy as np
import math
import os
import sys
import collections

srcdir = "/home/saikat/Documents/work/ebmr/simulation/eb-linreg-dsc"
sys.path.append(os.path.join(srcdir, "analysis"))
import dscrutils2py as dscrutils
import methodprops
import methodplots

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pymir import mpl_stylesheet
from pymir import mpl_utils
from pymir import pd_utils
mpl_stylesheet.banskt_presentation()

DSC results

I have run the simulations using the following settings.

# (n, p) for the high- and low-dimensional settings
highdims = (100, 200)
lowdims  = (500, 200)

# fraction of non-zero predictors, sfrac = p_causal / p
sfracs   = [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]

# PVE
pve_list = [0.5, 0.95]

# rho
rho_list = [0, 0.95]

# Output directory
dsc_outdir = os.path.join(srcdir, "dsc/dsc_result")
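
For reference, the full grid of scenarios can be enumerated directly from these settings: the two dimension settings, two values of $\rho$ and two values of PVE give the 8 scenarios, each further crossed with the 7 sparsity levels.

import itertools

# Enumerate the 8 scenarios (each is crossed with the 7 values of sfrac)
for (n, p), rho, pve in itertools.product([lowdims, highdims], rho_list, pve_list):
    print(f"n = {n}, p = {p}, rho = {rho}, PVE = {pve}")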

Read the results of the simulations and store them in a DataFrame.

targets = ["simulate", "simulate.dims", "simulate.se", "simulate.rho",
           "simulate.sfrac", "simulate.pve", "fit", "fit.DSC_TIME", "mse.err"]
dscout = dscrutils.dscquery(dsc_outdir, targets)
dscout['score1'] = np.sqrt(dscout['mse.err'])/dscout['simulate.se']
Calling: dsc-query /home/saikat/Documents/work/ebmr/simulation/eb-linreg-dsc/dsc/dsc_result -o /tmp/RtmpRh1NxB/file66416d28fb15.csv --target "simulate simulate.dims simulate.se simulate.rho simulate.sfrac simulate.pve fit fit.DSC_TIME mse.err" --force 
Loaded dscquery output table with 21280 rows and 12 columns.
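
The quantity score1 is the prediction error metric used in the figures below: the test RMSE scaled by the residual standard deviation of the simulation, score1 $= \sqrt{\texttt{mse.err}} \, / \, \texttt{simulate.se}$. A value near 1 indicates that a method predicts at the noise level of the true model; larger values indicate worse predictions.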

dscout

       DSC  simulate       simulate.dims  simulate.se  simulate.rho  simulate.sfrac  simulate.pve  fit      fit.DSC_TIME  mse.err    score1
0      1    indepgauss     (500,200)      3.462448     0.00          0.010           0.50          l0learn  0.417         12.563997  1.023719
1      1    indepgauss     (100,200)      0.861522     0.00          0.010           0.50          l0learn  0.872         0.654125   0.938780
2      1    indepgauss     (500,200)      1.905777     0.00          0.025           0.50          l0learn  0.407         3.574130   0.992003
3      1    indepgauss     (100,200)      1.249624     0.00          0.025           0.50          l0learn  0.800         1.806580   1.075596
4      1    indepgauss     (500,200)      3.063747     0.00          0.050           0.50          l0learn  0.372         9.284629   0.994556
...    ...  ...            ...            ...          ...           ...             ...           ...      ...           ...        ...
21275  20   equicorrgauss  (100,200)      2.003817     0.95          0.250           0.95          mcp      0.839         10.202592  1.594033
21276  20   equicorrgauss  (500,200)      0.856860     0.95          0.500           0.95          mcp      17.814        1.525528   1.441453
21277  20   equicorrgauss  (100,200)      0.841918     0.95          0.500           0.95          mcp      0.866         6.398867   3.004566
21278  20   equicorrgauss  (500,200)      0.964054     0.95          1.000           0.95          mcp      18.341        5.848087   2.508451
21279  20   equicorrgauss  (100,200)      2.148883     0.95          1.000           0.95          mcp      0.789         15.720326  1.845092

21280 rows × 11 columns
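
The row count is consistent with the settings above: 20 replicates $\times$ 2 dimension settings $\times$ 2 values of $\rho$ $\times$ 2 values of PVE $\times$ 7 sparsity levels $\times$ 19 methods $= 21280$ rows, implying 19 fitted methods in total, of which a subset is displayed below.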

Select the methods to be displayed in the figures.

whichmethods = ["l0learn",
                "lasso",
                #"lasso_1se",
                "ridge",
                "elastic_net",
                #"elastic_net_1se", 
                "scad", 
                "mcp",
                "blasso", 
                "bayesb", 
                "susie", 
                "varbvs", 
                "varbvsmix", 
                "mr_ash"]

High-dimension setting ($p > n$)

Simulations were performed with $n = 100$, $p = 200$.

highdim_condition = [f"$(simulate.dims) == '({highdims[0]},{highdims[1]})'"]
resdf1            = pd_utils.select_dfrows(dscout, highdim_condition)

methodplots.create_figure_prediction_error(whichmethods, resdf1, highdims, rho_list, pve_list, sfracs)

Low-dimension setting ($p < n$)

Simulations were performed with $n = 500$, $p = 200$.

lowdim_condition = [f"$(simulate.dims) == '({lowdims[0]},{lowdims[1]})'"]
resdf2           = pd_utils.select_dfrows(dscout, lowdim_condition)

methodplots.create_figure_prediction_error(whichmethods, resdf2, lowdims, rho_list, pve_list, sfracs)

Computational time

Computational time is shown for one particular simulation scenario; this needs to be extended to varying $n$ and $p$.

fig = plt.figure()
ax1 = fig.add_subplot(111)

# Choose settings
dims    = lowdims
pve     = 0.5
rho     = 0
sfrac   = 0.5
data    = resdf2.copy() if dims == lowdims else resdf1.copy()
s       = max(1, int(sfrac * dims[1]))   # number of non-zero (causal) predictors

# Plot
methodplots.single_plot_computational_time(ax1, data, "fit.DSC_TIME", whichmethods, pve, rho, dims, sfrac)
ax1.set_xlabel("Computational Time (in seconds)")
ax1.set_title(r"($n=$" + f"{dims[0]}, " + r"$p=$" + f"{dims[1]}, " + r"$s=$" + f"{s}, " + 
              f"PVE = {pve}, " + 
              r"$\rho = $" + f"{rho})",
              pad = 40)

plt.show()
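
A tabular alternative to the plot, under the same assumptions about the column names: the median runtime per method for the chosen setting, computed with a pandas groupby.

# Median runtime (in seconds) per method for the chosen setting
subset = data[(data["simulate.pve"] == pve)
              & (data["simulate.rho"] == rho)
              & (data["simulate.sfrac"] == sfrac)
              & (data["fit"].isin(whichmethods))]
print(subset.groupby("fit")["fit.DSC_TIME"].median().sort_values())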