Sparse data-driven random projection in regression for high-dimensional data
DOI: https://doi.org/10.52933/jdssv.v5i5.138

Keywords: high-dimensional data, dimension reduction, random projection, screening

Abstract
We examine the linear regression problem in a challenging high-dimensional setting with correlated predictors where the degree of sparsity of the coefficients is unknown and can vary from sparse to dense.
In this setting, we propose combining probabilistic variable screening with random projection tools as a computationally efficient approach. In particular, we introduce a new data-driven random projection for dimension reduction in linear regression, motivated by a theoretical bound on the gain in expected prediction error over conventional random projections when information about the true coefficient is used. The variables to be included in the projection are screened based on the correlation of the predictors. To reduce the dependence on fine-tuning choices, we aggregate over an ensemble of linear models. A threshold parameter is introduced to obtain a higher degree of sparsity; it can be chosen, together with the number of models in the ensemble, by cross-validation.
In extensive simulations, we compare the proposed method with other random projection tools and with established regression methods, and show that it is competitive in terms of prediction performance across a variety of sparsity and predictor covariance settings, whereas most competitors target either sparse or dense settings.
Finally, we illustrate the method on two data applications.
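As a rough illustration of the approach described above, the following minimal sketch builds an ensemble in which predictors are screened probabilistically and then mapped through a data-driven random projection before fitting a linear model. It is not the authors' implementation: the use of absolute marginal correlations with the response as both the screening probabilities and the projection scalings, as well as all function and parameter names, are assumptions made purely for illustration.

```python
# Minimal sketch, not the authors' implementation. Assumptions: screening
# probabilities and projection scalings come from absolute marginal
# correlations with the response; all names are illustrative.
import numpy as np

def fit_ensemble(X, y, n_models=20, goal_dim=20, screen_size=100, seed=None):
    """Fit an ensemble of linear models on screened, randomly projected data."""
    rng = np.random.default_rng(seed)
    n, n_pred = X.shape
    # absolute marginal correlations as a proxy for the unknown coefficients
    corr = np.abs(np.corrcoef(X, y, rowvar=False)[-1, :-1])
    probs = corr / corr.sum()
    models = []
    for _ in range(n_models):
        # probabilistic screening: sample a subset of predictors, favouring
        # those with larger marginal correlation
        k = min(screen_size, n_pred)
        active = rng.choice(n_pred, size=k, replace=False, p=probs)
        # data-driven random projection: Gaussian entries scaled columnwise by
        # the screened correlations, so informative predictors get more weight
        m = min(goal_dim, k)
        Phi = rng.standard_normal((m, k)) * corr[active]
        Z = np.column_stack([np.ones(n), X[:, active] @ Phi.T])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        models.append((active, Phi, beta))
    return models

def predict_ensemble(models, X):
    """Average the predictions of all projected linear models."""
    preds = [np.column_stack([np.ones(X.shape[0]), X[:, a] @ P.T]) @ b
             for a, P, b in models]
    return np.mean(preds, axis=0)
```

The sketch omits the thresholding step and the cross-validated choice of the threshold parameter and number of models described in the abstract; it only conveys how screening, data-driven projection, and ensemble averaging fit together.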