Optimal subsampling for linear models with heteroscedasticity

Jiayi Zheng; Dongqi Fu; Ziqiao Xu; Nicholas Rios

doi:10.52933/jdssv.v6i1.167

Authors

Jiayi Zheng George Mason University https://orcid.org/0009-0000-6248-1223
Dongqi Fu George Mason University
Ziqiao Xu George Mason University
Nicholas Rios George Mason University

DOI:

https://doi.org/10.52933/jdssv.v6i1.167

Keywords:

Design of experiments, I-optimality, optimal subsampling, heteroscedasticity.

Abstract

In recent years, the size of datasets has dramatically increased. This has encouraged the use of subsampling, where only a subset of the full dataset is used to fit a model in a more computationally efficient manner. Existing methods do not provide much guidance on how to find optimal subsamples for a linear model when the variance of the errors depends on the model covariates through an unknown function. This paper presents three main contributions that aid in finding optimal subsamples in the case of heteroscedastic errors. First, a kernel-based method is proposed for estimating the error variances in the full dataset based on a Latin Hypercube subsample. Second, a generalized version of the Information-Based Optimal Subdata Selection (IBOSS) algorithm is introduced that uses the variance estimates to find subsamples with high D-efficiency. Third, an Approximate Nearest Neighbor Simulated Annealing (ANNSA) algorithm is used to find subsamples that are efficient under the I-optimality criterion, which seeks to minimize integrated prediction error variance. Simulations show that the proposed subsampling algorithms have better D- and I-efficiencies than existing methods. The subsampling methods are used to analyze an airline dataset with over 7 million rows.
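To make the two criteria named in the abstract concrete, the following is a minimal sketch of how D- and I-type criteria can be evaluated for a candidate subsample when the error variances are treated as known. It uses the standard weighted least squares information matrix and is not the paper's IBOSS or ANNSA procedure. The function names, the simulated data, and the variance function `exp(x1)` are hypothetical choices made for illustration only.

```python
import numpy as np

def weighted_information(X, sigma2):
    """Information matrix X^T diag(1 / sigma2) X for weighted least squares."""
    return X.T @ (X / sigma2[:, None])

def d_criterion(X, sigma2):
    """Log-determinant of the weighted information matrix (larger is better)."""
    sign, logdet = np.linalg.slogdet(weighted_information(X, sigma2))
    return logdet if sign > 0 else -np.inf

def i_criterion(X, sigma2, X_region):
    """Average variance of the fitted mean over the rows of X_region (smaller is better)."""
    M_inv = np.linalg.inv(weighted_information(X, sigma2))
    W = X_region.T @ X_region / X_region.shape[0]  # moment matrix of the region
    return float(np.trace(M_inv @ W))

# Illustrative data: intercept plus two covariates, with an assumed
# (hypothetical) variance function that depends on the first covariate.
rng = np.random.default_rng(0)
n, k = 5000, 200
x = rng.uniform(-1, 1, size=(n, 2))
X_full = np.column_stack([np.ones(n), x])
sigma2_full = np.exp(x[:, 0])

# Evaluate a simple random subsample of size k under both criteria.
idx = rng.choice(n, size=k, replace=False)
print("log det:", d_criterion(X_full[idx], sigma2_full[idx]))
print("avg. prediction variance:", i_criterion(X_full[idx], sigma2_full[idx], X_full))
```

In practice the variances are unknown and would have to be replaced by estimates, which is the role the abstract assigns to the kernel-based estimator. A subsampling algorithm would then search for the index set that improves one of these criteria relative to a random subsample.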
