Optimal subsampling for linear models with heteroscedasticity
DOI:
https://doi.org/10.52933/jdssv.v6i1.167
Keywords:
Design of experiments, I−optimality, optimal subsampling, heteroscedasticity.
Abstract
In recent years, the size of datasets has dramatically increased. This has encouraged the use of subsampling, where only a subset of the full dataset is used to fit a model in a more computationally efficient manner. Existing methods do not provide much guidance on how to find optimal subsamples for a linear model when the variance of the errors depends on the model covariates through an unknown function. This paper presents three main contributions that aid in finding optimal subsamples in the case of heteroscedastic errors. First, a kernel-based method is proposed for estimating the error variances in the full dataset based on a Latin Hypercube subsample. Second, a generalized version of the Information-Based Optimal Subdata Selection (IBOSS) algorithm is introduced that uses the variance estimates to find subsamples with high D−efficiency. Third, an Approximate Nearest Neighbor Simulated Annealing (ANNSA) algorithm is used to find subsamples that are efficient under the I−optimality criterion, which seeks to minimize integrated prediction error variance. Simulations show that the proposed subsampling algorithms have better D− and I−efficiencies than existing methods. The subsampling methods are used to analyze an airline dataset with over 7 million rows.
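To make the first two ideas in the abstract concrete, the following is a minimal sketch and not the paper's algorithms. It estimates a variance function by kernel-smoothing squared residuals from a pilot subsample, then uses the estimates to evaluate the log-determinant of the weighted information matrix (the quantity behind D-efficiency) and to fit weighted least squares on a subsample. Everything specific here is an assumption made for illustration: the simulated data, the Gaussian kernel with a fixed bandwidth, the uniformly drawn pilot and subsample (the paper uses a Latin Hypercube subsample and its own selection algorithms), and the function names `kernel_variance_estimate` and `log_det_information`.

```python
import numpy as np


def kernel_variance_estimate(X_pilot, sq_resid, X_query, bandwidth):
    """Nadaraya-Watson smoother of squared residuals (Gaussian kernel).

    Hypothetical helper: returns one variance estimate per row of X_query.
    """
    # Pairwise squared Euclidean distances, shape (n_query, n_pilot).
    d2 = ((X_query[:, None, :] - X_pilot[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / bandwidth**2)
    return (w @ sq_resid) / w.sum(axis=1)


def log_det_information(X, var, idx):
    """log det of F' W F for an intercept-plus-main-effects model.

    W is diagonal with entries 1 / var[i] for the selected rows idx.
    """
    F = np.column_stack([np.ones(len(idx)), X[idx]])
    M = (F / var[idx, None]).T @ F
    _, logdet = np.linalg.slogdet(M)
    return logdet


# Simulated full dataset (assumed, for illustration only).
rng = np.random.default_rng(0)
n, p, k = 2000, 2, 200
beta_true = np.array([1.0, 2.0, -1.0])
X = rng.uniform(-1.0, 1.0, size=(n, p))
sigma2 = 0.1 + X[:, 0] ** 2  # variance depends on the first covariate
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=np.sqrt(sigma2))

# Pilot subsample: uniform sampling stands in for a Latin Hypercube design.
pilot = rng.choice(n, size=500, replace=False)
F_pilot = np.column_stack([np.ones(len(pilot)), X[pilot]])
beta_ols, *_ = np.linalg.lstsq(F_pilot, y[pilot], rcond=None)
sq_resid = (y[pilot] - F_pilot @ beta_ols) ** 2

# Variance estimates for every row of the full dataset.
var_hat = kernel_variance_estimate(X[pilot], sq_resid, X, bandwidth=0.2)

# Evaluate a candidate subsample and fit weighted least squares on it.
sub = rng.choice(n, size=k, replace=False)
logdet_sub = log_det_information(X, var_hat, sub)
F_sub = np.column_stack([np.ones(k), X[sub]])
w_sub = 1.0 / var_hat[sub]
beta_wls = np.linalg.solve(
    (F_sub * w_sub[:, None]).T @ F_sub,
    (F_sub * w_sub[:, None]).T @ y[sub],
)
```

A subsample selection algorithm would search over `sub` to increase `logdet_sub` (for D-efficiency) or to decrease an integrated prediction variance (for I-efficiency); the uniform draw above is only a baseline against which such a search could be compared.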
