Topics in Big Data Statistics: Subbagging and Robust Distributed Computing

Authors

Li, Xian

Abstract

The analysis of big data is increasingly pervasive: the sheer volume of these data provides unique opportunities to discover subtle population patterns that cannot be detected in small datasets, while presenting additional computational challenges to traditional statistical approaches. This thesis focuses on developing methods for big data analysis. Specifically, Chapters 2 and 3 explore subbagging (subsample aggregating) methods for estimation and variable selection under computational constraints, while Chapter 4 addresses distributed computing, where data are stored across multiple locations. Due to the volume of big data, analyzing all the data on a single computer is often infeasible given limited computational resources, such as memory constraints. To address this challenge, Chapter 2 introduces subbagging estimation, a method that randomly draws multiple subsamples, each of the same size, by sampling without replacement. Each subsample produces a subsample estimator, and the average of these estimators forms the subbagging estimator. Using the framework of incomplete U-statistics with an infinite-order kernel, we show that the subbagging estimator maintains the same convergence rate as the full sample estimator under certain conditions. Asymptotic normality is established, revealing an inflation in its asymptotic variance compared to the full sample estimator. Following Chapter 2, Chapter 3 develops a subbagging approach for variable selection in regression. The proposed approach not only makes variable selection scalable under constrained computational resources, but also preserves the statistical efficiency of the resulting estimator. In particular, we propose a subbagging loss function that aggregates the least-squares approximations of the subsample loss functions.
Subsequently, we penalize the subbagging loss function via an adaptive LASSO-type regularizer and obtain a regularized estimator that performs the subbagging variable selection. We then demonstrate that the regularized estimator achieves the same convergence rate as the full sample analysis and possesses the oracle properties. In addition, we propose a subbagging Bayesian information criterion to select a proper regularization parameter, ensuring that the regularized estimator achieves selection consistency. When data are stored across multiple locations, directly pooling all the data for statistical analysis may be impossible due to communication costs and privacy concerns. Distributed computing systems allow the analysis of such data by having local servers carry out their own statistical analyses and a central processor aggregate the local results. Naive aggregation of local statistics using simple or weighted averages is vulnerable to contamination within a distributed computing system. Chapter 4 investigates a robust aggregation method in distributed computing. We propose and investigate a Huber-type aggregation of M-estimators from local servers when the local estimates are contaminated. Implementing the Huber-type aggregation requires a robust estimator of the asymptotic variance of the local M-estimators, which we obtain by using the robust spatial median to aggregate the variance estimates from local servers. Theoretically, the Huber-type aggregation achieves the same convergence rate as if all the data were pooled, and its asymptotic normality is established for inference purposes. These results further enable the development of a two-step approach for sequentially detecting contaminated local M-estimates and variance estimates. Simulation studies in each chapter assess the numerical performance of each method and lend further support to the theoretical results.
Finally, the developed methods are applied to real datasets within each chapter to demonstrate their usefulness.
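
As a rough illustration of the subbagging estimation described in the abstract, the following Python sketch draws several equal-size subsamples without replacement, computes an estimator on each, and averages the results. The function name `subbagging_estimate`, its parameters, the simulated data, and the choice of the sample mean as the estimator are assumptions made for illustration only; they are not taken from the thesis.

```python
import numpy as np


def subbagging_estimate(data, estimator, subsample_size, n_subsamples, seed=0):
    """Average an estimator over subsamples drawn without replacement.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimates = []
    for _ in range(n_subsamples):
        # Every subsample has the same size and is drawn without replacement.
        idx = rng.choice(n, size=subsample_size, replace=False)
        estimates.append(estimator(data[idx]))
    # The subbagging estimator is the average of the subsample estimators.
    return np.mean(estimates, axis=0)


# Usage on simulated data (assumed setup): estimate a population mean of 2.0
# while each estimator call only touches 1,000 of the 100,000 observations.
sim_rng = np.random.default_rng(1)
x = sim_rng.normal(loc=2.0, scale=1.0, size=100_000)
theta_hat = subbagging_estimate(x, np.mean, subsample_size=1_000, n_subsamples=50)
```

Only one subsample needs to be held in memory at a time, which is the computational motivation given in the abstract. The sketch does not reproduce the variance inflation results or the conditions under which the convergence rate is preserved.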
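
The robust aggregation idea from Chapter 4 can be sketched in a simplified scalar setting. The code below combines local estimates with Huber weights, using iteratively reweighted averaging, and takes its scale from the median of the local variance estimates (in one dimension the spatial median reduces to the ordinary median). The function name `huber_aggregate`, the tuning constant `c=1.345`, the iterative scheme, and the toy data are all assumptions for illustration; the thesis treats general M-estimators and may differ in every one of these details.

```python
import numpy as np


def huber_aggregate(local_estimates, local_variances, c=1.345, tol=1e-10, max_iter=100):
    """Huber-type aggregation of scalar local estimates (illustrative sketch)."""
    theta_k = np.asarray(local_estimates, dtype=float)
    # Robust scale: in 1-D the spatial median is the ordinary median.
    scale = np.sqrt(np.median(local_variances))
    theta = np.median(theta_k)  # robust starting value
    for _ in range(max_iter):
        r = np.abs(theta_k - theta) / scale
        # Huber weights: 1 for |r| <= c, c / |r| otherwise.
        w = np.minimum(1.0, c / np.maximum(r, 1e-12))
        new_theta = np.sum(w * theta_k) / np.sum(w)
        converged = abs(new_theta - theta) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Toy example (assumed data): 18 clean local estimates around 1.0 and
# 2 contaminated servers reporting 10.0.
local_est = np.concatenate([np.linspace(0.9, 1.1, 18), [10.0, 10.0]])
local_var = np.full(20, 0.05 ** 2)

naive = local_est.mean()                        # pulled toward the contaminated values
robust = huber_aggregate(local_est, local_var)  # stays near the clean estimates
```

Contaminated estimates receive small weights rather than being discarded outright, so the aggregate is bounded in how far a few bad servers can move it. The two-step detection procedure mentioned in the abstract is not shown here.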
