Journal: Signal Processing, IEEE Transactions on

Robust, Scalable, and Fast Bootstrap Method for Analyzing Large Scale Data



Abstract

In this paper we address the problem of performing statistical inference for large-scale data sets, i.e., Big Data. The volume and dimensionality of the data may be so high that the data cannot be processed or stored in a single computing node. We propose a scalable, statistically robust, and computationally efficient bootstrap method that is compatible with distributed processing and storage systems. Bootstrap resamples are constructed with a smaller number of distinct data points on multiple disjoint subsets of the data, similarly to the bag of little bootstraps (BLB) method [A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A scalable bootstrap for massive data,” J. Roy. Statist. Soc.: Ser. B (Statist. Methodol.), vol. 76, no. 4, pp. 795–816, 2014]. The disjoint subsets are significantly smaller than the original full data set, and they may be processed in parallel in different storage and computing units. Significant savings in computation are then achieved by avoiding recomputation of the estimator for each bootstrap sample. Instead, a computationally efficient fixed-point estimation equation is solved analytically via a smart approximation, following the fast and robust bootstrap (FRB) method [M. Salibián-Barrera, S. Van Aelst, and G. Willems, “Fast and robust bootstrap,” Statist. Methods Appl., vol. 17, no. 1, pp. 41–71, 2008]. The proposed bootstrap method facilitates the use of highly robust statistical methods in analyzing large-scale data sets. The favorable statistical properties of the method are established analytically. Numerical examples demonstrate the scalability, low complexity, and robust statistical performance of the method in analyzing large data sets.
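The subset-resampling idea the abstract borrows from BLB can be illustrated with a minimal Python sketch. It is not the paper's method: it covers only the BLB-style step (disjoint subsets of size b, each resampled up to the full size n through multinomial counts, so every resample has at most b distinct points) and recomputes the estimator for every resample, which is exactly the cost the FRB approximation is meant to avoid. The function names, the weighted-mean estimator, and the default subset size b = n^0.6 are assumptions chosen for illustration.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Mean of `values` where `weights[i]` is how often values[i] was drawn."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def blb_standard_error(data, estimator, n_subsets=4, subset_size=None,
                       n_resamples=30, seed=0):
    """BLB-style standard-error estimate (illustrative sketch only).

    `estimator(values, weights)` must accept integer resampling weights.
    The default subset size n**0.6 is an assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    n = len(data)
    b = subset_size or int(n ** 0.6)
    if b < 1 or n_subsets * b > n:
        raise ValueError("data set too small for the requested disjoint subsets")

    # Shuffle once, then cut disjoint subsets of b points each.
    shuffled = list(data)
    rng.shuffle(shuffled)
    subsets = [shuffled[i * b:(i + 1) * b] for i in range(n_subsets)]

    per_subset_se = []
    for subset in subsets:
        estimates = []
        for _ in range(n_resamples):
            # Draw n indices from the b subset points: a multinomial count
            # vector, so the resample has size n but at most b distinct points.
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts.get(i, 0) for i in range(b)]
            estimates.append(estimator(subset, weights))
        per_subset_se.append(statistics.stdev(estimates))

    # Average the per-subset estimates; subsets could be handled in parallel.
    return statistics.mean(per_subset_se)


if __name__ == "__main__":
    demo_rng = random.Random(42)
    sample = [demo_rng.gauss(0.0, 1.0) for _ in range(5000)]
    print("BLB-style standard error of the mean:",
          blb_standard_error(sample, weighted_mean))
    print("Textbook value 1/sqrt(n) for unit variance:", 5000 ** -0.5)
```

Each subset is processed independently inside the outer loop, which is what makes the scheme compatible with distributed storage: in a real deployment each subset would live on its own node and only the per-subset summaries would be combined.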
