
Low-storage sequential methods for data mining and the analysis of massive datasets.


Abstract

In this thesis we study low-storage, sequential estimators and methods for the analysis of massive datasets. A low-storage, sequential estimator is one that uses very little storage relative to the size of the entire dataset and that can be updated sequentially as each new data point is observed. An example of such an estimator is the sample mean. We may update the sample mean sequentially by keeping track of the total number of points observed, $n'$, and the sum of these points, $\sum_{i=1}^{n'} x_i$. It is therefore a low-storage, sequential estimator: after each new point is observed, only these two quantities, $n'$ and $\sum_{i=1}^{n'} x_i$, need to be updated. While moment-based examples such as this one help illustrate the nature of the estimators we study, it is in general more useful to have estimators that are more robust to outliers, such as the sample quantiles. However, using the sample quantiles introduces the additional burdens of increased storage and computation time. Hence we would like an estimator with properties similar to those of the sample quantiles, but which requires much less storage and computation time. We propose methods for the sequential computation of a single quantile and for the simultaneous sequential estimation of an arbitrary set of multiple quantiles. Both methods are shown to estimate the population quantiles with accuracy and variability comparable to those of the sample quantiles. We use the quantile output of the extended quantile estimation algorithm to investigate various curve-fitting techniques, such as smoothing and interpolating cubic splines, in an attempt to obtain an estimate of the entire empirical cumulative distribution function in a functional form that we may work with analytically. Having obtained this functional form, we then address the problem of sequential density estimation. Two methods are investigated: sequential kernel density estimation, and density estimation through derivatives of the cubic spline fit. Both methods are shown to estimate the unknown underlying density accurately. The cubic spline derivative method has several advantages over the sequential kernel method, such as much less computation time and the ability to obtain estimates over the entire range of the dataset.

In dimensions of two or greater, quantiles have no direct analog. Instead, we study sequential versions of convex hull peeling algorithms as a way of finding and sequentially tracking the center and depth contours of a data cloud in two or more dimensions. We demonstrate the accuracy and reduced computation time of the proposed methods by comparing them with the existing convex hull peeling methods in simulation studies. We then use the contours output by these algorithms to estimate the values of the underlying bivariate density at the points on the contours. Finally, we fit a multi-dimensional spline to this three-dimensional set of points to obtain an estimate of the entire bivariate density in a functional form.
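
The sample-mean example from the abstract can be illustrated with a minimal Python sketch. It stores only the two quantities named there, the count of observed points and their running sum, and updates both as each point arrives. The class name `RunningMean`, its method names, and the sample stream are hypothetical choices made for illustration; they are not taken from the thesis.

```python
class RunningMean:
    """Sequential sample mean that stores only a count and a running sum."""

    def __init__(self):
        self.n = 0        # number of points observed so far (n' in the abstract)
        self.total = 0.0  # running sum of the observed points

    def update(self, x):
        """Fold one new observation into the two stored quantities."""
        self.n += 1
        self.total += x

    @property
    def mean(self):
        """Current sample mean; undefined before any data are seen."""
        if self.n == 0:
            raise ValueError("no data observed yet")
        return self.total / self.n


# Illustrative stream: each point is seen once and then discarded.
stream = [2.0, 4.0, 6.0, 8.0]
est = RunningMean()
for x in stream:
    est.update(x)

print(est.n, est.mean)
```

Storage stays constant however long the stream grows, which is the property the abstract asks of its estimators. Sample quantiles do not reduce to a fixed set of running totals in this way, which is why the abstract treats their sequential estimation as a separate problem.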
