
Scalable Robust Covariance and Correlation Estimates for Data Mining


Abstract

Covariance and correlation estimates have important applications in data mining. In the presence of outliers, classical estimates of covariance and correlation matrices are not reliable. A small fraction of outliers, in some cases even a single outlier, can distort the classical covariance and correlation estimates, making them virtually useless. That is, correlations for the vast majority of the data can be badly misreported; principal components transformations can be misleading; and multidimensional outlier detection via Mahalanobis distances can fail to detect outliers. There is a large statistical literature on robust covariance and correlation matrix estimates, with an emphasis on affine-equivariant estimators that possess high breakdown points and small worst-case biases. All such estimators have unacceptable exponential complexity in the number of variables and quadratic complexity in the number of observations. In this paper we focus on several variants of robust covariance and correlation matrix estimates with quadratic complexity in the number of variables and linear complexity in the number of observations. These estimators are based on several forms of pairwise robust covariance and correlation estimates. The estimators studied include two fast estimators based on coordinate-wise robust transformations embedded in an overall procedure recently proposed in [14]. We show that the estimators have attractive robustness properties, and give an example that uses one of the estimators in the new Insightful Miner data mining product.
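
To illustrate the general idea of a pairwise estimate built on a coordinate-wise robust transformation, here is a minimal Python sketch. It is not the paper's estimator and does not include the overall procedure of [14]. The function name `huberized_correlation`, the clipping constant `c=2.0`, and the choice of median/MAD standardization followed by clipping are all assumptions made for illustration. Each column is transformed independently, and the correlation matrix of the transformed data is then formed with one matrix product, so the cost grows linearly with the number of observations and quadratically with the number of variables.

```python
import numpy as np

def huberized_correlation(X, c=2.0):
    """Illustrative pairwise robust correlation (hypothetical helper).

    Each column is centered by its median, scaled by its MAD, and clipped
    to [-c, c]; the classical correlation of the clipped scores is returned.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    # 1.4826 makes the MAD consistent with the standard deviation at the normal
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)  # guard against zero scale
    Z = np.clip((X - med) / mad, -c, c)  # coordinate-wise transformation
    Z = Z - Z.mean(axis=0)
    S = Z.T @ Z / (len(Z) - 1)  # all pairwise covariances in one product
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# Simulated data with true correlation 0.9 and a single gross outlier
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
X = np.column_stack([x, y])
X[0] = [1000.0, -1000.0]

classical = np.corrcoef(X, rowvar=False)[0, 1]
robust = huberized_correlation(X)[0, 1]
print("classical:", classical)
print("robust:   ", robust)
```

In this simulated example the single outlier drives the classical correlation to a strongly negative value, while the clipped estimate stays close to the correlation of the bulk of the data. A matrix assembled from pairwise estimates like this is not guaranteed to be positive definite, nor is it affine equivariant.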
