Mining Distance-based Outliers from Large Databases in Any Metric Space


Abstract

Let R be a set of objects. An object o ∈ R is an outlier if there exist fewer than k objects in R whose distances to o are at most r. The values of k, r, and the distance metric are provided by a user at run time. The objective is to return all outliers with the smallest I/O cost. This paper considers a generic version of the problem, where no information is available for outlier computation except for objects' mutual distances. We prove an upper bound for the memory consumption which permits the discovery of all outliers by scanning the dataset 3 times. The upper bound turns out to be extremely low in practice, e.g., less than 1% of R. Since the actual memory capacity of a realistic DBMS is typically larger, we develop a novel algorithm, which integrates our theoretical findings with carefully designed heuristics that leverage the additional memory to improve I/O efficiency. Our technique reports all outliers by scanning the dataset at most twice (in some cases, even once), and significantly outperforms the existing solutions by a factor of up to an order of magnitude.
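
The outlier definition in the abstract can be illustrated with a brute-force check. The following Python sketch is not the paper's algorithm: it ignores I/O cost and memory bounds entirely and simply compares every pair of objects in memory. The function name `distance_based_outliers`, the helper `abs_diff`, and the sample data are hypothetical. It also assumes that an object does not count as its own neighbor, which the abstract does not specify.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def distance_based_outliers(
    objects: Sequence[T],
    k: int,
    r: float,
    dist: Callable[[T, T], float],
) -> List[T]:
    """Return every object with fewer than k other objects within distance r.

    Only pairwise distances are used, so any metric can be plugged in.
    Assumption: an object is not counted as a neighbor of itself.
    """
    outliers = []
    for i, o in enumerate(objects):
        neighbors = 0
        for j, p in enumerate(objects):
            if i != j and dist(o, p) <= r:
                neighbors += 1
                if neighbors >= k:
                    # o already has k neighbors, so it cannot be an outlier.
                    break
        if neighbors < k:
            outliers.append(o)
    return outliers


def abs_diff(a: float, b: float) -> float:
    """A simple metric on real numbers (hypothetical example metric)."""
    return abs(a - b)


# Four values cluster near 1.0; the value 5.0 has no neighbor within r = 0.5.
points = [1.0, 1.2, 1.1, 0.9, 5.0]
result = distance_based_outliers(points, k=2, r=0.5, dist=abs_diff)
print(result)
```

This nested loop needs a quadratic number of distance computations and assumes the whole dataset fits in memory, which is exactly the setting the paper moves away from by bounding memory and counting dataset scans.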


Bibliographic details

Source
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06); August 20-23, 2006; Philadelphia, PA (US) | 2006 | pp. 394-403 | 10 pages
Conference location: Philadelphia, PA (US)
Authors
Yufei Tao; Xiaokui Xiao; Shuigeng Zhou;
Author affiliation

Dept. of Computer Science and Engineering, Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong

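Conference organizer
Original format: PDF
Language: English
Chinese Library Classification: data processing, data processing systems; computing technology, computer technology
Keywords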
mining; outlier; metric data;


Similar literature

1. Outlier Mining in Medical Databases: An Application of Data Mining in Health Care Management to Detect Abnormal Values Presented In Medical Databases [J]. Varun Kumar, Dharminder Kumar, R.K. Singh. International Journal of Computer Science and Network Security. 2008, No. 8
2. Fast mining of distance-based outliers in high-dimensional datasets [J]. Amol Ghoting, Srinivasan Parthasarathy, Matthew Eric Otey. Data Mining and Knowledge Discovery. 2008, No. 3
3. Fast mining of distance-based outliers in high-dimensional datasets [J]. Ghoting A, Parthasarathy S, Otey ME. Data Mining and Knowledge Discovery. 2008, No. 3
4. Mining Distance-based Outliers from Large Databases in Any Metric Space [C]. Yufei Tao, Xiaokui Xiao, Shuigeng Zhou. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06); August 20-23, 2006; Philadelphia, PA (US). 2006
5. Empirical performance analysis of two algorithms for mining intentional knowledge of distance-based outliers [D]. Prasanthi, Enbamoorthy. 2005
6. Data mining application to healthcare fraud detection: a two-step unsupervised clustering method for outlier detection with administrative databases [O]. Michela Carlotta Massi, Francesca Ieva, Emanuele Lettieri. 2020
7. Mining distance-based outliers from large databases in any metric space [O]. Yufei Tao, Xiaokui Xiao, et al. 2006
