Mining Distance-based Outliers from Large Databases in Any Metric Space


Abstract

Let R be a set of objects. An object o ∈ R is an outlier if there exist fewer than k objects in R whose distances to o are at most r. The values of k, r, and the distance metric are provided by the user at run time. The objective is to return all outliers with the smallest I/O cost. This paper considers a generic version of the problem, where no information is available for outlier computation except for the objects' mutual distances. We prove an upper bound on the memory consumption that permits the discovery of all outliers by scanning the dataset 3 times. The upper bound turns out to be extremely low in practice, e.g., less than 1% of R. Since the actual memory capacity of a realistic DBMS is typically larger, we develop a novel algorithm that integrates our theoretical findings with carefully designed heuristics that leverage the additional memory to improve I/O efficiency. Our technique reports all outliers by scanning the dataset at most twice (in some cases, even once), and significantly outperforms the existing solutions by a factor of up to an order of magnitude.
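To make the outlier definition concrete, the following is a minimal in-memory sketch of a naive nested-loop check. It is not the paper's algorithm: it ignores I/O cost and memory bounds entirely and only illustrates the definition. The names `distance_outliers` and `euclidean` are hypothetical, and the sketch assumes that an object does not count as its own neighbour, which the abstract does not specify.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def distance_outliers(
    objects: Sequence[T],
    k: int,
    r: float,
    dist: Callable[[T, T], float],
) -> List[T]:
    """Return every object with fewer than k neighbours within distance r.

    Naive O(n^2) baseline for illustration only; `dist` can be any
    user-supplied metric, since only mutual distances are used.
    """
    outliers = []
    for i, o in enumerate(objects):
        neighbours = 0
        for j, p in enumerate(objects):
            if i == j:
                continue  # assumption: an object is not its own neighbour
            if dist(o, p) <= r:
                neighbours += 1
                if neighbours >= k:
                    break  # o already has k neighbours, so it is not an outlier
        if neighbours < k:
            outliers.append(o)
    return outliers


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, used here as one example of a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
print(distance_outliers(points, k=2, r=1.5, dist=euclidean))
# prints [(10.0, 10.0)]
```

Because the function takes the metric as a parameter, the same check works for strings, sets, or any other objects for which a distance function is available.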
