Published in: Frontiers in Physics

Outlier Mining Methods Based on Graph Structure Analysis

Abstract

Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph in which the nodes are the elements of the dataset and each link carries a weight equal to the distance between the two nodes it connects. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap nonlinear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; on the other hand, the IsoMap method has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
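
The percolation idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: Euclidean distance is used as the pairwise measure, links are removed from longest to shortest, and a node's score is the weight of the link whose removal detaches it from the largest connected component. The function name `percolation_scores` is hypothetical and this is not the authors' implementation.

```python
import math
from collections import deque
from itertools import combinations


def percolation_scores(points):
    """Hypothetical helper: outlier score per point via link-removal percolation."""
    n = len(points)
    # Complete weighted graph: every pair is linked, weight = Euclidean distance
    # (assumed here; any meaningful pairwise distance could be substituted).
    edges = sorted(
        ((math.dist(points[i], points[j]), i, j)
         for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    adj = {i: set(range(n)) - {i} for i in range(n)}
    giant = set(range(n))  # nodes still in the largest connected component
    scores = [0.0] * n

    for w, i, j in edges:  # remove links from longest to shortest
        adj[i].discard(j)
        adj[j].discard(i)
        if i not in giant or j not in giant:
            continue
        # Breadth-first search from i to check whether j is still reachable.
        seen, queue = {i}, deque([i])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        if j in seen:
            continue
        # The component split in two: the smaller part leaves the giant
        # component and is scored with the weight of the link just removed.
        small, big = sorted((seen, giant - seen), key=len)
        for k in small:
            scores[k] = w
        giant = big
        if len(giant) == 1:
            # The last remaining node takes the same final weight.
            scores[next(iter(giant))] = w
    return scores


# Five tightly clustered points plus one isolated point (index 5).
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (5, 5)]
scores = percolation_scores(points)
# Index of the highest-scoring point.
print(max(range(len(points)), key=scores.__getitem__))
```

The sketch takes no tunable parameters, in line with the parameter-free property the abstract claims for the percolation method. It recomputes reachability after each relevant link removal, so it is only suitable for small datasets.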
