Journal: ACM Transactions on Knowledge Discovery from Data

Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking

Abstract

Outlier detection in text data collections has become important because of the need to find anomalies across a myriad of text data sources. High feature dimensionality, together with the large size of these document collections, calls for outlier detection methods that are both accurate and efficient. When applied to text data, traditional outlier detection methods face several challenges, including data sparseness, distance concentration, and the presence of a large number of sub-groups. In this article, we address these issues with three novel concepts: representing documents by rare document frequency, finding a ranking-based neighborhood for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the primary method, which is based on rare document frequency, we present several novel ensemble approaches that use the ranking concept to reduce false identifications while finding a higher number of true outliers. Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories and are scalable compared with the relevant benchmark methods.
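
The abstract does not spell out the algorithm, so the following is only a toy sketch of the intuition behind rare document frequency, not the authors' method. It assumes that a term is "rare" when it appears in at most `max_df` documents, and that a document whose vocabulary is dominated by rare terms is a candidate outlier. The function names, the threshold, and the scoring rule are all hypothetical choices made for illustration.

```python
from collections import Counter


def rare_term_scores(docs, max_df=1):
    """Score each document by the share of its distinct terms that are rare.

    Assumption (not from the paper): a term is rare when it occurs in at
    most `max_df` documents of the corpus.
    """
    # Distinct lowercase tokens per document (naive whitespace tokenizer).
    tokenized = [set(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in tokenized for term in terms)
    scores = []
    for terms in tokenized:
        if not terms:
            scores.append(0.0)  # empty documents get a neutral score
            continue
        rare = sum(1 for term in terms if df[term] <= max_df)
        scores.append(rare / len(terms))
    return scores


def rank_outliers(docs, max_df=1):
    """Return document indices ordered from most to least outlier-like."""
    scores = rare_term_scores(docs, max_df)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


corpus = [
    "stock market prices rise",
    "stock market prices fall",
    "market prices rise again",
    "penguin colony migrates south",
]

# The fourth document shares no terms with the others, so it ranks first.
print(rank_outliers(corpus))
```

In this toy corpus every term of the last document occurs in only one document, so its score is 1.0, while the three finance-related documents score 0.25 or less. A realistic pipeline would need proper tokenization, a tuned rarity threshold, and the neighborhood and ensemble steps that the abstract mentions but this sketch omits.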