Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking

Mohotti Wathsala Anupama; Nayak Richi

Abstract

Outlier detection in text data collections has become significant due to the need to find anomalies in the myriad of text data sources. High feature dimensionality, together with the large size of these document collections, creates a need for accurate outlier detection methods with high efficiency. Traditional outlier detection methods face several challenges when dealing with text data, including data sparseness, distance concentration, and the presence of a large number of sub-groups. In this article, we propose to address these issues by developing novel concepts such as presenting documents by their rare document frequency, finding a ranking-based neighborhood for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the proposed primary method based on rare document frequency, we present several novel ensemble approaches that use the ranking concept to reduce false identifications while finding a higher number of true outliers. Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories and are scalable compared to the relevant benchmarking methods.
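The abstract does not spell out how rare document frequency is computed, so the following is only a rough illustration of the general idea that documents dominated by rarely occurring terms can be flagged as outlier candidates. It is not the authors' algorithm: the function name `rarity_scores`, the use of mean inverse document frequency as the score, and the toy corpus are all assumptions made for this sketch.

```python
import math
from collections import Counter


def rarity_scores(docs):
    """Score each tokenized document by the mean IDF of its distinct terms.

    Hypothetical helper for illustration only; a higher score means the
    document is made up of terms that appear in fewer documents.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Plain inverse document frequency, log(N / df).
    idf = {term: math.log(n / count) for term, count in df.items()}
    scores = []
    for doc in docs:
        terms = set(doc)
        scores.append(sum(idf[t] for t in terms) / len(terms) if terms else 0.0)
    return scores


# Toy corpus (invented): three finance snippets and one unrelated document.
docs = [
    "stock market prices rise".split(),
    "stock market prices fall".split(),
    "market prices rise again".split(),
    "volcano erupts near village".split(),
]
scores = rarity_scores(docs)
# Index of the document with the highest rarity score.
outlier = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus the unrelated fourth document shares no terms with the others and therefore receives the highest score. The ranking-based neighborhoods and ensemble variants mentioned in the abstract are not modelled here.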

Bibliographic record

Source
ACM Transactions on Knowledge Discovery from Data | 2020, Issue 6 | pp. 71.1-71.30 | 30 pages
Authors
Mohotti Wathsala Anupama; Nayak Richi
Affiliations

Queensland Univ Technol GPO Box 2434 Brisbane Qld Australia|Univ Ruhuna Dept Comp Sci Matara Sri Lanka;

Queensland Univ Technol GPO Box 2434 Brisbane Qld Australia;

Original format: PDF
Language: English
Keywords
Outlier detection; high dimensional data; k-occurrences; ranking function; term-weighting;

Similar literature

1. Efficient population assignment and outlier detection in human populations using biallelic markers chosen by principal component-based rankings. [J]. Raaum RL, Wang AB, Al-Meeri AM. BioTechniques. 2010, Issue 6
2. Consensus Outlier Detection Using Sum of Ranking Differences of Common and New Outlier Measures Without Tuning Parameter Selections (vol 89, pg 5087, 2017) [J]. Brownfield Brett, Kalivas John H. Analytical Chemistry. 2017, Issue 17
3. The Natural Stories corpus: a reading-time corpus of English texts containing rare syntactic constructions [J]. Futrell Richard, Gibson Edward, Tily Harry J. Language Resources and Evaluation. 2021, Issue 1
4. Analyzing Malay Stemmer Performance Towards Fuzzy Logic Ranking Function on Malay Text Corpus [C]. Shaiful Bakhtiar Rodzman, Mohamad Fitri Izuan Abdul Ronie, Normaly Kamal Ismail. International Conference on Information Retrieval and Knowledge Management. 2018
5. Toward accurate and efficient outlier detection in high dimensional and large data sets. [D]. Nguyen, Minh Quoc. 2010
6. Ranking cancer drivers via betweenness-based outlier detection and random walks [O]. Cesim Erten, Aissa Houdjedj, Hilal Kazan. 2021
7. Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking [O]. Wathsala Anupama Mohotti, Richi Nayak. 2020
