Space-Efficient Detection of Unusual Words

Abstract

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of O(σ² log² n) bits, where n is the length of the string and σ is the size of the alphabet. The size of the stack is o(n) except for very large values of σ. We further improve the algorithm by removing its time dependency on σ, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
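
To make the notion of a word occurring "more or less frequently than expected" under an IID model concrete, the following is a minimal Python sketch. It is not the BWT-based algorithm of the paper: it counts every length-k substring naively, so it uses far more space than the approach described above. The function name `iid_scores`, the fixed word length `k`, and the binomial variance approximation (which ignores self-overlapping occurrences) are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def iid_scores(text: str, k: int) -> dict[str, tuple[int, float, float]]:
    """Return {word: (observed, expected, z)} for each length-k substring.

    Hypothetical helper: a naive illustration of IID scoring, not the
    space-efficient method of the paper.
    """
    n = len(text)
    # Symbol probabilities estimated from the text itself (IID assumption).
    freq = Counter(text)
    prob = {c: freq[c] / n for c in freq}
    # Number of starting positions for a word of length k.
    positions = n - k + 1
    observed = Counter(text[i:i + k] for i in range(positions))

    scores = {}
    for word, count in observed.items():
        # Under IID, the probability of a word is the product of its symbols.
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p
        # Binomial approximation of the variance (assumption: ignores
        # the correlation between overlapping occurrences).
        variance = positions * p * (1.0 - p)
        z = (count - expected) / sqrt(variance) if variance > 0 else 0.0
        scores[word] = (count, expected, z)
    return scores


if __name__ == "__main__":
    for word, (obs, exp, z) in sorted(iid_scores("abababab", 2).items()):
        print(word, obs, round(exp, 3), round(z, 3))
```

A large positive z marks a candidate over-represented word and a large negative z an under-represented one. Words that never occur in the text, which the abstract also considers, do not appear in this table and would need to be enumerated separately.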

Bibliographic details

Source
International Symposium on String Processing and Information Retrieval; Workshop on Compression, Text, and Algorithms | 2015 | pp. 222-233 | 12 pages
Authors
Djamal Belazzougui; Fabio Cunial
Original format: PDF
