【24h】

Space-Efficient Detection of Unusual Words

机译:节省空间的异常单词检测

获取原文

摘要

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of O(σ~2 log~2 n) bits, where n is the length of the string and σ is the size of the alphabet. The size of the stack is o(n) except for very large values of σ. We further improve the algorithm by removing its time dependency on σ, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
机译:根据IID或马尔可夫模型,检测文本中出现的所有字符串的频率比预期的频率高或低,这是字符串挖掘中的一个基本问题,但是当前的算法基于空间效率低或导致速度变慢的数据结构,并且当前的实现方式在实践中无法扩展到基因组或元基因组。在本文中,我们设计了一种基于字符串后缀树的算法,以仅使用基于Burrows-Wheeler变换构建的小型数据结构以及O(σ〜2 log〜2 n)位的堆栈,其中n是字符串的长度,σ是字母的大小。除了非常大的σ值之外,堆栈的大小为o(n)。我们通过以下方法进一步改进算法:消除对σ的时间依赖性,仅报告字符串的最大重复次数和最小稀有单词的子集,并检测并计分在字符串中未出现的候选代表性不足的字符串。我们的算法非常实用,可以直接在BWT上运行,因此可以立即将其应用于以这种形式可用的许多现有数据集,从而将字符串挖掘问题返回到可管理的规模。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号