首页> 外文会议>Workshop on Algorithm Engineering and Experiments >Engineering External Memory Induced Suffix Sorting
【24h】

Engineering External Memory Induced Suffix Sorting

机译:工程外部记忆引起后缀排序

获取原文

摘要

Suffix sorting - determining the lexicographical order of all the suffixes of a string - is one of the most important problems in string processing. The resulting data structure is called the suffix array (SA) and underpins dozens of applications in bioinformatics, data compression, and information retrieval. When the size of the input string or the SA exceeds that of internal memory (RAM), an external memory (EM) suffix sorting algorithm must be used. The most scalable of these EM methods is due to Bingmann et al. (Proc. ALENEX 2013), and is essentially a careful disk-based implementation of the so-called induced sorting technique used by the fastest RAM suffix sorting algorithms. In this paper we show how to greatly improve the efficiency of induced suffix sorting in external memory via a non-trivial reorganization of the computation involved. Our experiments show this new approach to be twice as fast as state-of-the-art methods, while, just as significantly, using a third of the disk memory. We also demonstrate the efficacy of our implementation for handling strings on large alphabets (with many millions of distinct symbols), which is important, e.g., for applications in natural language processing and information retrieval, but unaddressed by previous EM suffix sorting implementations. Our implementation uses a (EM) radix heap data structure and, as a side result of independent interest, we introduce a new operation for radix heaps and other monotone priority queues called min-comp, which we believe to be useful for many other applications, including discrete event simulation and sweep line algorithms, even in internal memory.
机译:后缀排序 - 确定字符串的所有后缀的词典顺序 - 是字符串处理中最重要的问题之一。生成的数据结构称为后缀阵列(SA),并在生物信息学,数据压缩和信息检索中为数十个应用程序。当输入字符串或SA的大小超过内部存储器(RAM)时,必须使用外部存储器(EM)后缀排序算法。这些EM方法的最可扩展是由于Bingmann等人。 (Proc。Alenex 2013),并且基本上是基于磁盘的基于磁盘的实现最快的RAM后缀分类算法使用的所谓的诱导分类技术。在本文中,我们展示了如何通过涉及的计算的非琐事重组来大大提高外部存储器中的诱导后缀排序的效率。我们的实验表明,这种新方法是最先进的方法的两倍,而使用三分之一的磁盘内存的显着。我们还展示了我们实现大字母表(具有数百万个不同符号)的串的实现的效果,这很重要,例如,用于自然语言处理和信息检索的应用,但是通过以前的EM后缀排序实现而无法解决。我们的实现使用(EM)基数堆数据结构,作为独立兴趣的副作用,我们为CADIX堆和其他称为MIN-COMP的单调优先级队列的新操作,我们认为对许多其他应用有用,包括离散事件仿真和扫描线算法,即使在内部存储器中。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号