首页> 外文期刊>ACM transactions on database systems >I/O Efficient Algorithms for Serial and Parallel Suffix Tree Construction
【24h】

I/O Efficient Algorithms for Serial and Parallel Suffix Tree Construction

机译:串行和并行后缀树构造的I / O高效算法

获取原文
获取原文并翻译 | 示例

摘要

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we have a scalable suffix tree construction algorithm. The past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting-for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this article, we will empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation-to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality exhibited by existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, we will show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be rearchitected to build a suffix tree in a tiled manner, allowing the execution to operate within a fixed main memory budget when indexing strings of any size. We will also present a parallel extension of our algorithm that is designed for massively parallel systems like the IBM Blue Gene. An experimental evaluation will show that the proposed approach affords an improvement of several orders of magnitude in serial performance when indexing large strings. Furthermore, the proposed parallel extension is shown to be scalable-it is now possible to index the entire human genome on a 1024 processor IBM Blue Gene system in under 15 minutes.
机译:在过去的三十年中,后缀树已成为字符串处理中的基本数据结构。但是,由于后缀树构造不能随输入字符串的大小很好地缩放,因此阻碍了它的广泛应用。随着数据收集和存储技术的进步,大字符串已无处不在,尤其是在涉及文本,时间序列和生物序列数据的新兴应用程序中。为了从这些进步中受益,我们必须拥有可扩展的后缀树构造算法。在过去的几年中,出现了几种基于磁盘的后缀树构造算法。但是,构建时间仍然令人生畏,例如,在具有2 GB物理内存的系统上,对整个人类基因组进行索引仍然需要30多个小时。在本文中,我们将凭经验证明和论证所有现有的后缀树构造算法都存在严重局限性-收集合理的磁盘I / O效率,被索引的输入字符串必须适合主存储器。此限制归因于现有后缀树构造算法所显示的局部性较差,并且抑制了顺序和并行可伸缩性。为了解决这个限制,我们将展示通过精心的算法设计,可以重新构造最简单的后缀树构造算法之一,以分块的方式构建后缀树,从而在索引字符串时允许执行在固定的主内存预算内进行任何大小。我们还将介绍我们算法的并行扩展,该算法是为大规模并行系统(如IBM Blue Gene)设计的。实验评估将表明,在索引大字符串时,所提出的方法可将串行性能提高几个数量级。此外,建议的并行扩展显示为可伸缩的,现在可以在15分钟内在1024个处理器的IBM Blue Gene系统上索引整个人类基因组。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号