I/O Efficient Algorithms for Serial and Parallel Suffix Tree Construction

AMOL GHOTING; KONSTANTIN MAKARYCHEV


Abstract

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we have a scalable suffix tree construction algorithm. The past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting; for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this article, we will empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation: to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality exhibited by existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, we will show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be rearchitected to build a suffix tree in a tiled manner, allowing the execution to operate within a fixed main memory budget when indexing strings of any size. We will also present a parallel extension of our algorithm that is designed for massively parallel systems like the IBM Blue Gene. An experimental evaluation will show that the proposed approach affords an improvement of several orders of magnitude in serial performance when indexing large strings. Furthermore, the proposed parallel extension is shown to be scalable: it is now possible to index the entire human genome on a 1024-processor IBM Blue Gene system in under 15 minutes.
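For readers unfamiliar with the data structure, the following is a minimal sketch of naive suffix-trie construction (an uncompressed relative of the suffix tree) by inserting every suffix in turn. It is an illustration only: the abstract does not name the algorithm the authors rearchitect, and this sketch is not their tiled or parallel method. The function names `build_suffix_trie`, `contains`, and `count_leaves`, and the `$` terminator, are assumptions made for the example.

```python
def build_suffix_trie(text, terminator="$"):
    """Naive O(n^2) construction: insert every suffix of text + terminator.

    Nodes are plain nested dicts keyed by character. The terminator is
    assumed not to occur in `text`, so each suffix ends at its own leaf.
    """
    s = text + terminator
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Follow the existing child for ch, or create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` occurs as a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_leaves(node):
    """Count leaves; with a unique terminator there is one leaf per suffix."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


if __name__ == "__main__":
    trie = build_suffix_trie("banana")
    print(contains(trie, "nan"))   # substring query
    print(contains(trie, "nab"))
    print(count_leaves(trie))      # one leaf per suffix of "banana$"
```

Each insertion starts at the root and walks an unrelated path through the structure. On an in-memory toy input this is harmless, but it gives a rough intuition for why access patterns in suffix tree construction can have poor locality once the structure no longer fits in main memory.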


Bibliographic details

Source
ACM Transactions on Database Systems | 2010, Issue 4 | pp. 25:1-25:37 | 37 pages
Authors
AMOL GHOTING; KONSTANTIN MAKARYCHEV
Author affiliations

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598 (both authors)

Indexed in
Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format
PDF
Language
English
Keywords
suffix tree; parallel; external memory; disk-based; sequence indexing; genome indexing

