DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

Zhu Guanghui; Guo Chen; Lu Le; Huang Zhi; Yuan Chunfeng; Gu Rong; Huang Yihua

首页> 外文期刊>Parallel Computing >DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

【24h】

DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

机译：DGST：在分布式数据并行平台上高效且可扩展的后缀树构造

获取原文

获取原文并翻译 | 示例

开具论文收录证明 >>

页面导航

摘要
著录项
引文网络
相似文献
相关主题

摘要

The suffix tree is a fundamental data structure for string processing. It is widely used in many important scenarios such as text processing, information retrieval, and bioinformatics. With the rapid growth of data volume, constructing the suffix tree for large-scale datasets is very time-consuming. To solve this problem, a number of MPI-based parallel algorithms were proposed, but they have limitations in fault tolerance and scalability for large-scale datasets. Recently, there are ever-increasing application demands on efficient algorithms for constructing the suffix tree for large-scale datasets on distributed data-parallel platforms, such as Hadoop and Spark. In this paper, we present DGST, which is an efficient and scalable algorithm for generalized suffix tree construction on distributed data-parallel platforms. DGST consists of two major stages: parallel sub-tree partitioning and parallel sub-tree construction. We first design a novel data partitioning strategy for both two stages in the data-parallel paradigm. Then, we propose an efficient sub-tree partitioning algorithm based on parallel frequency counting. To improve the load balance and amortize the disk I/O costs, we propose an efficient Bin-Packing and Number-Partitioning based task allocation strategy for the sub-tree construction. At the sub-tree construction stage, we further propose a novel data structure LCP-Range and a multi-way LCP-Merge sorting algorithm for parallel LCP array construction. The experimental results on Apache Spark reveal that DGST outperforms the state-of-the-art ERa algorithm with approximately 3 times speedup on both the DNA and English text datasets. Furthermore, DGST achieves near-linear data and node scalability. (C) 2019 Elsevier B.V. All rights reserved.

机译：后缀树是用于字符串处理的基本数据结构。它被广泛用于许多重要场景中，例如文本处理，信息检索和生物信息学。随着数据量的快速增长，为大型数据集构建后缀树非常耗时。为了解决这个问题，提出了许多基于MPI的并行算法，但是它们在大型数据集的容错性和可伸缩性方面存在局限性。近年来，对于在分布式数据并行平台（例如Hadoop和Spark）上为大型数据集构造后缀树的高效算法的应用需求不断增长。在本文中，我们提出了DGST，它是一种用于分布式数据并行平台上的通用后缀树构造的高效且可扩展的算法。 DGST包含两个主要阶段：并行子树分区和并行子树构建。我们首先为数据并行范例中的两个阶段设计了一种新颖的数据分区策略。然后，我们提出了一种基于并行频率计数的高效子树划分算法。为了提高负载平衡并分摊磁盘I / O成本，我们提出了一种有效的基于Bin-Packing和Number-Partitioning的任务分配策略，用于子树结构。在子树构建阶段，我们进一步提出了一种新颖的数据结构LCP范围和用于并行LCP阵列构建的多路LCP合并排序算法。在Apache Spark上进行的实验结果表明，DGST在DNA和英文文本数据集上的性能均比最新的ERa算法快3倍左右。此外，DGST实现了近乎线性的数据和节点可伸缩性。（C）2019 Elsevier B.V.保留所有权利。

著录项

来源
《Parallel Computing》 |2019年第9期|87-102|共16页
作者
Zhu Guanghui; Guo Chen; Lu Le; Huang Zhi; Yuan Chunfeng; Gu Rong; Huang Yihua;
展开▼
作者单位

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

展开▼
收录信息美国《科学引文索引》(SCI);美国《工程索引》(EI);
原文格式 PDF
正文语种 eng
中图分类
关键词
Suffix tree; Generalized suffix tree; Distributed computing; Data-parallel computing; Apache Spark;

机译：后缀树;广义后缀树;分布式计算;数据并行计算;Apache Spark;

相似文献

外文文献
中文文献
专利

1. DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms [J] . Zhu Guanghui, Guo Chen, Lu Le, Parallel Computing . 2019,第Sepa期

机译：DGST：分布式数据并行平台上的高效和可扩展后缀树施工
2. Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms [J] . Zhu Guanghui, Wang Qian, Tang Qiwei, IEEE Transactions on Parallel and Distributed Systems . 2019,第12期

机译：分布式数据并行平台上的高效且可扩展的功能依赖发现
3. ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform [J] . Bo Zhao1, Hucheng Zhou2, Guoqiang Li3, 大数据挖掘与分析(英文) . 2018,第001期

机译：ZenLDA：分布式数据并行平台上的大规模主题模型培训
4. Large-scale assembly of DNA strings and space-efficient construction of suffix trees [C] . S. Rao Kosaraju, Arthur L. Delcher Annual ACM symposium on Theory of computing;ACM symposium on Theory of computing . 1996

机译：DNA字符串的大规模组装和后缀树的空间高效构造
5. A Data-Parallel Approach for Efficient Resource Utilization in Distributed Serverless Deep Learning [D] . Assogba, Kevin Tunder Elom. 2020

机译：分布式无服务深度学习中有效资源利用的数据并行方法
6. Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes [O] . Matteo Comin, Montse Farreras -1

机译：并行连续流：用于全基因组的并行后缀树构建工具
7. Space-efficient construction of compressed suffix trees [O] . Nicola Prezza, Giovanna Rosone 2021

机译：空间高效建筑压缩后缀树

DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

摘要

著录项

引文网络

相似文献

相关主题

期刊订阅