首页> 外文期刊>Parallel Computing >DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms
【24h】

DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

机译:DGST:在分布式数据并行平台上高效且可扩展的后缀树构造

获取原文
获取原文并翻译 | 示例

摘要

The suffix tree is a fundamental data structure for string processing. It is widely used in many important scenarios such as text processing, information retrieval, and bioinformatics. With the rapid growth of data volume, constructing the suffix tree for large-scale datasets is very time-consuming. To solve this problem, a number of MPI-based parallel algorithms were proposed, but they have limitations in fault tolerance and scalability for large-scale datasets. Recently, there are ever-increasing application demands on efficient algorithms for constructing the suffix tree for large-scale datasets on distributed data-parallel platforms, such as Hadoop and Spark. In this paper, we present DGST, which is an efficient and scalable algorithm for generalized suffix tree construction on distributed data-parallel platforms. DGST consists of two major stages: parallel sub-tree partitioning and parallel sub-tree construction. We first design a novel data partitioning strategy for both two stages in the data-parallel paradigm. Then, we propose an efficient sub-tree partitioning algorithm based on parallel frequency counting. To improve the load balance and amortize the disk I/O costs, we propose an efficient Bin-Packing and Number-Partitioning based task allocation strategy for the sub-tree construction. At the sub-tree construction stage, we further propose a novel data structure LCP-Range and a multi-way LCP-Merge sorting algorithm for parallel LCP array construction. The experimental results on Apache Spark reveal that DGST outperforms the state-of-the-art ERa algorithm with approximately 3 times speedup on both the DNA and English text datasets. Furthermore, DGST achieves near-linear data and node scalability. (C) 2019 Elsevier B.V. All rights reserved.
机译:后缀树是用于字符串处理的基本数据结构。它被广泛用于许多重要场景中,例如文本处理,信息检索和生物信息学。随着数据量的快速增长,为大型数据集构建后缀树非常耗时。为了解决这个问题,提出了许多基于MPI的并行算法,但是它们在大型数据集的容错性和可伸缩性方面存在局限性。近年来,对于在分布式数据并行平台(例如Hadoop和Spark)上为大型数据集构造后缀树的高效算法的应用需求不断增长。在本文中,我们提出了DGST,它是一种用于分布式数据并行平台上的通用后缀树构造的高效且可扩展的算法。 DGST包含两个主要阶段:并行子树分区和并行子树构建。我们首先为数据并行范例中的两个阶段设计了一种新颖的数据分区策略。然后,我们提出了一种基于并行频率计数的高效子树划分算法。为了提高负载平衡并分摊磁盘I / O成本,我们提出了一种有效的基于Bin-Packing和Number-Partitioning的任务分配策略,用于子树结构。在子树构建阶段,我们进一步提出了一种新颖的数据结构LCP范围和用于并行LCP阵列构建的多路LCP合并排序算法。在Apache Spark上进行的实验结果表明,DGST在DNA和英文文本数据集上的性能均比最新的ERa算法快3倍左右。此外,DGST实现了近乎线性的数据和节点可伸缩性。 (C)2019 Elsevier B.V.保留所有权利。

著录项

  • 来源
    《Parallel Computing》 |2019年第9期|87-102|共16页
  • 作者单位

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

    Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210023, Jiangsu, Peoples R China;

  • 收录信息 美国《科学引文索引》(SCI);美国《工程索引》(EI);
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

    Suffix tree; Generalized suffix tree; Distributed computing; Data-parallel computing; Apache Spark;

    机译:后缀树;广义后缀树;分布式计算;数据并行计算;Apache Spark;

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号