首页> 外文期刊>Parallel Computing >DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms
【24h】

DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms

机译:DGST:分布式数据并行平台上的高效和可扩展后缀树施工

获取原文
获取原文并翻译 | 示例
获取外文期刊封面目录资料

摘要

The suffix tree is a fundamental data structure for string processing. It is widely used in many important scenarios such as text processing, information retrieval, and bioinformatics. With the rapid growth of data volume, constructing the suffix tree for large-scale datasets is very time-consuming. To solve this problem, a number of MPI-based parallel algorithms were proposed, but they have limitations in fault tolerance and scalability for large-scale datasets. Recently, there are ever-increasing application demands on efficient algorithms for constructing the suffix tree for large-scale datasets on distributed data-parallel platforms, such as Hadoop and Spark. In this paper, we present DGST, which is an efficient and scalable algorithm for generalized suffix tree construction on distributed data-parallel platforms. DGST consists of two major stages: parallel sub-tree partitioning and parallel sub-tree construction. We first design a novel data partitioning strategy for both two stages in the data-parallel paradigm. Then, we propose an efficient sub-tree partitioning algorithm based on parallel frequency counting. To improve the load balance and amortize the disk I/O costs, we propose an efficient Bin-Packing and Number-Partitioning based task allocation strategy for the sub-tree construction. At the sub-tree construction stage, we further propose a novel data structure LCP-Range and a multi-way LCP-Merge sorting algorithm for parallel LCP array construction. The experimental results on Apache Spark reveal that DGST outperforms the state-of-the-art ERa algorithm with approximately 3 times speedup on both the DNA and English text datasets. Furthermore, DGST achieves near-linear data and node scalability. (C) 2019 Elsevier B.V. All rights reserved.
机译:后缀树是字符串处理的基本数据结构。它广泛用于许多重要的场景,如文本处理,信息检索和生物信息学。随着数据量的快速增长,构建大规模数据集的后缀树非常耗时。为了解决这个问题,提出了许多基于MPI的并行算法,但它们对大规模数据集的容错和可扩展性具有限制。最近,在分布式数据并行平台(例如Hadoop和Spark)上构建用于构建大规模数据集的后缀树的高效算法,有不断增加的应用需求。在本文中,我们呈现DGST,这是分布式数据并行平台上的广义后缀树施工的有效和可扩展的算法。 DGST由两个主要阶段组成:并行子树分区和并行子树构造。我们首先为数据并行范式的两个阶段设计一种新的数据分区策略。然后,我们提出了一种基于并行频率计数的高效的子树分区算法。为了提高负载余额并摊销磁盘I / O成本,我们提出了一种用于子树构造的基于宾馆的基于垃圾包和数字分区的任务分配策略。在子树施工阶段,我们进一步提出了一种新的数据结构LCP范围和用于并行LCP阵列构造的多路LCP合并分选算法。 Apache Spark上的实验结果表明,DGST优于最先进的时代算法,在DNA和英文文本数据集中大约加速了大约3次。此外,DGST实现了近线性数据和节点可扩展性。 (c)2019 Elsevier B.v.保留所有权利。

著录项

  • 来源
    《Parallel Computing》 |2019年第9期|87-102|共16页
  • 作者单位

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

    Nanjing Univ Natl Key Lab Novel Software Technol Nanjing 210023 Jiangsu Peoples R China;

  • 收录信息 美国《科学引文索引》(SCI);美国《工程索引》(EI);
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

    Suffix tree; Generalized suffix tree; Distributed computing; Data-parallel computing; Apache Spark;

    机译:后缀树;广义后缀树;分布式计算;数据并行计算;Apache Spark;

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号