Conference: TPC Technology Conference on Performance Evaluation and Benchmarking

Big-SeqDB-Gen: A Formal and Scalable Approach for Parallel Generation of Big Synthetic Sequence Databases


Abstract

The recognition that data has great economic value, together with significant hardware advances in low-cost data storage, high-speed networks, and high-performance parallel computing, fosters new research directions in large-scale knowledge discovery from big sequence databases. Many applications involve sequence databases, such as customer shopping sequences, web clickstreams, and biological sequences, and all of them are affected by the big data problem. Fast mining of billions of sequences is undoubtedly a challenge. However, because big data sets are not available, knowledge discovery algorithms cannot be assessed over big sequence databases: companies do not disclose their data, for both privacy and security reasons, and existing synthetic sequence generators are not up to the big data challenge. In this paper, we first propose a formal and scalable approach for Parallel Generation of Big Synthetic Sequence Databases. Based on Whitney numbers, the underlying Parallel Sequence Generator (i) creates billions of distinct sequences in parallel and (ii) ensures that injected sequential patterns satisfy user-specified sequence characteristics. Second, we report a scalability and scale-out performance study of the Parallel Sequence Generator for various sequence database sizes and various numbers of Sequence Generators in a shared-nothing cluster of nodes.
