Journal: Bioinformatics

BioPig: a Hadoop-based analytic toolkit for large-scale sequence data

Abstract

Motivation: The recent revolution in sequencing technologies has led to an exponential growth of sequence data. As a result, most current bioinformatics tools have become obsolete because they fail to scale with the data. To tackle this 'data deluge', we introduce the BioPig sequence analysis toolkit as one of the solutions that scale to data and computation. Results: We built BioPig on Apache's Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb of sequences demonstrates that it scales automatically with the size of the data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested on the Magellan system at the National Energy Research Scientific Computing Center and on the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.