BioPig: a Hadoop-based analytic toolkit for large-scale sequence data

Nordberg Henrik; Bhatia Karan; Wang Kai; Wang Zhong

Abstract

Motivation: The recent revolution in sequencing technologies has led to exponential growth in sequence data. As a result, most current bioinformatics tools are becoming obsolete because they fail to scale with the data. To tackle this 'data deluge', we introduce the BioPig sequence analysis toolkit as one solution that scales with both data and computation.

Results: We built BioPig on the Apache Hadoop MapReduce system and the Pig data flow language. Compared with traditional serial and MPI-based algorithms, BioPig has three major advantages. First, BioPig's programmability greatly reduces development time for parallel bioinformatics applications. Second, tests with up to 500 Gb of sequence data demonstrate that it scales automatically with data size. Finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested on the Magellan system at the National Energy Research Scientific Computing Center and on the Amazon Elastic Compute Cloud. In summary, BioPig is a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.

Bibliographic record

Source: Bioinformatics, 2013, issue 23, 6 pages
Authors: Nordberg Henrik; Bhatia Karan; Wang Kai; Wang Zhong
Original format: PDF
Language: English
Chinese Library Classification: Bioengineering (biotechnology)
