Shared data science infrastructure for genomics data

Hamid Bagheri; Usha Muppirala; Rick E. Masonbrink; Andrew J. Severin; Hridesh Rajan


Abstract

Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories. As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in the storage required for the raw data and a significant speedup in querying large datasets, owing to the automated parallelization and distribution that the Hadoop infrastructure provides during computations. Innovative methods are required to keep pace with our ability to produce biological data. The Shared Data Science Infrastructure, Boag, gives researchers greater access to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.

Bibliographic details

Source: BMC Bioinformatics, 2019, Issue 1, 13 pages
Authors: Hamid Bagheri; Usha Muppirala; Rick E. Masonbrink; Andrew J. Severin; Hridesh Rajan
Original format: PDF
Classification (Chinese Library Classification): Biological sciences
Keywords: Shared Data Science Infrastructure; Domain-Specific Language; Boag; Genome Annotation
