Journal: BMC Bioinformatics

Shared data science infrastructure for genomics data


Abstract

Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult because of significant barriers to organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired by existing languages for data-intensive computing, and Boag can easily integrate data from biological data repositories. As a proof of concept, Boa for genomics (Boag) has been implemented to analyze the metadata of RefSeq’s 153,848 annotation (GFF) and assembly (FASTA) files. Boag provides a massive improvement over existing solutions like Python and MongoDB by utilizing a domain-specific language that runs on Hadoop infrastructure, giving a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in the storage required for the raw data and a significant speedup when querying large datasets, owing to the automated parallelization and distribution that the Hadoop infrastructure provides during computation. In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, gives researchers greater ability to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.
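To make the kinds of questions listed in the abstract concrete, the following is a minimal sketch in plain Python. It is not Boag syntax and not the authors' implementation. The record fields (`accession`, `kingdom`, `total_length`, `assembler`, `year`) and the records themselves are hypothetical placeholders, not real RefSeq values. It only illustrates the shape of two of the queries: finding the largest and smallest genomes, and counting the most commonly used assembly program for one group.

```python
from collections import Counter

# Hypothetical, made-up metadata records for illustration only.
# Field names and values are assumptions, not actual RefSeq data.
records = [
    {"accession": "ASM_A", "kingdom": "bacteria", "total_length": 4_600_000,
     "assembler": "SPAdes", "year": 2017},
    {"accession": "ASM_B", "kingdom": "bacteria", "total_length": 1_800_000,
     "assembler": "SPAdes", "year": 2018},
    {"accession": "ASM_C", "kingdom": "bacteria", "total_length": 5_200_000,
     "assembler": "Velvet", "year": 2015},
    {"accession": "ASM_D", "kingdom": "animal", "total_length": 2_700_000_000,
     "assembler": "Canu", "year": 2019},
]


def largest_and_smallest(recs):
    """Return the accessions of the largest and smallest assemblies."""
    largest = max(recs, key=lambda r: r["total_length"])
    smallest = min(recs, key=lambda r: r["total_length"])
    return largest["accession"], smallest["accession"]


def most_common_assembler(recs, kingdom):
    """Return the most frequent assembler name within one kingdom, or None."""
    counts = Counter(r["assembler"] for r in recs if r["kingdom"] == kingdom)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    print(largest_and_smallest(records))
    print(most_common_assembler(records, "bacteria"))
```

In a single-machine script like this, every record is scanned in one process. According to the abstract, Boag's advantage is that the equivalent aggregation is parallelized and distributed automatically by the Hadoop infrastructure.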
