Source: NIH literature, Genes

NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements

Abstract

A wealth of viral data sits untapped in publicly available metagenomic data sets, although it could be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprising over 40 participants from six countries assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset of 2953 SRA data sets (approximately 55 million contigs) was selected and further filtered for a minimum contig length of 1 kb. This resulted in 4.2 million contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered, and assigned metadata. Of these 4.2 million contigs, 360,000 were labeled with domains, and an additional subset of 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds. The main lessons were: (i) conservative assemblies of SRA data improve initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure facilitates their use by a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon.
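
The abstract mentions two steps that are easy to picture in code: filtering contigs to a minimum length of 1 kb, and using a wrapper script so that a tool with weak multicore support still occupies every core of a node. The following is a minimal Python sketch of that general pattern, not the hackathon's actual pipeline. All function names are hypothetical, the FASTA parsing is deliberately simple, and the command passed to `run_in_parallel` stands in for whatever single-threaded tool reads FASTA on standard input.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MIN_LENGTH = 1000  # 1 kb cutoff, as stated in the abstract


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def filter_by_length(records, min_length=MIN_LENGTH):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


def split_round_robin(records, n_parts):
    """Distribute records over at most n_parts non-empty chunks."""
    chunks = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        chunks[i % n_parts].append(record)
    return [c for c in chunks if c]


def to_fasta(records):
    """Serialize (header, sequence) pairs back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


def run_tool_on_chunk(command, records):
    """Feed one chunk to the tool on stdin and return its stdout."""
    result = subprocess.run(
        command, input=to_fasta(records),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_in_parallel(command, records, n_workers=None):
    """Run one copy of a single-threaded tool per chunk, concurrently.

    Threads suffice here because the real work happens in child processes.
    """
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_round_robin(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda c: run_tool_on_chunk(command, c), chunks))


if __name__ == "__main__":
    # Stand-in "tool": counts the FASTA records it receives on stdin.
    count_cmd = [sys.executable, "-c",
                 "import sys; print(sum(l.startswith('>') for l in sys.stdin))"]
    demo = ">a\n" + "A" * 1500 + "\n>b\n" + "C" * 500 + "\n"
    kept = filter_by_length(parse_fasta(demo))
    print(run_in_parallel(count_cmd, kept, n_workers=2))
```

Splitting the input and launching one process per chunk only works when the tool treats each contig independently. A tool that needs a global view of all sequences, such as a clustering step, would need a different strategy.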