PeerJ Computer Science

20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic cross-institutional collaboration

Abstract

Biodiversity information is made available through numerous databases, each with its own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor-intensive, error-prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. Meeting the challenge of linking diverse data takes more than technology. Concrete progress on finding relationships between biodiversity datasets requires new social collaborations like the Global Unified Open Data Architecture (GUODA), which combines the skills of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers. This paper discusses a technical solution developed by the GUODA collaboration for faster linking across databases, with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node high-performance computing cluster with about 192 threads, 12 TB of storage, and 288 GB of memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method added 119,957 Wikidata links to GloBI, an increase of 13.7% of all outgoing name links in GloBI. Wikidata and GloBI were compared to the Open Tree of Life Reference Taxonomy to examine consistency and coverage. Parsing the Wikidata, Open Tree of Life Reference Taxonomy, and GloBI archives and calculating consistency metrics took minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse technically minded people together with high-performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
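The idea of linking records through identifiers external to both systems, rather than through name strings, can be illustrated with a minimal sketch. This is not the GUODA implementation, which the abstract does not describe in detail. The function name `link_by_shared_identifiers`, the record layout, and all identifier values below are hypothetical and made up for illustration.

```python
from collections import defaultdict


def link_by_shared_identifiers(wikidata_items, globi_taxa):
    """Map each GloBI record to the Wikidata records that share an external id.

    Both arguments are dicts of the form {record_id: set of external ids}.
    This layout is an assumption for the sketch, not the real data format.
    """
    # Index Wikidata records by every external identifier they carry.
    by_external = defaultdict(set)
    for wd_id, external_ids in wikidata_items.items():
        for ext in external_ids:
            by_external[ext].add(wd_id)

    # A GloBI record is linked to any Wikidata record that shares
    # at least one external identifier with it.
    links = {}
    for globi_id, external_ids in globi_taxa.items():
        matches = set()
        for ext in external_ids:
            matches |= by_external.get(ext, set())
        if matches:
            links[globi_id] = sorted(matches)
    return links


# Made-up records; the identifier values are placeholders, not real ids.
wikidata_items = {
    "WD:A": {"NCBI:1", "ITIS:10"},
    "WD:B": {"NCBI:2"},
}
globi_taxa = {
    "GLOBI:x": {"ITIS:10", "EOL:100"},
    "GLOBI:y": {"EOL:200"},
}

links = link_by_shared_identifiers(wikidata_items, globi_taxa)
print(links)  # {'GLOBI:x': ['WD:A']}
```

Here `GLOBI:x` is linked to `WD:A` because both carry the placeholder identifier `ITIS:10`, while `GLOBI:y` stays unlinked because none of its identifiers appear on the Wikidata side. No taxonomic name is compared at any point.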