Journal: International Journal of Digital Curation

Amplifying Data Curation Efforts to Improve the Quality of Life Science Data

Abstract

In the era of data science, datasets are shared widely and used for many purposes unforeseen by the original creators of the data. In this context, defects in datasets can have far-reaching consequences, spreading from dataset to dataset, and affecting the consumers of data in ways that are hard to predict or quantify. Some form of waste is often the result. For example, scientists using defective data to propose hypotheses for experimentation may waste their limited wet lab resources chasing the wrong experimental targets. Scarce drug trial resources may be used to test drugs that actually have little chance of giving a cure. Because of the potential real-world costs, database owners care about providing high quality data. Automated curation tools can be used to an extent to discover and correct some forms of defect. However, in some areas human curation, performed by highly trained domain experts, is needed to ensure that the data represents our current interpretation of reality accurately. Human curators are expensive, and there is far more curation work to be done than there are curators available to perform it. Tools and techniques are needed to enable the full value to be obtained from the curation effort currently available. In this paper, we explore one possible approach to maximising the value obtained from human curators, by automatically extracting information about data defects and corrections from the work that the curators do. This information is packaged in a source-independent form, to allow it to be used by the owners of other databases (for which human curation effort is not available or is insufficient). This amplifies the efforts of the human curators, allowing their work to be applied to other sources, without requiring any additional effort or change in their processes or tool sets. We show that this approach can discover significant numbers of defects, which can also be found in other sources.
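The abstract does not say how the defect and correction information is extracted or packaged. As a rough illustration only, the sketch below assumes one plausible route: comparing two snapshots of a curated database, recording each changed value as a source-independent (field, defective value, corrected value) triple, and then looking for the same defective values in a different source. The `Correction` type, both function names, and the sample records are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, NamedTuple, Tuple


class Correction(NamedTuple):
    """Hypothetical source-independent record of one curator fix."""
    field: str
    defective_value: str
    corrected_value: str


Record = Dict[str, str]
Snapshot = Dict[str, Record]  # record id -> field values


def extract_corrections(before: Snapshot, after: Snapshot) -> List[Correction]:
    """Collect value changes between two snapshots of a curated database.

    Assumption: any changed value is a curator correction. Record ids are
    dropped so the result does not depend on the originating source.
    """
    found = set()
    for record_id, old in before.items():
        new = after.get(record_id)
        if new is None:
            continue  # record removed; not treated as a value correction here
        for field, old_value in old.items():
            new_value = new.get(field)
            if new_value is not None and new_value != old_value:
                found.add(Correction(field, old_value, new_value))
    return sorted(found)


def find_defects(
    target: Snapshot, corrections: List[Correction]
) -> List[Tuple[str, Correction]]:
    """Flag records in another source that still hold a known defective value."""
    hits = []
    for record_id, record in sorted(target.items()):
        for c in corrections:
            if record.get(c.field) == c.defective_value:
                hits.append((record_id, c))
    return hits


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    before = {"r1": {"organism": "Homo sapien"}, "r2": {"organism": "Mus musculus"}}
    after = {"r1": {"organism": "Homo sapiens"}, "r2": {"organism": "Mus musculus"}}
    other_source = {"x9": {"organism": "Homo sapien"}}

    corrections = extract_corrections(before, after)
    for record_id, c in find_defects(other_source, corrections):
        print(record_id, c.field, c.defective_value, "->", c.corrected_value)
```

Exact string matching keeps the sketch short. A real system would likely need to distinguish genuine corrections from routine updates and to map field names between sources, neither of which is attempted here.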