Journal: Concurrency and Computation: Practice and Experience

Provenance and data differencing for workflow reproducibility analysis



Abstract

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. Reproducing results is often not straightforward, because the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of ‘reproducibility’. Secondly, it describes a new algorithm, PDIFF, that compares workflow provenance traces to determine whether an experiment has been reproduced; the main innovation is that, if it has not, the specific point(s) of divergence are identified through graph analysis, assisting any researcher wishing to understand those differences. One key feature is support for user-defined, semantic data comparison operators. Finally, the paper describes an implementation of PDIFF that leverages the e-Science Central platform, which enacts workflows in the cloud. As well as automatically generating a provenance trace for consumption by PDIFF, the platform supports the storage and reuse of old versions of workflows, data and services; the paper shows how this can be exploited to achieve reproduction and reuse. Copyright © 2013 John Wiley & Sons, Ltd.
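The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of provenance differencing, not the published PDIFF algorithm. It assumes two traces with identical structure, stored as dictionaries keyed by invocation name; `Invocation`, `find_divergence`, `close_enough` and the sample data are all hypothetical names introduced for illustration. Starting from the final result, it walks upstream only along branches whose data differ, and reports the invocations whose outputs differ even though all of their inputs agree, or whose service version changed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Invocation:
    """One service invocation in a hypothetical provenance trace."""
    service_version: str
    output: Any
    output_type: str = "text"
    inputs: List[str] = field(default_factory=list)  # names of upstream invocations

Trace = Dict[str, Invocation]
Comparator = Callable[[Any, Any], bool]

def find_divergence(a: Trace, b: Trace, sink: str,
                    comparators: Dict[str, Comparator]) -> List[str]:
    """Return invocations where the two runs start to differ.

    Assumption: both traces contain the same invocation names and edges.
    """
    divergent: List[str] = []
    seen = set()

    def same(x: Invocation, y: Invocation) -> bool:
        if x.output_type != y.output_type:
            return False
        # User-defined comparator per data type; plain equality otherwise.
        compare = comparators.get(x.output_type, lambda p, q: p == q)
        return compare(x.output, y.output)

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        x, y = a[name], b[name]
        if same(x, y):
            return  # equivalent outputs: no need to look further upstream
        differing = [i for i in x.inputs if not same(a[i], b[i])]
        # Divergence starts here if every input agrees or the service changed.
        if not differing or x.service_version != y.service_version:
            divergent.append(name)
        for i in differing:
            visit(i)

    visit(sink)
    return divergent

def close_enough(p: List[float], q: List[float]) -> bool:
    """Example semantic comparator: numeric lists equal within a tolerance."""
    return len(p) == len(q) and all(abs(x - y) <= 1e-6 for x, y in zip(p, q))

# Two made-up runs of a three-step workflow; only `normalise` was updated.
run_a: Trace = {
    "load": Invocation("1.0", [1.0, 2.0], "numbers"),
    "normalise": Invocation("1.0", [0.5, 1.0], "numbers", ["load"]),
    "classify": Invocation("1.0", "low", "text", ["normalise"]),
}
run_b: Trace = {
    "load": Invocation("1.0", [1.0, 2.0], "numbers"),
    "normalise": Invocation("1.1", [0.25, 1.0], "numbers", ["load"]),
    "classify": Invocation("1.0", "high", "text", ["normalise"]),
}

print(find_divergence(run_a, run_b, "classify", {"numbers": close_enough}))
```

Passing comparators as a dictionary keyed by data type is one simple way to model user-defined comparison operators: a tolerance-based comparator lets two numerically close outputs count as a successful reproduction where strict equality would not.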
