Venue: IEEE International Conference on e-Science

Datatrack: An R package for managing data in a multi-stage experimental workflow: data versioning and provenance considerations in interactive scripting

Abstract

In experimental research using computation, a workflow is a sequence of data processing or analysis steps in which the output of one step may be used as the input of another. The processing steps may involve user-supplied parameters that, when modified, result in a new version of the input to the downstream steps, which in turn generate new versions of their own output. As more experimentation is done, the results of these various steps can become numerous. It is important to keep track of which data output depends on which other generated data, and which parameters were used. In many situations, scientific workflow management systems solve this problem, but these systems are best suited to collaborative, distributed experiments using a variety of services, possibly with batch-processed parameter sweeps. This paper presents an R package for managing and navigating a network of interdependent data. It is intended as a lightweight tool that gives the experimenter some visual data provenance information, so that they can manage their generated data as they run experiments within their familiar scripting environment, where it may not be desirable to commit to a full-blown, comprehensive workflow manager. The package consists of wrapper functions for writing and reading output data, which can be called from within R analysis scripts, and a visualization of the data-output dependency graph rendered within the RStudio console. It thus benefits the experimenter while requiring minimal commitment to integrate into their existing working environment.
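
The idea of wrapper functions that record parameters and dependencies whenever data is written can be sketched as follows. This is not Datatrack's API, which the abstract does not show, and it is written in Python rather than R only to keep the example self-contained. The names `DataStore`, `write`, `read`, and `lineage` are hypothetical, and the hashing scheme for version identifiers is an assumption made for illustration.

```python
import hashlib
import json
import pickle
import tempfile
from pathlib import Path


class DataStore:
    """Hypothetical store that saves each output with its parameters and parents."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.index_path = self.root / "index.json"
        # The index maps a version id to its step name, parameters and parents.
        if self.index_path.exists():
            self.index = json.loads(self.index_path.read_text())
        else:
            self.index = {}

    def write(self, name, data, params=None, parents=()):
        """Save data and return a version id derived from its provenance."""
        params = params or {}
        parents = sorted(parents)
        # Assumption: the id hashes the step name, parameters and parent ids,
        # so a changed upstream parameter yields new downstream versions.
        key = json.dumps(
            {"name": name, "params": params, "parents": parents}, sort_keys=True
        )
        version = f"{name}-{hashlib.sha1(key.encode()).hexdigest()[:8]}"
        (self.root / f"{version}.pkl").write_bytes(pickle.dumps(data))
        self.index[version] = {"name": name, "params": params, "parents": parents}
        self.index_path.write_text(json.dumps(self.index, indent=2))
        return version

    def read(self, version):
        """Load the data stored under a version id."""
        return pickle.loads((self.root / f"{version}.pkl").read_bytes())

    def lineage(self, version):
        """Return every ancestor version id, nearest ancestors first."""
        seen = []
        stack = list(self.index[version]["parents"])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self.index[current]["parents"])
        return seen


# Usage: one raw input, two versions of a cleaning step, one downstream summary.
store = DataStore(tempfile.mkdtemp())
raw = store.write("raw", [3, 1, 2])
clean_a = store.write(
    "clean", sorted(store.read(raw)), params={"reverse": False}, parents=[raw]
)
clean_b = store.write(
    "clean", sorted(store.read(raw), reverse=True), params={"reverse": True}, parents=[raw]
)
total = store.write("summary", sum(store.read(clean_b)), parents=[clean_b])
print(store.lineage(total))
```

Because the version id depends on the parameters and on the parent ids, rerunning the cleaning step with a different parameter produces a distinct version, and anything computed from it inherits a distinct id as well. A dependency graph like the one the abstract describes could be drawn from the same index.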