International Journal of Digital Curation

Revisiting the Data Lifecycle with Big Data Curation



Abstract

As science becomes more data-intensive and collaborative, researchers increasingly use larger and more complex data to answer research questions. The capacity of storage infrastructure, the increased sophistication and deployment of sensors, the ubiquitous availability of computer clusters, the development of new analysis techniques, and larger collaborations allow researchers to address grand societal challenges in a way that is unprecedented. In parallel, research data repositories have been built to host research data in response to the requirements of sponsors that research data be publicly available. Libraries are re-inventing themselves to respond to a growing demand to manage, store, curate, and preserve the data produced in the course of publicly funded research. As librarians and data managers develop the tools and knowledge they need to meet these new expectations, they inevitably encounter conversations around Big Data. This paper explores definitions of Big Data that have coalesced over the last decade around four commonly mentioned characteristics: volume, variety, velocity, and veracity. We highlight the issues associated with each characteristic, particularly their impact on data management and curation. Using the methodological framework of the data life cycle model, we assess two models developed in the context of Big Data projects and find them lacking. We propose a Big Data life cycle model that includes activities focused on Big Data and that more closely integrates curation with the research life cycle. These activities include planning, acquiring, preparing, analyzing, preserving, and discovering, with describing the data and assuring quality being an integral part of each activity. We discuss the relationship between institutional data curation repositories and new long-term data resources associated with high-performance computing centers, as well as reproducibility in computational science. We apply this model by mapping the four characteristics of Big Data outlined above to each of the activities in the model. This mapping produces a set of questions that practitioners should be asking in a Big Data project.