
Repository of NSF Funded Publications and Data Sets: 'Back of Envelope' 15 year Cost Estimate


Abstract

In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size across all seven NSF directorates is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years.

The architecture of the data/paper repository is based on a hierarchical storage model that combines fast disk for rapid access with tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows of the kind used in university institutional repositories, which add metadata and ensure data integrity. The average fixed cost is approximately $0.90/GB over the 15-year span. Variable costs are estimated at a sliding scale of $150 to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency gains and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts.

The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years, with 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56.
This $167 million cost is a direct cost in that it does not include the federally allowable indirect cost return (ICR).

After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they have been replaced by more accurate results. Therefore, at some point the growth of data in the repository will need to be adjusted through strategic preservation.
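The headline figures above follow from simple arithmetic on the stated inputs. A minimal sanity-check sketch, assuming the decimal unit conversion 1 PB = 10^6 GB (a convention the abstract does not state explicitly):

```python
# Sanity check of the abstract's back-of-envelope arithmetic.
# All inputs are taken from the abstract; the only assumption
# here is the decimal conversion 1 PB = 1e6 GB.

PAPERS_PER_YEAR = 64_340      # NSF-funded papers per year, one dataset each
AVG_DATASET_GB = 32           # mean dataset size across the seven directorates
YEARS = 15
TOTAL_COST_USD = 167_000_000  # projected 15-year direct cost

yearly_pb = PAPERS_PER_YEAR * AVG_DATASET_GB / 1e6   # ~2 PB per year
total_pb = yearly_pb * YEARS                         # ~30 PB over 15 years
total_datasets = PAPERS_PER_YEAR * YEARS             # close to one million

# The abstract rounds the accumulated total to 30 PB when quoting
# its cost-per-gigabyte figure.
cost_per_gb = TOTAL_COST_USD / (30 * 1e6)

print(f"{yearly_pb:.2f} PB/year, {total_pb:.1f} PB over {YEARS} years")
print(f"{total_datasets:,} datasets, ${cost_per_gb:.2f}/GB")
```

The inputs reproduce the abstract's totals to within rounding: 64,340 papers at 32 GB each is about 2.06 PB/year and 30.9 PB over 15 years, and $167M spread over roughly 30 PB gives the quoted cost per gigabyte of about $5.56.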
