Study of cache performance in distributed environment for data processing

Abstract

Processing data in a distributed environment has found application in many fields of science (Nuclear and Particle Physics (NPP), astronomy, and biology, to name a few). Efficiently transferring data between sites is an essential part of such processing. Implementing caching strategies in data transfer software and tools, such as the Reasoner for Intelligent File Transfer (RIFT) being developed in the STAR collaboration, can significantly decrease network load and waiting time by reusing knowledge of data provenance, as well as data already placed in the transfer cache, to expand the set of available sources for files and data-sets. Although a great variety of caching algorithms is known, a study is needed to evaluate which one delivers the best data access performance under realistic demand patterns. Records of access to the complete data-sets of NPP experiments were analyzed and used as input for computer simulations. A series of simulations was run to estimate the achievable cache hits and cache hits per byte for known caching algorithms. The simulations covered cache sizes in the range 0.001-90% of the complete data-set and low-watermarks in the range 0-90%. To validate the results, records of data access were taken from several experiments and from different time intervals. In this paper, we discuss the different data caching strategies, from canonical algorithms to hybrid cache strategies, present the results of our simulations for the various algorithms, and identify the best algorithm in the context of physics data analysis in NPP. While the results of these studies have been implemented in RIFT, they can also be used when setting up a cache in any other computational work-flow (Cloud processing, for example) or when managing data storage that holds partial replicas of the entire data-set.
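
The kind of simulation the abstract describes can be illustrated with a minimal sketch. It is not the authors' simulator and not RIFT code: it replays a hypothetical access trace through a plain LRU cache, one of the canonical algorithms, and reports two ratios. Several points are assumptions made only for this example: the function name `simulate_lru`, the trace format of `(file_id, size)` pairs, reading "cache hits per byte" as the fraction of requested bytes served from the cache, and the cleanup rule, in which a full cache evicts least recently used files until its occupancy falls to the low-watermark fraction of its capacity.

```python
from collections import OrderedDict


def simulate_lru(accesses, capacity, low_watermark=0.0):
    """Replay (file_id, size) accesses through an LRU cache.

    Returns (hit_ratio, byte_hit_ratio). The watermark semantics are an
    assumption of this sketch: cleanup starts when a new file does not
    fit and stops once occupancy is at or below low_watermark * capacity
    (and the new file fits).
    """
    cache = OrderedDict()  # file_id -> size, least recently used first
    used = 0
    hits = hit_bytes = requests = total_bytes = 0

    for file_id, size in accesses:
        requests += 1
        total_bytes += size

        if file_id in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(file_id)  # mark as most recently used
            continue

        if size > capacity:
            continue  # the file can never fit, so it is not cached

        if used + size > capacity:
            # Evict down to the low watermark, but always far enough
            # that the incoming file fits.
            target = min(low_watermark * capacity, capacity - size)
            while cache and used > target:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size

        cache[file_id] = size
        used += size

    hit_ratio = hits / requests if requests else 0.0
    byte_hit_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_ratio, byte_hit_ratio


if __name__ == "__main__":
    # Hypothetical trace: file "a" is large, the others are small.
    trace = [("a", 4), ("b", 1), ("a", 4), ("c", 1), ("b", 1)]
    for watermark in (0.0, 0.5, 0.9):
        print(watermark, simulate_lru(trace, capacity=5, low_watermark=watermark))
```

Sweeping `capacity` and `low_watermark` over a grid and repeating the replay for other eviction policies gives a comparison of the kind the abstract outlines. The two ratios can differ noticeably when file sizes vary, which is why both are worth tracking.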

Bibliographic details

Source
Workshop on Advanced Computing and Analysis Techniques in Physics Research | 2014 | 8 pages
Authors
Dzmitry Makatun; Jerome Lauret; Michal Sumbera
Original format: PDF
Chinese Library Classification: O562-532
Keywords
cache performance; distributed environment; data processing
