Workshop on Advanced Computing and Analysis Techniques in Physics Research

Study of cache performance in distributed environment for data processing

Abstract

Processing data in a distributed environment has found applications in many fields of science (Nuclear and Particle Physics (NPP), astronomy, and biology, to name a few). Efficiently transferring data between sites is an essential part of such processing. Implementing caching strategies in data transfer software and tools, such as the Reasoner for Intelligent File Transfer (RIFT) being developed in the STAR collaboration, can significantly decrease network load and waiting time: knowledge of data provenance, together with the data already placed in the transfer cache, expands the set of available sources for files and data-sets. Although a great variety of caching algorithms is known, a study is needed to evaluate which one delivers the best data-access performance under realistic demand patterns. Records of access to the complete data-sets of NPP experiments were analyzed and used as input for computer simulations. A series of simulations was run to estimate the achievable cache hits and cache hits per byte for known caching algorithms. The simulations covered cache sizes in the range 0.001-90% of the complete data-set and low-watermark values in the range 0-90%. To validate the results, records of data access were taken from several experiments and from different time intervals. In this paper, we discuss data caching strategies ranging from canonical algorithms to hybrid cache strategies, present the results of our simulations for the various algorithms, and identify the best algorithm in the context of physics data analysis in NPP. While the results of these studies have been implemented in RIFT, they can also be used when setting up a cache in any other computational work-flow (Cloud processing, for example) or when managing data storage that holds partial replicas of the entire data-set.
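The kind of trace-driven simulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulator or RIFT code: it assumes a plain LRU policy, assumes that clean-up starts when the cache is full and stops once occupancy falls to the low watermark, and interprets "cache hits per byte" as the fraction of requested bytes served from the cache. The function name `simulate_lru` and the `(file_id, size_bytes)` trace format are hypothetical.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity_bytes, low_watermark=0.0):
    """Replay (file_id, size_bytes) accesses through an LRU cache.

    Assumptions (not taken from the paper): eviction is triggered when a
    new file does not fit, and continues until occupancy is at or below
    low_watermark * capacity_bytes and the new file fits.
    Returns (hit_ratio, byte_hit_ratio).
    """
    cache = OrderedDict()  # file_id -> size, least recently used first
    used = 0               # bytes currently held in the cache
    hits = 0
    hit_bytes = 0
    total_bytes = 0

    for file_id, size in trace:
        total_bytes += size
        if file_id in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(file_id)  # mark as most recently used
            continue
        if size > capacity_bytes:
            continue  # file can never fit; serve it without caching
        if used + size > capacity_bytes:
            # Clean up down to the low watermark, leaving room for the file.
            target = min(low_watermark * capacity_bytes,
                         capacity_bytes - size)
            while used > target:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
        cache[file_id] = size
        used += size

    n = len(trace)
    hit_ratio = hits / n if n else 0.0
    byte_hit_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return hit_ratio, byte_hit_ratio


if __name__ == "__main__":
    # Hypothetical toy trace: five accesses to 10-byte files.
    toy_trace = [("a", 10), ("b", 10), ("a", 10), ("c", 10), ("a", 10)]
    for lw in (0.0, 0.9):
        print(lw, simulate_lru(toy_trace, capacity_bytes=20, low_watermark=lw))
```

Sweeping `capacity_bytes` and `low_watermark` over a grid and replaying the same access records for each eviction policy would give the kind of comparison the abstract describes; other policies would replace the `OrderedDict` bookkeeping with their own eviction order.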
