
FREERIDE-G: Supporting Applications that Mine Remote

Abstract

Analysis of large, geographically distributed scientific datasets, also referred to as distributed data-intensive science, has emerged as an important area in recent years. An application that processes data from a remote repository needs to be broken into several stages, including a data retrieval task at the data repository, a data movement task, and a data processing task at a computing site. Because of the volume of data involved and the amount of processing required, it is desirable for both the data repository and the computing site to be clusters. This can further complicate the development of such data processing applications. In this paper, we present a middleware, FREERIDE-G (framework for rapid implementation of data mining engines in grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. In particular, we had the following goals in designing the FREERIDE-G middleware: 1) support high-end processing, i.e., use parallel configurations for both hosting the data and processing the data, 2) ease the use of parallel configurations, i.e., support a high-level API for specifying the processing, and 3) hide the details of data movement and caching. We have evaluated our system using three popular data mining algorithms and two scientific data analysis applications. The main observations from our experiments are as follows. First, FREERIDE-G is able to scale the processing extremely well when the numbers of data server nodes and compute nodes are scaled evenly. Second, when only the number of compute nodes is scaled, our target class of applications achieves modest additional speedups. Finally, for applications that involve multiple passes over the dataset, caching remote data provides significant improvement.
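The staged structure described in the abstract (retrieval at the repository, movement to the compute site, processing, and caching for multi-pass applications) can be illustrated with a small sketch. This is not the FREERIDE-G API: `REMOTE_REPOSITORY`, `CachingRetriever`, `run_pass`, and `mean_and_variance` are hypothetical names invented for illustration, the "remote" repository is an in-memory dictionary, and everything runs in a single process rather than on clusters.

```python
from typing import Callable, Dict, Iterator, List, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for a remote repository: chunk id -> data chunk.
REMOTE_REPOSITORY: Dict[int, List[float]] = {
    0: [1.0, 2.0, 3.0],
    1: [4.0, 5.0],
    2: [6.0],
}


class CachingRetriever:
    """Fetches chunks from the stand-in repository and caches them locally."""

    def __init__(self, repository: Dict[int, List[float]]) -> None:
        self.repository = repository
        self.cache: Dict[int, List[float]] = {}
        self.remote_fetches = 0  # counts simulated remote transfers

    def chunks(self) -> Iterator[List[float]]:
        for chunk_id in sorted(self.repository):
            if chunk_id not in self.cache:
                # Retrieval and movement stages: copy the chunk to the compute site.
                self.cache[chunk_id] = list(self.repository[chunk_id])
                self.remote_fetches += 1
            yield self.cache[chunk_id]


def run_pass(
    retriever: CachingRetriever,
    local_reduce: Callable[[List[float]], T],
    combine: Callable[[T, T], T],
    initial: T,
) -> T:
    """Processing stage: reduce each chunk, then combine the partial results."""
    result = initial
    for chunk in retriever.chunks():
        result = combine(result, local_reduce(chunk))
    return result


def mean_and_variance(retriever: CachingRetriever) -> Tuple[float, float]:
    """A two-pass computation; the second pass is served from the cache."""
    total, count = run_pass(
        retriever,
        lambda c: (sum(c), len(c)),
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
        (0.0, 0),
    )
    mean = total / count
    squared_dev = run_pass(
        retriever,
        lambda c: sum((x - mean) ** 2 for x in c),
        lambda a, b: a + b,
        0.0,
    )
    return mean, squared_dev / count
```

In this sketch, calling `mean_and_variance(CachingRetriever(REMOTE_REPOSITORY))` makes two passes over the data, but each chunk is fetched from the stand-in repository only once. That is the effect the abstract attributes to caching for multi-pass applications.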