Home > Conference Proceedings > 2012 19th International Conference on High Performance Computing > Fault tolerant parallel data-intensive algorithms

Fault tolerant parallel data-intensive algorithms



Abstract

Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as increasing numbers of cores are decreasing the mean time to failure of these systems. While checkpointing, including checkpointing of parallel programs such as MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Algorithm-based fault tolerance offers a practical, though less general, alternative. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault-tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks so that the maximum data intersection between any two processors is kept to a minimum. This minimizes data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of re-processing of lost data. We have evaluated our approach using two popular parallel data mining algorithms, k-means and apriori. We show that our approach has negligible overhead when there are no failures, and that it gracefully handles different numbers of failures as well as failures at different points in the processing. We also compare our approach with a MapReduce-based fault-tolerance solution, and show that we outperform Hadoop both in the absence and in the presence of failures.
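The replication idea in the abstract, distributing replicated blocks so that the maximum data intersection between any two processors stays small, can be illustrated with a simple greedy placement. This is only a minimal sketch of that objective: the function name `place_replicas` and the greedy heuristic (pick, for each replica, the processor that adds the least to the worst pairwise overlap, breaking ties by load) are illustrative assumptions, not the paper's actual algorithm.

```python
def place_replicas(num_blocks, num_procs, replicas=2):
    """Greedily assign `replicas` copies of each block to distinct processors,
    trying to keep the maximum number of blocks shared by any two processors
    small, so that a multi-processor failure loses as little data as possible."""
    placement = {b: [] for b in range(num_blocks)}
    # overlap[p][q] = number of blocks currently held by both p and q
    overlap = [[0] * num_procs for _ in range(num_procs)]
    load = [0] * num_procs  # blocks per processor, used as a tie-breaker
    for b in range(num_blocks):
        for _ in range(replicas):
            best, best_cost = None, None
            for p in range(num_procs):
                if p in placement[b]:
                    continue  # replicas of one block must sit on distinct processors
                # cost: worst pairwise overlap this choice would share with the
                # processors already holding block b, then current load
                worst = max((overlap[p][q] for q in placement[b]), default=0)
                cost = (worst, load[p])
                if best_cost is None or cost < best_cost:
                    best, best_cost = p, cost
            for q in placement[b]:
                overlap[best][q] += 1
                overlap[q][best] += 1
            placement[b].append(best)
            load[best] += 1
    return placement
```

With this placement, losing any two processors destroys at most the few blocks they happen to share, and the surviving replicas keep the re-processing of lost data small, which matches the load-balance and data-loss goals the abstract describes.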
