Home > Conference Proceedings > 2012 19th International Conference on High Performance Computing > Fault tolerant parallel data-intensive algorithms

Fault tolerant parallel data-intensive algorithms



Abstract

Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as increasing numbers of cores are decreasing the mean time to failure of these systems. While checkpointing, including checkpointing of parallel programs such as MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Algorithm-based fault tolerance offers a practical, though less general, alternative. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault-tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks so that the maximum data intersection between any two processors is kept to a minimum. This minimizes data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of re-processing of lost data. We have evaluated our approach using two popular parallel data mining algorithms, k-means and apriori. We show that our approach has negligible overhead when there are no failures, and that it gracefully handles different numbers of failures as well as failures at different points in the processing. We also compare our approach with a MapReduce-based fault-tolerance solution, and show that we outperform Hadoop both in the absence and in the presence of failures.
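The replication idea in the abstract, distributing replicated blocks so that the maximum data intersection between any two processors stays small, can be illustrated with a simple greedy placement. This is only a minimal sketch of that objective: the function name `place_replicas` and the greedy heuristic (pick, for each replica, the processor that adds the least to the worst pairwise overlap, breaking ties by load) are illustrative assumptions, not the paper's actual algorithm.

```python
def place_replicas(num_blocks, num_procs, replicas=2):
    """Greedily assign `replicas` copies of each block to distinct processors,
    trying to keep the maximum number of blocks shared by any two processors
    small, so that a multi-processor failure loses as little data as possible."""
    placement = {b: [] for b in range(num_blocks)}
    # overlap[p][q] = number of blocks currently held by both p and q
    overlap = [[0] * num_procs for _ in range(num_procs)]
    load = [0] * num_procs  # blocks per processor, used as a tie-breaker
    for b in range(num_blocks):
        for _ in range(replicas):
            best, best_cost = None, None
            for p in range(num_procs):
                if p in placement[b]:
                    continue  # replicas of one block must sit on distinct processors
                # cost: worst pairwise overlap this choice would share with the
                # processors already holding block b, then current load
                worst = max((overlap[p][q] for q in placement[b]), default=0)
                cost = (worst, load[p])
                if best_cost is None or cost < best_cost:
                    best, best_cost = p, cost
            for q in placement[b]:
                overlap[best][q] += 1
                overlap[q][best] += 1
            placement[b].append(best)
            load[best] += 1
    return placement
```

With this placement, losing any two processors destroys at most the few blocks they happen to share, and the surviving replicas keep the re-processing of lost data small, which matches the load-balance and data-loss goals the abstract describes.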
