Journal: IEEE Transactions on Multimedia

Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems


Abstract

The decreasing mean-time-to-failure estimates in cloud computing systems indicate that multimedia applications running in such environments should be able to mitigate an increasing number of core failures at runtime. We propose a new failure-mitigation approach for integer sum-of-product computations, with emphasis on generic matrix multiplication (GEMM) and convolution/cross-correlation (CONV) routines. Our approach is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. Our proposal imposes a 37.5% reduction in the maximum output bitwidth supported in comparison to integer sum-of-product realizations performed on 32-bit integer representations, which is comparable to the bitwidth requirement of checksum methods for multiple core failure mitigation. Experiments with state-of-the-art GEMM and CONV routines running on a c4.8xlarge compute-optimized instance of Amazon Web Services Elastic Compute Cloud (AWS EC2) demonstrate that the proposed approach is able to mitigate up to one quad-core failure while achieving processing throughput that is 1) comparable to that of the conventional, failure-intolerant, integer GEMM and CONV routines and 2) substantially superior to that of the equivalent roll-forward failure-mitigation method based on checksum streams. Furthermore, when used within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminatable) instances, our proposal leads to 1) a 16%–23% cost reduction against the equivalent checksum-based method and 2) more than 70% cost reduction against conventional failure-intolerant processing on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances.
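The idea of carrying redundant results inside the numerical representation of the outputs can be illustrated with a toy sketch. It is not the paper's actual packing scheme, which the abstract does not specify. Two hypothetical "cores" each compute one packed dot product whose low field holds that core's own result and whose high field holds a copy of the other core's result, so either result survives the loss of one core. The field width `K = 20` is an assumption, chosen only because 32 × (1 − 0.375) = 20. The names `pack`, `unpack`, and `redundant_dots` are illustrative.

```python
K = 20  # assumed field width in bits; every result must fit in a signed K-bit range


def pack(lo_vals, hi_vals, k=K):
    """Pack two integer streams element-wise: lo + hi * 2**k."""
    return [lo + (hi << k) for lo, hi in zip(lo_vals, hi_vals)]


def dot(x, y):
    """Plain integer sum-of-products."""
    return sum(a * b for a, b in zip(x, y))


def unpack(packed, k=K):
    """Split a packed result into its signed low field and its high field."""
    half = 1 << (k - 1)
    lo = ((packed + half) % (1 << k)) - half  # signed value of the low k bits
    hi = (packed - lo) >> k                   # exact, since packed - lo is a multiple of 2**k
    return lo, hi


def redundant_dots(a1, a2, b, failed_core=None):
    """Compute a1.b and a2.b on two simulated cores, tolerating one failure.

    Core 1 outputs a1.b in the low field and a2.b in the high field.
    Core 2 outputs a2.b in the low field and a1.b in the high field.
    """
    outputs = {}
    if failed_core != 1:
        outputs[1] = dot(pack(a1, a2), b)
    if failed_core != 2:
        outputs[2] = dot(pack(a2, a1), b)

    r1 = r2 = None
    if 1 in outputs:
        r1, r2 = unpack(outputs[1])
    if 2 in outputs:
        own, copy = unpack(outputs[2])
        r2 = own if r2 is None else r2
        r1 = copy if r1 is None else r1
    return r1, r2


a1 = [3, -4, 5]
a2 = [-7, 2, 9]
b = [10, 20, -30]
print(redundant_dots(a1, a2, b))                 # no failure
print(redundant_dots(a1, a2, b, failed_core=1))  # core 1 lost, results recovered from core 2
```

In this sketch the redundancy costs no additional multiply-accumulate operations and no separate checksum stream. The cost is dynamic range: each result must fit in a signed `K`-bit field instead of the full word.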
