Conference: 2012 15th IEEE International Multitopic Conference

Temperature based fault forecasting in computer clusters

Abstract

Clusters and Grids have one thing in common: both are used to achieve high performance in computing. The scope of a Cluster is relatively narrow compared to a Grid, as Clusters are homogeneous while Grids are heterogeneous. Another emerging area in High Performance Computing (HPC) is Cloud computing, which can be considered a further extension of Grid computing. Apart from other issues that exist in Clusters, Grids and Clouds, one problem is common to all of them: fault tolerance and handling. Fault tolerance is the technique, or set of techniques, used when hardware, software, network and other types of problems arise during the operation of Clusters, Grids and Clouds. In this research we focus on fault identification and forecasting from the Cluster point of view, and attempt to establish a technique that forecasts faults in Cluster-based environments on the basis of temperature. Nodes continuously receive the temperatures of their attached devices from temperature sensors and check them against the temperature threshold values of those devices. If the device temperatures are within range, we place the machine in the Green zone. If temperatures are approaching the threshold values, we place the machine in the Orange zone, which indicates that the machine may or may not crash because of temperature. When devices have crossed their temperature threshold values, we place the machine in the Red zone, which indicates that the machine is likely to fail at any time due to the failure of one or more hardware devices.
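
The zone rating described in the abstract can be illustrated with a minimal sketch. The abstract does not say how close to a threshold counts as "approaching", nor how per-device results are combined into a machine rating, so the 10 °C margin, the worst-device rule, and every name below (`Zone`, `classify_device`, `classify_node`, the device labels) are assumptions made for illustration only, not the authors' implementation.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "green"    # temperatures comfortably within range
    ORANGE = "orange"  # approaching a threshold; machine may or may not crash
    RED = "red"        # threshold crossed; machine likely to fail


# Assumed margin: a reading within 10 degrees C of its threshold
# is treated as "approaching". The abstract gives no value.
APPROACH_MARGIN_C = 10.0

# Ordering used to pick the worst zone among a node's devices.
_SEVERITY = {Zone.GREEN: 0, Zone.ORANGE: 1, Zone.RED: 2}


def classify_device(temp_c, threshold_c, margin_c=APPROACH_MARGIN_C):
    """Rate one device reading against its temperature threshold."""
    if temp_c > threshold_c:
        return Zone.RED
    if temp_c >= threshold_c - margin_c:
        return Zone.ORANGE
    return Zone.GREEN


def classify_node(readings, thresholds, margin_c=APPROACH_MARGIN_C):
    """Rate a machine by its worst device (an assumed aggregation rule).

    readings and thresholds map a device label to degrees Celsius.
    A node with no readings is rated GREEN.
    """
    zones = [
        classify_device(temp, thresholds[device], margin_c)
        for device, temp in readings.items()
    ]
    return max(zones, key=_SEVERITY.__getitem__, default=Zone.GREEN)


if __name__ == "__main__":
    # Hypothetical thresholds and sensor readings for one node.
    thresholds = {"cpu": 80.0, "disk": 55.0}
    readings = {"cpu": 62.0, "disk": 51.0}
    print(classify_node(readings, thresholds).value)
```

Rating the machine by its worst device follows the abstract's remark that a Red machine may fail because of "one or more" devices; a real deployment would take thresholds from each device's specification rather than from fixed constants.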
