Conference: Iranian Conference on Electrical Engineering
Analysis of a Parallel/Distributed Application Using a Cycle-Accurate Parallel/Distributed Simulator

Abstract

Computer architects need a good understanding of the applications that run on the hardware they design, so that they can optimize their designs and run those applications more efficiently. Designing processors to accelerate big-data and cloud applications is an active research topic, yet only a few papers analyze or characterize emerging big-data and cloud applications. Although these studies reveal inefficiencies in processor microarchitectures running big-data applications, they were conducted on real hardware, which limits the scope and flexibility of the analysis. In this paper, we aim to characterize a big-data workload using a novel method for simulating a distributed system, and to optimize an out-of-order core for running cloud applications. dist-gem5 is a parallel and distributed version of gem5 that allows us to efficiently simulate a large-scale distributed system on a cluster. Using dist-gem5, we aim to identify the bottlenecks and inefficiencies in server processors and their overall system architecture, cutting across the off-chip network stack, operating system, Ethernet devices, and core microarchitecture. First, we compare the results of a set of cloud applications against SPECint2006 results. Our results show that BigData workloads, compared to SPECint, have ~3× higher instruction cache miss rates and ~4× higher branch misprediction rates. Next, we pick Memcached as a representative BigData workload and analyze how its performance and power scale with more cores under different request rates and core microarchitectures. Interestingly, we find that having more cores on a chip does not bring more performance, even for a parallel application like Memcached: a quad-core ARM-v7 chip can have up to 6.5× longer average request latency than a single-core ARM-v7 chip. We find that the L2 cache architecture is the bottleneck in the ARM-v7 multi-core system, and that fixing it can make an embedded core perform as well as a high-performance O3 core when running a Memcached server.