Analysis of a Parallel/Distributed Application Using a Cycle-Accurate Parallel/Distributed Simulator


Abstract

It is important for computer architects to understand the applications that run on the hardware they design, so that they can optimize their designs and run those applications more efficiently. Designing processors to accelerate big-data and cloud applications is an active research topic, yet only a few papers currently analyze or characterize emerging big-data and cloud applications. Although these studies reveal inefficiencies in processor microarchitectures running big-data applications, they were conducted on real hardware, which limits the scope and flexibility of the analysis. In this paper, we aim to characterize a big-data workload using a novel method for simulating a distributed system, and to optimize an out-of-order core for running cloud applications. dist-gem5 is a parallel and distributed version of gem5 that allows us to efficiently simulate a large-scale distributed system on a cluster. Using dist-gem5, we aim to identify the bottlenecks and inefficiencies in server processors and their overall system architecture, cutting across the off-chip network stack, operating system, Ethernet devices, and core microarchitecture. First, we compare the results of a set of cloud applications against SPECint2006 results. Our results show that big-data workloads, compared to SPECint2006, have ~3× higher instruction cache miss rates and ~4× higher branch misprediction rates. Next, we pick Memcached as a representative big-data workload and analyze how its performance and power scale with more cores under different request rates and core microarchitectures. Interestingly, we find that having more cores on a chip does not bring more performance, even for a parallel application like Memcached: a quad-core ARM-v7 chip can have up to 6.5× longer average request latency than a single-core ARM-v7 chip. We find that the L2 cache architecture is the bottleneck in the ARM-v7 multi-core system, and that fixing it can make an embedded core perform as well as a high-performance O3 core when running a Memcached server.

Bibliographic record

Source: Iranian Conference on Electrical Engineering | 2018 | pp. 1523-1529 | 7 pages
Authors: Mohammad Zaman Ataie; Omid Elahi
Original format: PDF
Keywords: Servers; Multicore processing; Program processors; Computational modeling; Hardware; Microarchitecture
