Conference: IEEE International Symposium on Workload Characterization

Workload Characterization of a Parallel Video Mining Application on a 16-Way Shared-Memory Multiprocessor System

Abstract

As video data become increasingly pervasive, mining information from multimedia data sources grows more important, e.g., automatically extracting highlights from soccer game video. However, the heavy computation required to mine the data of interest limits its wide use in practice. As computer architecture shifts from uniprocessors to multi-core processors, exploiting the thread-level parallelism in multimedia mining applications is critical to using hardware resources effectively and to accelerating the complex processing involved in highlight event detection. In this paper we analyze the view type and play field detection application, which is widely used in sports video mining systems, and we present several schemes for parallelizing it: a task-level scheme, a data-slicing-level scheme, a hybrid parallel scheme, and variations of the hybrid scheme. The hybrid parallel scheme, which combines task-level and data-slicing-level parallelism, outperforms the basic task-level and data-slicing-level schemes in both execution time and speedup. On a 16-way shared-memory multiprocessor system with hardware prefetch enabled, the hybrid scheme achieves a speedup of 10.6x. Detailed performance analysis shows that, because of its large working set, the workload often requires data from off-chip memory. Saturated bus bandwidth is therefore the likely bottleneck that prevents perfect scalability. With hardware prefetch enabled, bus utilization on the 16-processor system is about 76% for the hybrid scheme, and the projected bus bandwidth requirement for perfect scalability is about 3.1 GB/s for 16 processors and 6.2 GB/s for 32 processors. In addition, our experiments reveal no other obvious scaling-limiting factors: synchronization overhead and load imbalance remain very low even with up to 16 processors.
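
The figures quoted in the abstract can be related with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, parallel efficiency is assumed to be the usual ratio of speedup to processor count, and the per-processor bandwidth is simply the projected total divided by the processor count. Under those assumptions, a 10.6x speedup on 16 processors corresponds to an efficiency of roughly 66%, and the two bandwidth projections imply the same per-processor demand, i.e., a linear projection.

```python
# Illustrative arithmetic on the numbers reported in the abstract.
# The helper names below are hypothetical and not from the paper.

def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


def per_processor_bandwidth(total_gb_per_s: float, processors: int) -> float:
    """Projected bus bandwidth demand per processor, in GB/s."""
    return total_gb_per_s / processors


# Reported: 10.6x speedup on a 16-way system (hybrid scheme, prefetch on).
efficiency = parallel_efficiency(10.6, 16)

# Reported projections for perfect scalability: 3.1 GB/s at 16 processors,
# 6.2 GB/s at 32 processors.
bw16 = per_processor_bandwidth(3.1, 16)
bw32 = per_processor_bandwidth(6.2, 32)

print(f"parallel efficiency at 16 processors: {efficiency:.2%}")
print(f"per-processor bandwidth (16 procs): {bw16:.4f} GB/s")
print(f"per-processor bandwidth (32 procs): {bw32:.4f} GB/s")
```

Equal per-processor figures at 16 and 32 processors indicate that the projection assumes bandwidth demand grows in proportion to processor count.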
