Published in: 2012 19th International Conference on High Performance Computing

Distributed hierarchical co-clustering and collaborative filtering algorithm


Abstract

Petascale analytics is an active research area in both academia and industry. It envisages processing massive amounts of data at extremely high rates to generate new scientific insights and to benefit both users and providers in industries such as e-commerce, telecom, finance, and life sciences. We consider collaborative filtering (CF) and clustering algorithms, which are fundamental analytics kernels that help achieve these aims. Real-time CF and co-clustering on highly sparse, massive datasets with high prediction accuracy is a computationally challenging problem. In this paper, we present a novel hierarchical design for a soft real-time (less than 1 minute) distributed co-clustering-based collaborative filtering algorithm. Our distributed algorithm has been optimized for multi-core cluster architectures. A theoretical analysis of the time complexity of our algorithm proves the efficacy of our approach. Using the Netflix dataset (900M training ratings with replication) and the Yahoo KDD Cup 1 dataset (4.6B training ratings with replication), we demonstrate the performance and scalability of our algorithm on a 4096-node multi-core cluster architecture. Our distributed algorithm (implemented using OpenMP with MPI) demonstrates around 4x better performance (on Blue Gene/P) than the best prior work, along with high accuracy (RMSE of 26 ± 4 for the Yahoo KDD Cup data and 0.87 ± 0.02 for the Netflix data). To the best of our knowledge, these are the best known performance results for collaborative filtering, at high prediction accuracy, on multi-core cluster architectures.
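The abstract does not spell out the prediction rule, so the following is only a minimal single-process Python sketch of a commonly used co-clustering prediction (co-cluster mean plus user and item bias terms) together with the RMSE metric the abstract reports. It assumes the user and item cluster assignments are already given. The function names `cocluster_predictor` and `rmse` are hypothetical, and none of the paper's hierarchical, distributed (OpenMP with MPI) design is reproduced here.

```python
import math
from collections import defaultdict


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def cocluster_predictor(ratings, user_cluster, item_cluster):
    """Build a rating predictor from fixed co-cluster assignments.

    ratings: dict mapping (user, item) -> observed rating (sparse).
    user_cluster / item_cluster: dicts mapping user / item -> cluster id.
    These inputs are assumptions of this sketch, not the paper's interface.
    """
    by_user, by_item = defaultdict(list), defaultdict(list)
    by_ucl, by_icl, by_block = defaultdict(list), defaultdict(list), defaultdict(list)
    for (u, i), r in ratings.items():
        g, h = user_cluster[u], item_cluster[i]
        by_user[u].append(r)
        by_item[i].append(r)
        by_ucl[g].append(r)
        by_icl[h].append(r)
        by_block[(g, h)].append(r)
    global_mean = _mean(ratings.values())

    def bias(own, cluster):
        # Deviation of a user's (or item's) mean from its cluster mean;
        # zero when the user or item has no observed ratings.
        return _mean(own) - _mean(cluster) if own and cluster else 0.0

    def predict(u, i):
        g, h = user_cluster[u], item_cluster[i]
        block = by_block.get((g, h))
        # Fall back to the global mean for an empty co-cluster.
        base = _mean(block) if block else global_mean
        return (base
                + bias(by_user.get(u, []), by_ucl.get(g, []))
                + bias(by_item.get(i, []), by_icl.get(h, [])))

    return predict


def rmse(predict, test_ratings):
    """Root mean squared error of `predict` over held-out (user, item) ratings."""
    squared = [(predict(u, i) - r) ** 2 for (u, i), r in test_ratings.items()]
    return math.sqrt(sum(squared) / len(squared))
```

In a distributed setting, the per-cluster sums and counts accumulated in the loop are the kind of quantities that would be reduced across processes; how the paper partitions and combines them is not stated in the abstract.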
