Conference: Intelligent Data Engineering and Automated Learning - IDEAL 2011

P²LSA and P²LSA+: Two Paralleled Probabilistic Latent Semantic Analysis Algorithms Based on the MapReduce Model


Abstract

Two novel paralleled Probabilistic Latent Semantic Analysis (PLSA) algorithms based on the MapReduce model are proposed: P²LSA and P²LSA+. When dealing with a large-scale data set, P²LSA and P²LSA+ can improve the computing speed on the Hadoop platform. The traditional PLSA method typically uses the Expectation-Maximization (EM) algorithm to estimate two hidden parameter vectors, and the parallel PLSA algorithms implement EM in parallel. The EM algorithm has two steps: the E-step and the M-step. In P²LSA, the Map function performs the E-step and the Reduce function performs the M-step. However, all the intermediate results computed in the E-step must be sent to the M-step, and transferring this large amount of data between the two steps increases both the network load and the overall running time. In P²LSA+, by contrast, the Map function performs the E-step and the M-step simultaneously, so less data is transferred between the two steps and performance improves. Experiments are conducted to evaluate the performance of P²LSA and P²LSA+ on a data set of 20,000 users and 10,927 goods. The speedup curves show that the overall running time decreases as the number of computing nodes increases. The overall running time also shows that P²LSA+ is about 3 times faster than P²LSA.
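The difference between the two designs can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the abstract gives neither the update equations nor the key layout, so the code assumes the common asymmetric PLSA form P(i|u) = Σ_z P(z|u) P(i|z), runs the "map" and "reduce" phases as plain Python functions rather than Hadoop jobs, and models the P²LSA+ idea as a mapper that pre-sums its M-step statistics locally. All names (`map_plain`, `map_plus`, `reduce_m_step`, `run`, `COUNTS`) are hypothetical.

```python
import random
from collections import defaultdict

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def e_step(u, i, n, p_z_u, p_i_z, K):
    """Return n(u,i) * P(z | u, i) for every topic z (the E-step for one pair)."""
    w = [p_z_u[u][z] * p_i_z[z][i] for z in range(K)]
    s = sum(w) or 1.0
    return [n * x / s for x in w]

def map_plain(split, p_z_u, p_i_z, K):
    """Assumed P2LSA-style mapper: E-step only, one record per (pair, topic) statistic."""
    out = []
    for u, i, n in split:
        for z, c in enumerate(e_step(u, i, n, p_z_u, p_i_z, K)):
            out.append((("zu", u, z), c))  # contributes to P(z|u)
            out.append((("iz", z, i), c))  # contributes to P(i|z)
    return out

def map_plus(split, p_z_u, p_i_z, K):
    """Assumed P2LSA+-style mapper: E-step plus local summation of M-step statistics."""
    acc = defaultdict(float)
    for key, c in map_plain(split, p_z_u, p_i_z, K):
        acc[key] += c
    return list(acc.items())

def reduce_m_step(records, users, items, K):
    """Sum the emitted statistics globally and renormalize (the M-step)."""
    sums = defaultdict(float)
    for key, c in records:
        sums[key] += c
    p_z_u = {}
    for u in users:
        tot = sum(sums[("zu", u, z)] for z in range(K)) or 1.0
        p_z_u[u] = [sums[("zu", u, z)] / tot for z in range(K)]
    p_i_z = []
    for z in range(K):
        tot = sum(sums[("iz", z, i)] for i in items) or 1.0
        p_i_z.append({i: sums[("iz", z, i)] / tot for i in items})
    return p_z_u, p_i_z

def run(counts, K, n_splits, mapper, iters=20, seed=0):
    """Run EM; return the parameters and the number of records sent map -> reduce."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in counts})
    items = sorted({i for _, i, _ in counts})
    p_z_u = {u: normalize([rng.random() for _ in range(K)]) for u in users}
    p_i_z = [dict(zip(items, normalize([rng.random() for _ in items])))
             for _ in range(K)]
    size = -(-len(counts) // n_splits)  # ceiling division: records per split
    splits = [counts[k:k + size] for k in range(0, len(counts), size)]
    shuffled = 0
    for _ in range(iters):
        records = [r for s in splits for r in mapper(s, p_z_u, p_i_z, K)]
        shuffled += len(records)
        p_z_u, p_i_z = reduce_m_step(records, users, items, K)
    return p_z_u, p_i_z, shuffled

# Toy (user, item, count) triples, invented for illustration only.
COUNTS = [("u1", "a", 2), ("u1", "b", 1), ("u2", "a", 1), ("u2", "b", 3),
          ("u3", "c", 4), ("u3", "a", 1), ("u4", "c", 2), ("u4", "b", 1)]

if __name__ == "__main__":
    for name, mapper in [("plain", map_plain), ("plus", map_plus)]:
        _, _, shuffled = run(COUNTS, K=2, n_splits=2, mapper=mapper)
        print(name, "records shuffled:", shuffled)
```

Both mappers feed the same reducer and reach the same parameters; the only difference in this sketch is how many records cross the map/reduce boundary. The toy data says nothing about the roughly threefold speedup reported in the abstract, which depends on the real cluster and data set.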

Bibliographic record

  • Conference location: Norwich (GB)
  • Author affiliations

    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;

    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;

    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;

    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;

    School of Engineering and Advanced Technology, Massey University, Palmerston North, New Zealand;

    State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;

  • Original format: PDF
  • Language: English
  • Chinese Library Classification: Computer networks
  • Keywords

    paralleled PLSA; PLSA; MapReduce;

