
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform


Abstract

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed; they usually favor a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computational complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph with the model parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges to accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods evaluated. ZenLDA also showed good scalability when training large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and sometimes better than, state-of-the-art dedicated systems.
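As a rough illustration of the serial CGS baseline that the abstract says ZenLDA parallelizes (this is not the paper's MCCB method or its Spark implementation; the function name and toy corpus are hypothetical), a minimal collapsed Gibbs sampler for LDA can be sketched as:

```python
import random

def train_lda_cgs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                  iters=50, seed=0):
    """Serial Collapsed Gibbs Sampling for LDA on a tiny in-memory corpus.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns the document-topic and topic-word count matrices.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]           # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_count = [0] * num_topics                          # n_k
    assignments = []

    # Randomly initialize a topic assignment z for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_count[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_count[z] -= 1
                # Resample from the collapsed conditional:
                #   p(z=k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_count[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_count[z] += 1
    return doc_topic, topic_word
```

Note the serial dependency: each token's resampling reads counts updated by the previous token, which is precisely what makes CGS hard to parallelize and motivates the embarrassingly parallel MCCB reformulation described in the abstract.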

Bibliographic Details

  • Source
    Big Data Mining and Analytics | 2018, No. 1 | pp. 57-74 | 18 pages
  • Author Affiliations

    1. National Key Laboratory for Novel Software Technology, Nanjing University
    2. Collaborative Innovation Center of Novel Software Technology and Industrialization
    3. Microsoft Research
    4. Huawei Technologies Co., Ltd.

  • Format: PDF
  • Language: English
  • CLC Classification: Text information processing
  • Keywords

    Latent Dirichlet Allocation; Collapsed Gibbs Sampling; Monte-Carlo; graph computing; large-scale machine learning