Journal: IEEE Transactions on Knowledge and Data Engineering

Understanding Errors in Approximate Distributed Latent Dirichlet Allocation

Abstract

Latent Dirichlet allocation (LDA) is a popular algorithm for discovering semantic structure in large collections of text or other data. Although its complexity is linear in the data size, its use on increasingly massive collections has created considerable interest in parallel implementations. “Approximate distributed” LDA, or AD-LDA, approximates the popular collapsed Gibbs sampling algorithm for LDA models while running on a distributed architecture. Although this algorithm often appears to perform well in practice, its quality is not well understood theoretically or easily assessed on new data. In this work, we theoretically justify the approximation, and modify AD-LDA to track an error bound on performance. Specifically, we upper bound the probability of making a sampling error at each step of the algorithm (compared to an exact, sequential Gibbs sampler), given the samples drawn thus far. We show empirically that our bound is sufficiently tight to give a meaningful and intuitive measure of approximation error in AD-LDA, allowing the user to track the tradeoff between accuracy and efficiency while executing in parallel.
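
To make the approximation concrete, the following is a minimal single-process sketch of one AD-LDA iteration as the scheme is commonly described: each worker runs a collapsed Gibbs sweep over its own documents against a private, possibly stale copy of the topic-word counts, and the per-worker changes are then merged into the global counts. It does not implement the error bound proposed in the paper. The function names, the dense list-based count layout, and the sequential simulation of workers are all assumptions made for illustration.

```python
import random


def init_state(docs, K, V, rng):
    """Randomly assign a topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_dk = [[0] * K for _ in docs]        # document-topic counts
    n_kw = [[0] * V for _ in range(K)]    # topic-word counts
    n_k = [0] * K                         # tokens per topic
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return z, n_dk, n_kw, n_k


def gibbs_sweep(doc_ids, docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over the given documents (counts updated in place)."""
    K, V = len(n_kw), len(n_kw[0])
    for d in doc_ids:
        for i, w in enumerate(docs[d]):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d][k] -= 1
            n_kw[k][w] -= 1
            n_k[k] -= 1
            # Unnormalised conditional p(z = t | all other assignments).
            weights = [
                (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                for t in range(K)
            ]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1


def ad_lda_iteration(shards, docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One AD-LDA-style iteration; workers are simulated one after another."""
    K, V = len(n_kw), len(n_kw[0])
    local_copies = []
    for doc_ids in shards:
        # Each worker starts from the same global snapshot, so it never sees
        # the updates made by the other workers during this iteration.
        local_kw = [row[:] for row in n_kw]
        local_k = n_k[:]
        gibbs_sweep(doc_ids, docs, z, n_dk, local_kw, local_k, alpha, beta, rng)
        local_copies.append(local_kw)
    # Merge step: global += sum over workers of (local - global snapshot).
    for k in range(K):
        for w in range(V):
            old = n_kw[k][w]
            n_kw[k][w] = old + sum(local[k][w] - old for local in local_copies)
        n_k[k] = sum(n_kw[k])


rng = random.Random(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4, 4], [0, 4, 2, 3]]  # toy word ids
z, n_dk, n_kw, n_k = init_state(docs, K=3, V=5, rng=rng)
for _ in range(5):
    ad_lda_iteration([[0, 1], [2, 3]], docs, z, n_dk, n_kw, n_k, 0.1, 0.01, rng)
```

Because every worker samples against the same snapshot, a token's conditional distribution can differ from the one an exact sequential sampler would use; that mismatch is the sampling error whose probability the abstract says the paper bounds. After the merge, the global counts are again consistent with the current topic assignments.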