
Incorporating Word Embedding into Cross-Lingual Topic Modeling

Abstract

In this paper, we address cross-lingual topic modeling, an important technique that enables global enterprises to detect and compare topic trends across markets. Previous work on cross-lingual topic modeling has proposed methods that rely on parallel or comparable corpora to construct a polylingual topic model. However, parallel or comparable corpora are often unavailable. In this research, we combine cross-lingual word space mapping with topic modeling (LDA) and propose two methods: Translated Corpus with LDA (TC-LDA) and Post Match LDA (PM-LDA). The cross-lingual word space mapping allows us to compare words across languages, and LDA enables us to group words into topics. Neither TC-LDA nor PM-LDA requires a parallel or comparable corpus, so both are applicable in a wider range of domains. The effectiveness of both methods is evaluated using UM-Corpus and WS-353. Our evaluation results indicate that both methods can identify similar documents written in different languages. In addition, PM-LDA achieves better performance than TC-LDA, especially when documents are short.
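The abstract outlines a two-stage recipe: align monolingual word-embedding spaces so that words from different languages can be compared, then run LDA over the resulting corpus. The sketch below illustrates that general idea only; it is not the authors' TC-LDA or PM-LDA implementation. It assumes an orthogonal Procrustes mapping learned from a hypothetical seed bilingual dictionary and uses gensim's LdaModel; all inputs (src_emb, tgt_emb, seed_pairs, src_docs, tgt_docs) are placeholders.

```python
# Minimal sketch (assumptions noted above): align two embedding spaces with a
# linear map, move one corpus into the other language's vocabulary word by word,
# then fit a single LDA model over the combined corpus.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

def learn_mapping(src_vecs, tgt_vecs):
    """Orthogonal Procrustes: find rotation W with W @ src ~= tgt.
    src_vecs, tgt_vecs: (n_pairs, dim) embeddings for a seed bilingual
    dictionary, where row i of each matrix is a translation pair."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return u @ vt  # (dim, dim) rotation from source space into target space

def nearest_target_word(vec, tgt_vocab, tgt_matrix):
    """Return the target-language word with the highest cosine similarity."""
    sims = tgt_matrix @ vec / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(vec) + 1e-9)
    return tgt_vocab[int(np.argmax(sims))]

def cross_lingual_lda(src_emb, tgt_emb, seed_pairs, src_docs, tgt_docs, num_topics=20):
    # src_emb / tgt_emb: dicts {word: np.ndarray} from two monolingual embeddings
    # seed_pairs: [(src_word, tgt_word), ...]; src_docs / tgt_docs: tokenized docs
    S = np.vstack([src_emb[s] for s, t in seed_pairs])
    T = np.vstack([tgt_emb[t] for s, t in seed_pairs])
    W = learn_mapping(S, T)

    tgt_vocab = list(tgt_emb.keys())
    tgt_matrix = np.vstack([tgt_emb[w] for w in tgt_vocab])

    # Map each source word into the target space and replace it with its
    # nearest target-language neighbor (a crude word-by-word "translation").
    translated = [
        [nearest_target_word(W @ src_emb[w], tgt_vocab, tgt_matrix)
         for w in doc if w in src_emb]
        for doc in src_docs
    ]

    # Fit one LDA model over the translated source docs plus the target docs,
    # so topics are shared across both languages.
    texts = translated + tgt_docs
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]
    return LdaModel(bow, num_topics=num_topics, id2word=dictionary, passes=10)
```

Because both document sets end up expressed in one shared vocabulary, documents in different languages can be compared directly through their inferred topic distributions, which is the kind of cross-lingual document matching the abstract evaluates.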
