Conference: 2018 IEEE International Congress on Big Data

Incorporating Word Embedding into Cross-Lingual Topic Modeling

Abstract

In this paper, we address cross-lingual topic modeling, an important technique that enables global enterprises to detect and compare topic trends across global markets. Previous work on cross-lingual topic modeling has proposed methods that use parallel or comparable corpora to construct a polylingual topic model. However, parallel or comparable corpora are in many cases unavailable. In this research, we combine cross-lingual word space mapping with topic modeling (LDA) and propose two methods: Translated Corpus with LDA (TC-LDA) and Post Match LDA (PM-LDA). The cross-lingual word space mapping allows us to compare words of different languages, and LDA enables us to group words into topics. Neither TC-LDA nor PM-LDA needs a parallel or comparable corpus, so both apply to more domains. The effectiveness of both methods is evaluated using UM-Corpus and WS-353. Our evaluation results indicate that both methods are able to identify similar documents written in different languages. In addition, PM-LDA is shown to achieve better performance than TC-LDA, especially when documents are short.
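
The abstract does not spell out how topics are matched across languages, so the following is only a minimal sketch of one plausible reading of the "post match" idea: topics are learned separately per language, and afterwards each source-language topic is paired with the closest target-language topic in a shared word space. Everything here is an assumption for illustration: the function names (`centroid`, `cosine`, `match_topics`), the use of a top-word centroid as the topic representation, and the toy 2-D vectors, which stand in for embeddings that have already been mapped into one cross-lingual space. It is not the authors' implementation.

```python
import math

def centroid(words, embeddings):
    """Average the vectors of the topic words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no embedded words in topic")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_topics(src_topics, tgt_topics, shared_embeddings):
    """For each source topic, return the index of the most similar target topic."""
    tgt_centroids = [centroid(t, shared_embeddings) for t in tgt_topics]
    matches = []
    for topic in src_topics:
        c = centroid(topic, shared_embeddings)
        scores = [cosine(c, t) for t in tgt_centroids]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches

# Hypothetical hand-made vectors, assumed to live in one shared space already.
shared_embeddings = {
    "bank": [0.9, 0.1], "loan": [0.8, 0.2],
    "银行": [0.85, 0.15], "贷款": [0.8, 0.1],
    "match": [0.1, 0.9], "goal": [0.2, 0.8],
    "比赛": [0.15, 0.9], "进球": [0.1, 0.8],
}
# Top words of topics that would come from two separate monolingual LDA runs.
english_topics = [["bank", "loan"], ["match", "goal"]]
chinese_topics = [["比赛", "进球"], ["银行", "贷款"]]

matches = match_topics(english_topics, chinese_topics, shared_embeddings)
print(matches)  # English topic i is paired with Chinese topic matches[i]
```

In this toy setup the finance topic pairs with the Chinese finance topic and the sports topic with the Chinese sports topic. A real pipeline would take the topic word lists from trained LDA models and the vectors from a learned cross-lingual mapping, neither of which is shown here.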