Data mining and knowledge discovery

C-BiLDA: extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content



Abstract

We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA and demonstrate its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmark datasets for three language pairs. The proposed model outperforms the baseline LDA model, as well as the standard BiLDA model and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.
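The core modeling idea in the abstract can be illustrated with a toy generative sampler. This is a minimal sketch under stated assumptions, not the authors' exact specification: BiLDA assumes both documents of an aligned pair draw topics from one shared distribution theta, while a C-BiLDA-style model lets each token in the second document first flip a per-token indicator that decides whether its topic comes from the shared theta or from a document-specific distribution psi covering unshared content. All function names, priors, and parameter values below are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_comparable_pair(n_topics=4, vocab_size=20, doc_len=50, seed=0):
    """Toy generative story for a C-BiLDA-style model (illustrative sketch).

    Language-1 tokens always draw topics from the shared distribution theta
    (as in BiLDA). Each language-2 token flips a Bernoulli "shared" indicator:
    if it comes up True the topic is drawn from theta, otherwise from a
    document-specific distribution psi over unshared content.
    """
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_topics)              # symmetric Dirichlet prior over topics
    beta = np.ones(vocab_size)             # symmetric prior over words per topic
    phi1 = rng.dirichlet(beta, n_topics)   # topic-word distributions, language 1
    phi2 = rng.dirichlet(beta, n_topics)   # topic-word distributions, language 2
    theta = rng.dirichlet(alpha)           # shared per-pair topic distribution
    psi = rng.dirichlet(alpha)             # unshared topic distribution (lang 2)
    share_prob = rng.beta(2.0, 2.0)        # prob. a lang-2 token uses shared topics

    # Language-1 document: every token's topic comes from the shared theta.
    doc1 = [int(rng.choice(vocab_size, p=phi1[rng.choice(n_topics, p=theta)]))
            for _ in range(doc_len)]

    # Language-2 document: per-token indicator routes between theta and psi.
    doc2 = []
    for _ in range(doc_len):
        shared = rng.random() < share_prob
        z = rng.choice(n_topics, p=theta if shared else psi)
        doc2.append(int(rng.choice(vocab_size, p=phi2[z])))
    return doc1, doc2, share_prob
```

Setting `share_prob` to 1 for every document pair recovers the BiLDA assumption of identical topic distributions; letting it vary is what allows the model to handle comparable rather than parallel data.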
