Journal: ACM Transactions on Asian Language Information Processing

Multilingual Topic Models for Bilingual Dictionary Extraction

Abstract

A machine-readable bilingual dictionary plays a crucial role in many natural language processing tasks, such as statistical machine translation and cross-language information retrieval. In this article, we propose a framework for extracting a bilingual dictionary from comparable corpora by exploiting a novel combination of topic modeling and word aligners such as the IBM models. Using a multilingual topic model, we first convert a comparable document-aligned corpus into a parallel topic-aligned corpus. This novel topic-aligned corpus is similar in structure to the sentence-aligned corpus frequently employed in statistical machine translation and allows us to extract a bilingual dictionary using a word alignment model. The main advantages of our framework are that (1) no seed dictionary is necessary for bootstrapping the process, and (2) multilingual comparable corpora in more than two languages can also be exploited. In our experiments on a large-scale Wikipedia dataset, we demonstrate that our approach extracts higher-precision dictionaries than previous approaches and that our method improves further as we add more languages to the dataset.
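
The two-stage idea in the abstract (topic-aligned corpus first, word alignment second) can be illustrated with a minimal sketch. The abstract does not say how the topic-aligned corpus is constructed or which IBM model is used, so everything below is an assumption: each topic is taken to contribute one bag of words per language, the aligner is a bare IBM Model 1 without a NULL word, and the token-topic assignments are invented toy data standing in for the output of a multilingual topic model. The function names are hypothetical.

```python
from collections import defaultdict

def build_topic_aligned_corpus(assignments_src, assignments_tgt):
    """Group (word, topic) tokens into one (source words, target words) pair per topic."""
    src_by_topic = defaultdict(list)
    tgt_by_topic = defaultdict(list)
    for word, topic in assignments_src:
        src_by_topic[topic].append(word)
    for word, topic in assignments_tgt:
        tgt_by_topic[topic].append(word)
    shared = sorted(set(src_by_topic) & set(tgt_by_topic))
    return [(src_by_topic[k], tgt_by_topic[k]) for k in shared]

def ibm_model1(pairs, iterations=20):
    """Estimate t(tgt | src) with IBM Model 1 EM (no NULL word, for brevity)."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (tgt_word, src_word)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of its topic.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def extract_dictionary(t):
    """Keep the most probable target word for every source word."""
    best = {}
    for (f, e), p in t.items():
        if e not in best or p > best[e][1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}

# Toy (word, topic) assignments; invented for illustration only.
en = [("house", 0), ("garden", 0), ("house", 1), ("book", 1), ("book", 2), ("garden", 2)]
de = [("haus", 0), ("garten", 0), ("haus", 1), ("buch", 1), ("buch", 2), ("garten", 2)]

corpus = build_topic_aligned_corpus(en, de)
probs = ibm_model1(corpus)
dictionary = extract_dictionary(probs)
print(sorted(dictionary.items()))
# [('book', 'buch'), ('garden', 'garten'), ('house', 'haus')]
```

In this toy setting each topic plays the role that a sentence pair plays in statistical machine translation, which is why an off-the-shelf word aligner can be applied without a seed dictionary.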

Bibliographic details

  • Author affiliations

    Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan;

    Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan;

    Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0192, Japan;

  • Original format: PDF
  • Language: English (eng)
  • Keywords

    Bilingual dictionary; multilingual topic model; comparable corpus;
