Conference: International Joint Conference on Neural Networks

Towards mining bilingual lexicons and parallel phrases from large-scale monolingual corpora

Abstract

Bilingual lexicons and parallel phrases benefit many natural language processing (NLP) tasks. Recent research has shown that high-quality bilingual lexicons can enhance machine translation performance, and incorporating them into certain specialized NLP tasks yields clear gains. Bilingual lexicons and parallel phrases can be easily extracted from parallel corpora, but compared with monolingual corpora, parallel corpora remain scarce. Monolingual corpora, however, also have the potential to yield a large number of parallel word and phrase pairs. In this paper, we propose two strategies for extracting parallel words and phrases from monolingual corpora. First, we present an indirect mining strategy, Anchored Mining (AM), which injects an anchoring point into each mining procedure to improve accuracy. Second, inspired by how humans learn a foreign language, we propose a novel direct algorithm, Bootstrapping Mining (BM), which mimics the human learning process and aims to learn parallel phrases automatically in a self-iterative way. Additionally, we propose a novel metric, phrase probability-sub item average probability (PP-SAP), to quantitatively evaluate the reasonableness of each parallel phrase pair extracted from the monolingual corpora. We conduct experiments on large-scale English-Chinese, English-Russian, and English-French monolingual corpora, and the results show that our methods can mine high-quality bilingual lexicons and parallel phrases. We also evaluate our algorithms on low-resource monolingual corpora and obtain good results.
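
The abstract does not spell out how BM works, so the following is only a rough illustration of what "self-iterative" mining from two monolingual corpora can look like. It is a generic seed-lexicon bootstrapping loop over co-occurrence context vectors, a long-standing idea in lexicon induction from non-parallel text, and it is not the authors' AM or BM algorithm. The function names, the toy corpora, the window size, and the threshold are all assumptions made for this sketch.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            ctx = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[sentence[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bootstrap_lexicon(src_sents, tgt_sents, seed, rounds=10, threshold=0.5):
    """Grow a source->target lexicon from `seed`, one best pair per round.

    Illustrative assumption: a word pair is a good candidate when the source
    word's context, translated through the current lexicon, resembles the
    target word's context.
    """
    lexicon = dict(seed)
    src_vecs = context_vectors(src_sents)
    tgt_vecs = context_vectors(tgt_sents)
    for _ in range(rounds):
        known_targets = set(lexicon.values())
        # Keep only the dimensions of each target context that we can already translate.
        tgt_restricted = {
            t: Counter({k: v for k, v in ctx.items() if k in known_targets})
            for t, ctx in tgt_vecs.items()
            if t not in known_targets
        }
        best = None
        for s, s_ctx in src_vecs.items():
            if s in lexicon:
                continue
            # Project the source context into the target vocabulary.
            projected = Counter()
            for w, count in s_ctx.items():
                if w in lexicon:
                    projected[lexicon[w]] += count
            for t, t_ctx in tgt_restricted.items():
                score = cosine(projected, t_ctx)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, s, t)
        if best is None:  # nothing confident left to add
            break
        lexicon[best[1]] = best[2]  # newly learned pair helps the next round
    return lexicon


# Toy, hand-made "monolingual" corpora (hypothetical data, not from the paper).
english = [["the", "cat", "drinks", "milk"],
           ["the", "dog", "drinks", "water"],
           ["the", "cat", "eats", "fish"]]
french = [["le", "chat", "boit", "lait"],
          ["le", "chien", "boit", "eau"],
          ["le", "chat", "mange", "poisson"]]
seed_pairs = {"the": "le", "drinks": "boit", "milk": "lait"}

learned = bootstrap_lexicon(english, french, seed_pairs)
print(learned)
```

Each round adds only the single highest-scoring pair, so every later round can use it as an additional translated context dimension. This is the sense in which the loop feeds on its own output. A real system would need far larger corpora, smoothing, and a guard against error propagation from wrong early matches.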
