Journal: Machine Translation

Finding translations for low-frequency words in comparable corpora

Abstract

Statistical methods for extracting translational equivalents from non-parallel corpora hold the promise of ensuring the required coverage and domain customisation of lexicons, as well as accelerating their compilation and maintenance. A challenge for these methods is posed by rare words and expressions, which have low corpus frequencies. However, it is precisely rare words, such as newly introduced terminology and named entities, that are of the greatest interest for practical lexical acquisition. In this article, we study ways of improving the extraction of low-frequency equivalents from bilingual comparable corpora. Our work is carried out within the general framework that discovers equivalences between words of different languages using similarities between their occurrence patterns in the respective monolingual corpora. We develop a method that aims to compensate for insufficient corpus evidence on rare words: prior to measuring cross-language similarities, the method uses same-language corpus data to model the co-occurrence vectors of rare words by predicting their unseen co-occurrences and smoothing rare, unreliable ones. Our experimental evaluation demonstrates that the proposed method delivers a consistent and significant improvement over the conventional approach to this task.
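
The pipeline summarised in the abstract (smooth a rare word's co-occurrence vector with same-language data, then compare it across languages) can be illustrated with a minimal Python sketch. This is not the article's actual method: the abstract does not specify the prediction or smoothing model, so the sketch swaps in a simple interpolation with the similarity-weighted average of the nearest same-language neighbours. The function names, the parameters `k` and `alpha`, the use of a seed dictionary to map context words into the target language, and the toy data are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_rare_vector(word, vectors, k=2, alpha=0.5):
    """Interpolate a rare word's vector with the similarity-weighted average
    of its k nearest same-language neighbours (hypothetical smoothing scheme)."""
    target = vectors[word]
    neighbours = sorted(
        ((cosine(target, vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return dict(target)  # no usable neighbours: leave the vector as is
    predicted = {}
    for sim, other in neighbours:
        for ctx, weight in vectors[other].items():
            predicted[ctx] = predicted.get(ctx, 0.0) + (sim / total) * weight
    contexts = set(target) | set(predicted)
    return {c: alpha * target.get(c, 0.0) + (1 - alpha) * predicted.get(c, 0.0)
            for c in contexts}


def translate_vector(vec, seed_dict):
    """Map context words into the target language via an assumed seed dictionary;
    contexts without a dictionary entry are dropped."""
    mapped = {}
    for ctx, weight in vec.items():
        if ctx in seed_dict:
            tgt = seed_dict[ctx]
            mapped[tgt] = mapped.get(tgt, 0.0) + weight
    return mapped


def rank_candidates(word, src_vectors, tgt_vectors, seed_dict, k=2, alpha=0.5):
    """Rank target-language words by cosine similarity to the smoothed,
    dictionary-mapped vector of the source word."""
    smoothed = smooth_rare_vector(word, src_vectors, k, alpha)
    mapped = translate_vector(smoothed, seed_dict)
    return sorted(tgt_vectors,
                  key=lambda t: cosine(mapped, tgt_vectors[t]),
                  reverse=True)


# Toy data, invented purely for illustration.
src_vectors = {
    "rareword": {"a": 1.0},            # seen with only one context word
    "n1": {"a": 2.0, "b": 3.0},        # frequent word with a similar profile
    "n2": {"c": 1.0},                  # unrelated word
}
tgt_vectors = {"t1": {"x": 1.0, "y": 1.0}, "t2": {"z": 1.0}}
seed_dict = {"a": "x", "b": "y", "c": "z"}

smoothed = smooth_rare_vector("rareword", src_vectors, k=1)
ranking = rank_candidates("rareword", src_vectors, tgt_vectors, seed_dict, k=1)
```

In this toy setup the smoothed vector of `rareword` gains a weight for the context `b`, which it never co-occurred with, and that predicted co-occurrence contributes to the cross-language comparison. The interpolation weight `alpha` controls how much the observed counts are trusted relative to the neighbour-based prediction.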
