International Conference on Computational Linguistics (COLING 2010)

The Noisier the Better: Identifying Multilingual Word Translations Using a Single Monolingual Corpus



Abstract

The automatic generation of dictionaries from raw text has previously been based on parallel or comparable corpora. Here we describe an approach that requires only a single monolingual corpus to generate bilingual dictionaries for several language pairs. One constraint is that all language pairs must share the same target language, which has to be the language of the underlying corpus. Our approach is based on the observation that monolingual corpora usually contain a considerable number of foreign words. As these are often explained via translations that typically occur close by, we can identify these translations by looking at the contexts of a foreign word and computing its strongest associations from them. In this work we focus on the question of what results can be expected for 20 language pairs involving five major European languages. We also compare the results for two different types of corpora, namely newsticker texts and web corpora. Our findings show that results are best if English is the source language, and that noisy web corpora are better suited for this task than well-edited newsticker texts.
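The core idea in the abstract (collect the contexts of a foreign word in a target-language corpus, then rank the co-occurring words by association strength) can be illustrated with a minimal sketch. The abstract does not say which association measure, window size, or thresholds the authors use, so the pointwise mutual information score, the `window` and `min_cooc` parameters, the function name `strongest_associations`, and the toy corpus below are all assumptions made for illustration only.

```python
import math
from collections import Counter


def strongest_associations(sentences, foreign_word, window=3, top_n=3, min_cooc=2):
    """Rank words that co-occur with `foreign_word` by pointwise mutual information.

    `sentences` is a list of token lists from a single monolingual corpus.
    PMI is an assumed stand-in for whatever association measure the paper uses.
    """
    word_freq = Counter()  # corpus frequency of every token
    cooc = Counter()       # co-occurrence counts with the foreign word
    total = 0              # total number of tokens in the corpus

    for tokens in sentences:
        total += len(tokens)
        word_freq.update(tokens)
        for i, tok in enumerate(tokens):
            if tok != foreign_word:
                continue
            # Collect neighbours within `window` tokens on either side.
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] != foreign_word:
                    cooc[tokens[j]] += 1

    f_count = word_freq[foreign_word]
    if f_count == 0:
        return []

    scores = {}
    for word, c in cooc.items():
        if c < min_cooc:  # ignore neighbours seen too rarely to be reliable
            continue
        scores[word] = math.log2(c * total / (f_count * word_freq[word]))

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy English corpus (invented) in which the German word "haus" is glossed nearby.
corpus = [
    "the german word haus means house".split(),
    "a haus or house is a building".split(),
    "the house is big".split(),
    "the word is short".split(),
    "the building is a big one".split(),
]
ranked = strongest_associations(corpus, "haus")
print(ranked[0][0])
```

On this toy corpus the top-ranked candidate for "haus" is "house", because it is the only word that appears near both occurrences. A realistic setup would need a far larger corpus and probably a measure that is more robust to low counts than raw PMI.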
