Journal: ACM Transactions on Asian Language Information Processing

Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension

Abstract

Bilingual dictionaries can be automatically extended with new translations using comparable corpora. The general idea rests on the assumption that similar words have similar contexts across languages. However, previous studies have mainly focused on Indo-European languages, or have used only a bag-of-words model to describe the context. Furthermore, we argue that it is helpful to extract only the statistically significant context, instead of using all context. The present approach addresses these issues in the following manner. First, based on the context of a word with an unknown translation (the query word), we extract salient pivot words. Pivot words are words for which a translation is already available in a bilingual dictionary. To extract salient pivot words, we use a Bayesian estimation of the pointwise mutual information to measure statistical significance. In the second step, we match these pivot words across languages to identify translation candidates for the query word. To this end, we calculate a similarity score between the query word and a translation candidate using the probability that the same pivots will be extracted for both the query word and the translation candidate. The proposed method uses several context positions, namely, a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence. In order to make these context positions comparable across Japanese and English, which are unrelated languages, we use several heuristics to adjust the dependency trees appropriately. We demonstrate that the proposed method significantly increases the accuracy of word translations, as compared to previous methods.
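The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator or similarity score, which the abstract does not specify. As stand-ins, the sketch assumes (a) a pointwise mutual information computed from posterior-mean probabilities under a symmetric Dirichlet prior on a 2x2 sentence co-occurrence table, and (b) a plain Jaccard overlap of translated pivot sets in place of the probability-based similarity. All function names, the `alpha` and `threshold` parameters, and the toy corpora are hypothetical, and only the bag-of-words context of one sentence is used.

```python
import math

def smoothed_pmi(n_xy, n_x, n_y, n, alpha=1.0):
    """PMI from posterior-mean probabilities of a 2x2 co-occurrence table.

    Assumption: a symmetric Dirichlet(alpha) prior on the four table cells.
    This is an illustrative stand-in, not the estimator used in the paper.
    """
    denom = n + 4 * alpha
    p_xy = (n_xy + alpha) / denom      # both words occur in the sentence
    p_x = (n_x + 2 * alpha) / denom    # marginal covers two cells of the table
    p_y = (n_y + 2 * alpha) / denom
    return math.log(p_xy / (p_x * p_y))

def salient_pivots(query, sentences, pivots, threshold=0.0):
    """Return the pivots whose smoothed PMI with `query` exceeds `threshold`."""
    sets = [set(s) for s in sentences]
    n = len(sets)
    n_q = sum(query in s for s in sets)
    salient = set()
    for p in pivots:
        n_p = sum(p in s for s in sets)
        n_qp = sum(query in s and p in s for s in sets)
        if smoothed_pmi(n_qp, n_q, n_p, n) > threshold:
            salient.add(p)
    return salient

def pivot_overlap(source_pivots, target_pivots, dictionary):
    """Jaccard overlap of pivot sets after translating the source pivots.

    Assumption: a simple proxy for the probability-based score in the abstract.
    """
    translated = {dictionary[p] for p in source_pivots if p in dictionary}
    union = translated | target_pivots
    return len(translated & target_pivots) / len(union) if union else 0.0

# Toy comparable corpora (hypothetical, romanized Japanese and English).
ja_sentences = [
    ["neko", "sakana", "taberu"], ["neko", "sakana", "suki"], ["neko", "nemuru"],
    ["kuruma", "hashiru"], ["kuruma", "michi", "hashiru"], ["inu", "hashiru"],
]
en_sentences = [
    ["cat", "fish", "eat"], ["cat", "fish", "like"], ["cat", "sleep"],
    ["car", "run"], ["car", "road", "run"], ["dog", "run"],
]
# Seed dictionary: these entries are the pivot words.
dictionary = {"sakana": "fish", "taberu": "eat", "hashiru": "run",
              "michi": "road", "nemuru": "sleep"}

query_pivots = salient_pivots("neko", ja_sentences, dictionary.keys())
scores = {
    cand: pivot_overlap(
        query_pivots,
        salient_pivots(cand, en_sentences, set(dictionary.values())),
        dictionary,
    )
    for cand in ("cat", "car")
}
print(max(scores, key=scores.get))
```

On this toy data the query word "neko" shares all of its salient pivots with "cat" and none with "car", so "cat" is ranked first. Extending the sketch to the dependency-based context positions would mean computing one pivot set per position (successors, predecessors, siblings) and combining the scores, which is left out here.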
