Journal: ACM Transactions on Asian Language Information Processing

Mining English-Chinese Named Entity Pairs from Comparable Corpora

Abstract

Bilingual named entity (NE) pairs are valuable resources for many NLP applications. Because comparable corpora are more accessible, abundant, and up to date, recent research has concentrated on mining bilingual lexicons from them. This work presents a novel approach to mining English-Chinese NE translations from comparable corpora. For every candidate NE pair, it combines multi-dimensional features drawn from several information sources: a transliteration model, English-Chinese matching, Chinese-English matching, a translation model, length, and a context vector. These features are integrated into one model with linear combination and the minimum sample risk (MSR) algorithm. Because NE translation is highly type-dependent, different features are integrated for different NE types. We experiment with individual and integrated features to mine person NE (PN) pairs, location NE (LN) pairs, and organization NE (ON) pairs. PN pairs are mined best with transliteration and length, reaching an F-score of 84.9%. LN pairs are mined with the transliteration model, length, translation model, English-Chinese matching, and Chinese-English matching, with a best F-score of 83.4%. ON pairs are mined with English-Chinese matching and Chinese-English matching, with a best F-score of 84.1%.
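The type-dependent linear combination described in the abstract can be illustrated with a minimal Python sketch. The feature subsets per NE type are the ones the abstract reports for the best runs. Everything else is an assumption made for illustration: the function names, the feature keys, the feature values, and the weights. The weights here are hand-picked placeholders; the sketch does not implement MSR or any of the individual feature models.

```python
from typing import Dict, List, Tuple

# Feature subsets per NE type, taken from the abstract's best-performing setups.
# The key names themselves are hypothetical labels chosen for this sketch.
FEATURES_BY_TYPE: Dict[str, List[str]] = {
    "PN": ["transliteration", "length"],
    "LN": ["transliteration", "length", "translation",
           "ec_matching", "ce_matching"],
    "ON": ["ec_matching", "ce_matching"],
}


def score_pair(features: Dict[str, float],
               weights: Dict[str, float],
               ne_type: str) -> float:
    """Linear combination of the features that are active for this NE type."""
    return sum(weights[name] * features[name]
               for name in FEATURES_BY_TYPE[ne_type])


def best_candidate(candidates: List[Tuple[str, Dict[str, float]]],
                   weights: Dict[str, float],
                   ne_type: str) -> str:
    """Return the candidate translation with the highest combined score."""
    best = max(candidates, key=lambda c: score_pair(c[1], weights, ne_type))
    return best[0]


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights and feature values, for illustration only.
weights = {"transliteration": 0.7, "length": 0.3}
candidates = [
    ("克林顿", {"transliteration": 0.9, "length": 0.8}),
    ("华盛顿", {"transliteration": 0.2, "length": 0.8}),
]
print(best_candidate(candidates, weights, "PN"))
```

Keeping the feature subsets in a lookup table mirrors the abstract's point that each NE type is scored with its own combination of features, while the scoring function itself stays the same.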