Journal: ACM Transactions on Asian Language Information Processing

Corpus-Based Translation Induction in Indian Languages Using Auxiliary Language Corpora from Wikipedia


Abstract

Identifying translations from comparable corpora is a well-known problem with several applications. Existing methods rely on linguistic tools or high-quality corpora. Absence of such resources, especially in Indian languages, makes this problem hard; for example, state-of-the-art techniques achieve a mean reciprocal rank of 0.66 for English-Italian, and a mere 0.187 for Telugu-Kannada. In this work, we address the problem of comparable corpora-based translation correspondence induction (CC-TCI) when the only resources available are small noisy comparable corpora extracted from Wikipedia. We observe that translations in the source and target languages have many topically related words in common in other "auxiliary" languages. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for CC-TCI. Extensive experiments on 35 comparable corpora showed dramatic improvements in performance. We extend these ideas to propose a method for measuring cross-lingual semantic relatedness (CLSR) between words. To stimulate further research in this area, we make publicly available two new high-quality human-annotated datasets for CLSR. Experiments on the CLSR datasets show more than 200% improvement in correlation on the CLSR task. We apply the method to the real-world problem of cross-lingual Wikipedia title suggestion and build the WikiTSu system. A user study on WikiTSu shows a 20% improvement in the quality of titles suggested.
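The abstract reports results as mean reciprocal rank (MRR). As a reading aid, here is a minimal sketch of how that metric is commonly computed for translation induction: each source word has a ranked list of candidate translations, and the score is the average of 1/rank of the correct one. The function name, the convention of scoring 0 when the correct translation is absent, and the toy data are illustrative assumptions, not taken from the paper.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    ranked_candidates: Sequence[Sequence[Hashable]],
    gold: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the gold translation over all source words.

    Assumption: a source word whose gold translation does not appear in
    its candidate list contributes 0 to the sum.
    """
    if len(ranked_candidates) != len(gold):
        raise ValueError("need exactly one gold translation per candidate list")
    if not gold:
        return 0.0

    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        # Ranks are 1-based: the top candidate has rank 1.
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)


# Hypothetical toy data: three source words with ranked candidate lists.
toy_candidates = [
    ["t1", "t2", "t3"],  # gold "t1" at rank 1 -> contributes 1
    ["u1", "u2"],        # gold "u2" at rank 2 -> contributes 1/2
    ["v1", "v2"],        # gold "v9" not listed -> contributes 0
]
toy_gold = ["t1", "u2", "v9"]
toy_mrr = mean_reciprocal_rank(toy_candidates, toy_gold)
```

Under this definition, an MRR of 0.66 means the correct translation typically sits at or near the top of the list, while 0.187 corresponds to it appearing, on average, around the fifth position or lower.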
