Conference: Indian International Conference on Artificial Intelligence

A Java Implementation of an Extended Word Alignment Algorithm Based on the IBM Models


Abstract

In recent years, statistical word alignment models have been widely used for a variety of Natural Language Processing (NLP) problems. In this paper we describe a platform-independent, object-oriented implementation (in Java) of a word alignment algorithm based on the first three IBM models. This is ongoing work in which we explore possible enhancements to the IBM models, especially for related languages such as the Indian languages. We have been able to improve performance by introducing a similarity measure (the Dice coefficient) and by using a list of cognates and a morphological analyzer. Information about cognates is especially relevant for Indian languages because they share many borrowed and inherited words. For our experiments on English-Hindi word alignment, we also tried using a bilingual dictionary to bootstrap the Expectation Maximization (EM) algorithm. After training on 7399 aligned sentence pairs, we compared the results with GIZA++, an existing word alignment tool. The results indicate that, although the performance of our word aligner is lower than that of GIZA++, it can be improved by adding techniques such as smoothing to address the data sparsity problem. We are also working on further improvements using morphological information and a better similarity measure. The word alignment tool takes the form of an API and is being developed as part of Sanchay, a collection of tools and APIs for NLP with a focus on Indian languages.
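The abstract does not say how the Dice coefficient is computed or how it is combined with the IBM models. As a rough illustration only, the sketch below uses one common co-occurrence formulation over a sentence-aligned corpus, dice(e, f) = 2 * C(e, f) / (C(e) + C(f)), where C counts the sentence pairs in which a word (or word pair) occurs. The class name `DiceSketch`, the method `dice`, and the toy romanized tokens are all hypothetical and are not taken from the paper or from Sanchay.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the paper's or Sanchay's actual API.
final class DiceSketch {

    /**
     * Co-occurrence Dice coefficient between source word e and target word f.
     * source.get(i) and target.get(i) are the tokens of the i-th aligned sentence pair.
     * Returns 2 * C(e, f) / (C(e) + C(f)), or 0.0 if neither word occurs.
     */
    static double dice(String e, String f, List<String[]> source, List<String[]> target) {
        if (source.size() != target.size()) {
            throw new IllegalArgumentException("corpus sides must have the same number of sentences");
        }
        int countE = 0;    // sentence pairs whose source side contains e
        int countF = 0;    // sentence pairs whose target side contains f
        int countBoth = 0; // sentence pairs containing both e and f
        for (int i = 0; i < source.size(); i++) {
            boolean hasE = Arrays.asList(source.get(i)).contains(e);
            boolean hasF = Arrays.asList(target.get(i)).contains(f);
            if (hasE) countE++;
            if (hasF) countF++;
            if (hasE && hasF) countBoth++;
        }
        int denominator = countE + countF;
        return denominator == 0 ? 0.0 : 2.0 * countBoth / denominator;
    }

    public static void main(String[] args) {
        // Toy three-sentence corpus (assumed data, for illustration only).
        List<String[]> source = Arrays.asList(
                new String[] {"the", "house"},
                new String[] {"the", "book"},
                new String[] {"a", "house"});
        List<String[]> target = Arrays.asList(
                new String[] {"ghar"},
                new String[] {"kitab"},
                new String[] {"ghar"});

        System.out.println(dice("house", "ghar", source, target)); // 1.0
        System.out.println(dice("the", "ghar", source, target));   // 0.5
        System.out.println(dice("book", "ghar", source, target));  // 0.0
    }
}
```

A score like this could plausibly be used to seed or re-weight translation probabilities before EM training, but that is an assumption on our part; the abstract gives no details of the integration.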
