A Java Implementation of an Extended Word Alignment Algorithm Based on the IBM Models



Abstract

In recent years, statistical word alignment models have been widely used for various Natural Language Processing (NLP) problems. In this paper we describe a platform-independent, object-oriented implementation (in Java) of a word alignment algorithm based on the first three IBM models. This is ongoing work in which we explore possible enhancements to the IBM models, especially for related languages such as the Indian languages. We have been able to improve performance by introducing a similarity measure (the Dice coefficient) and by using a list of cognates and a morphological analyzer. Information about cognates is especially relevant for Indian languages because these languages have many borrowed and inherited words that are common to more than one language. For our experiments on English-Hindi word alignment, we also tried using a bilingual dictionary to bootstrap the Expectation Maximization (EM) algorithm. After training on 7399 sentence-aligned sentences, we compared the results with GIZA++, an existing word alignment tool. The results indicate that although the performance of our word aligner is lower than that of GIZA++, it can be improved by adding techniques such as smoothing to address the data sparsity problem. We are also working on further improvements using morphological information and a better similarity measure. The word alignment tool takes the form of an API and is being developed as part of Sanchay (a collection of tools and APIs for NLP with a focus on Indian languages).
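The abstract names two building blocks, EM training of IBM-style translation probabilities and a Dice-coefficient similarity measure, without giving details. The following is a minimal Java sketch of both, under stated assumptions: it covers IBM Model 1 only (no NULL word, no distortion or fertility), computes Dice over sentence-level co-occurrence counts, and uses a three-sentence romanized English-Hindi toy corpus invented for illustration. The class name `WordAlignSketch` and all method names are hypothetical and are not taken from the Sanchay API; how the authors actually combine the Dice score with EM is not stated in the abstract.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and corpus are assumptions, not the Sanchay API. */
class WordAlignSketch {

    // Toy parallel corpus (hypothetical): English source, romanized Hindi target.
    static final List<String[]> SRC = Arrays.asList(
            new String[] {"red", "book"},
            new String[] {"red", "house"},
            new String[] {"big", "house"});
    static final List<String[]> TGT = Arrays.asList(
            new String[] {"lal", "kitab"},
            new String[] {"lal", "ghar"},
            new String[] {"bada", "ghar"});

    static boolean contains(String[] sentence, String word) {
        for (String w : sentence) {
            if (w.equals(word)) return true;
        }
        return false;
    }

    /** Dice coefficient 2*C(s,t) / (C(s) + C(t)) over sentence-level co-occurrence. */
    static double dice(String s, String t, List<String[]> src, List<String[]> tgt) {
        int cs = 0, ct = 0, cst = 0;
        for (int i = 0; i < src.size(); i++) {
            boolean hasS = contains(src.get(i), s);
            boolean hasT = contains(tgt.get(i), t);
            if (hasS) cs++;
            if (hasT) ct++;
            if (hasS && hasT) cst++;
        }
        return (cs + ct == 0) ? 0.0 : 2.0 * cst / (cs + ct);
    }

    /** IBM Model 1 EM; returns t(target | source) keyed by "source\ttarget". */
    static Map<String, Double> trainModel1(List<String[]> src, List<String[]> tgt, int iterations) {
        Map<String, Double> t = new HashMap<>();
        // Equal starting weight for every co-occurring pair; the E-step normalizes it.
        for (int i = 0; i < src.size(); i++) {
            for (String f : tgt.get(i)) {
                for (String e : src.get(i)) t.put(e + "\t" + f, 1.0);
            }
        }
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> count = new HashMap<>(); // expected count of (e, f)
            Map<String, Double> total = new HashMap<>(); // expected count of e
            // E-step: distribute each target word over the source words of its sentence.
            for (int i = 0; i < src.size(); i++) {
                for (String f : tgt.get(i)) {
                    double z = 0.0;
                    for (String e : src.get(i)) z += t.get(e + "\t" + f);
                    for (String e : src.get(i)) {
                        double posterior = t.get(e + "\t" + f) / z;
                        count.merge(e + "\t" + f, posterior, Double::sum);
                        total.merge(e, posterior, Double::sum);
                    }
                }
            }
            // M-step: renormalize so that t(. | e) sums to 1 for each source word e.
            for (Map.Entry<String, Double> entry : count.entrySet()) {
                String key = entry.getKey();
                String e = key.substring(0, key.indexOf('\t'));
                t.put(key, entry.getValue() / total.get(e));
            }
        }
        return t;
    }

    public static void main(String[] args) {
        Map<String, Double> t = trainModel1(SRC, TGT, 10);
        System.out.println("t(ghar | house) = " + t.get("house\tghar"));
        System.out.println("t(lal  | house) = " + t.get("house\tlal"));
        System.out.println("dice(house, ghar) = " + dice("house", "ghar", SRC, TGT));
    }
}
```

In a sketch like this, a bilingual dictionary or cognate list could plausibly be used to give known pairs a larger starting weight than the uniform `1.0` before the first E-step; that is one possible reading of "bootstrapping" EM, not a description of the authors' method.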


Bibliographic record

Source
Indian International Conference on Artificial Intelligence | 2007 | 17 pages
Authors
G. Chinnappa; Anil Kumar Singh
Original format
PDF
Chinese Library Classification
TP18-53
Keywords
Word alignment; Statistical machine translation; IBM models; Java; Indian languages; Cognates; Sanchay


