Language Engineering Conference

Towards Indian language spell-checker design


Abstract

This paper addresses the development of spell-checkers for Indian languages, using Bangla, the second most popular language of the Indian subcontinent, as an example. It briefly reviews the problems and the current state of Indian language spell-checkers and then elaborates an approach to Bangla spell-checking. The technique works in two stages. The first stage handles phonetic similarity errors: phonetically similar characters are mapped to a single character code, and a new dictionary D_c is constructed over this reduced alphabet. A phonetically similar but wrongly spelt word can be easily corrected using this dictionary. The second stage handles errors other than phonetic similarity. Here a wrongly spelt word S of n characters is searched for in D_c. If S is a nonword, its first k_1 ≤ n characters will match a valid word in D_c (if k_1 = n, the matching word in D_c must be longer than n). A reversed-word dictionary D_r is also generated, in which the characters of each word are stored in reverse order. If the last k_2 characters of S match a word in D_r, then a single error is located within the intersection of the first k_1 + 1 and the last k_2 + 1 characters of S. We observed that this region is very small compared to the word length in most cases, and that the number of suggested corrections can be drastically reduced using this information. We have applied the approach to correcting Bangla text, where inflection is handled by a simplified morphological analyser. Another problem in Indian languages is the large number of compound words formed by euphony and assimilation; this problem is also carefully addressed.
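The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The phonetic classes, the Latin-script toy lexicon, and every function name below are hypothetical choices made for illustration. Prefix sets stand in for whatever dictionary structure the paper uses, and only single substitution, deletion, and insertion errors are tried.

```python
# Hypothetical phonetic classes: each character maps to one class code.
# (Toy Latin-script stand-in; the paper works on Bangla characters.)
PHONETIC_CODE = {"c": "k", "q": "k", "z": "s", "y": "i"}


def reduce_word(word):
    """Stage 1 mapping: rewrite a word over the reduced alphabet."""
    return "".join(PHONETIC_CODE.get(ch, ch) for ch in word)


def build_dictionaries(lexicon):
    """Build D_c plus prefix sets of D_c and of the reversed dictionary D_r."""
    d_c = {}
    for word in lexicon:
        d_c.setdefault(reduce_word(word), []).append(word)
    fwd = {key[:i] for key in d_c for i in range(len(key) + 1)}
    rev = {key[::-1][:i] for key in d_c for i in range(len(key) + 1)}
    return d_c, fwd, rev


def longest_match(s, prefixes):
    """Length of the longest prefix of s that starts some dictionary word."""
    for k in range(len(s), -1, -1):
        if s[:k] in prefixes:
            return k
    return 0


def error_region(s, fwd, rev):
    """0-based inclusive span shared by the first k1+1 and last k2+1 characters."""
    n = len(s)
    k1 = longest_match(s, fwd)        # forward match against D_c
    k2 = longest_match(s[::-1], rev)  # backward match against D_r
    return max(0, n - k2 - 1), min(k1, n - 1)


def suggest(word, d_c, fwd, rev):
    """Return candidate corrections for word (empty list if none found)."""
    s = reduce_word(word)
    if s in d_c:                      # stage 1: phonetic-similarity error
        return sorted(d_c[s])
    lo, hi = error_region(s, fwd, rev)  # stage 2: localise a single error
    alphabet = sorted({ch for key in d_c for ch in key})
    candidates = set()
    for i in range(lo, hi + 1):
        candidates.add(s[:i] + s[i + 1:])               # deletion
        for ch in alphabet:
            candidates.add(s[:i] + ch + s[i + 1:])      # substitution
    for i in range(lo, min(hi + 1, len(s)) + 1):
        for ch in alphabet:
            candidates.add(s[:i] + ch + s[i:])          # insertion
    found = set()
    for cand in candidates:
        found.update(d_c.get(cand, []))
    return sorted(found)


LEXICON = ["basket", "market", "planet"]  # hypothetical toy lexicon
D_C, FWD, REV = build_dictionaries(LEXICON)
print(suggest("basqet", D_C, FWD, REV))   # resolved in stage 1
print(suggest("baskit", D_C, FWD, REV))   # resolved in stage 2
```

For the nonword "baskit", the forward match stops after "bask" (k_1 = 4) and the backward match after "t" (k_2 = 1), so the candidate region collapses to the single position holding "i". Only edits at that position are generated, which is how the localisation step keeps the suggestion list short.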
