Malaysian Journal of Computer Science

Exhaustive Affix Stripping And A Malay Word Register To Solve Stemming Errors And Ambiguity Problem In Malay Stemmers


Abstract

Stemmers, or word stemming algorithms, reduce a derivative word to its root word by removing all of its affixes. The complexity of Malay Language (ML) morphological rules and of the Malay lexicon makes stemming Malay words difficult. There is no fixed method for determining which affix to remove from a derivative word to produce the correct root word. Furthermore, a derivative word may contain one or more valid root words. Stemming errors persist in previous Malay Language Stemmers (MLS). Regardless of the approach used, they rely on the first affix matched or the first root word found. Hence, some words are under-stemmed or over-stemmed, while words with several valid root words are not stemmed to the correct root word. This multiple-root-word, or ambiguity, problem has never been addressed by previous MLS. To solve the over-stemming and under-stemming errors, we propose an approach that exhaustively strips all matched affixes to ensure that a valid root word is extracted. In addition, we propose the use of a Malay Word Register to address the ambiguity problem of determining the correct root word. We tested the proposed approach with words from newspaper articles, a Malay translation of the Quran, history essays, and words incorrectly stemmed by previous stemmers. The results show that the stemmer achieves 99.8% accuracy. There were no stemming errors. The shortfall from perfect accuracy is due to the approach to the ambiguity problem.
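The two ideas in the abstract, stripping every matched affix rather than stopping at the first match and consulting a register when more than one root survives, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the affix lists, the toy lexicon `ROOTS`, and the shape of `WORD_REGISTER` (a plain mapping from an ambiguous word to its preferred root) are hypothetical and are not taken from the paper, which does not describe its data structures here. Spelling changes caused by Malay prefixes are ignored.

```python
from typing import Optional, Set

# Toy data, assumed for illustration only (not the paper's affix rules or lexicon).
PREFIXES = ("ber", "be", "di", "me", "ter")
SUFFIXES = ("kan", "an", "i")
ROOTS = {"ajar", "beri", "ikan", "makan"}

# Hypothetical register: maps an ambiguous derivative word to its preferred root.
WORD_REGISTER = {"berikan": "beri"}


def candidate_roots(word: str) -> Set[str]:
    """Collect every lexicon root reachable by stripping any sequence of matched affixes."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = [word]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in ROOTS:
            found.add(current)
        # Try every matching prefix and suffix, not just the first one.
        for prefix in PREFIXES:
            if current.startswith(prefix) and len(current) > len(prefix):
                stack.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix) and len(current) > len(suffix):
                stack.append(current[:-len(suffix)])
    return found


def stem(word: str) -> Optional[str]:
    """Return the single root, or consult the register when several roots are valid."""
    roots = candidate_roots(word)
    if not roots:
        return None
    if len(roots) == 1:
        return next(iter(roots))
    # Ambiguous: the register decides; None means the ambiguity is unresolved.
    return WORD_REGISTER.get(word)


print(sorted(candidate_roots("berikan")))
print(stem("berikan"))
print(stem("diajarkan"))
```

In this toy setup, "berikan" can be read as "ber" + "ikan" or as "beri" + "kan", so a stemmer that stops at the first matched affix would return whichever root its rule order happens to reach first. The exhaustive search exposes both candidates, and the choice is deferred to the register.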