Conference: Annual International Conference on Intelligent Text Processing and Computational Linguistics

An Improved Stemming Approach Using HMM for a Highly Inflectional Language


Abstract

Stemming is a common method for morphological normalization of natural language texts. Modern information retrieval systems rely on such normalization techniques for automatic document processing tasks. High-quality stemming is difficult in highly inflectional Indic languages, and little research has addressed the design of stemming algorithms for them. In this study, we focus on the problem of stemming texts in Assamese, a low-resource Indic language spoken in the North-Eastern part of India by approximately 30 million people. Stemming is hard in Assamese because morphological inflections commonly appear as single-letter suffixes: more than 50% of the inflections in Assamese take this form. Such single-letter inflections cause ambiguity when predicting the underlying root word. Therefore, we propose a new method that combines a rule-based algorithm for predicting multiple-letter suffixes and an HMM-based algorithm for predicting single-letter suffixes. The combined approach can predict morphologically inflected words with 92% accuracy.
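The abstract does not give the rule set, the HMM topology, or the training data, so the following is only a minimal sketch of how a two-stage stemmer of this kind could be wired together, not the authors' implementation. Everything in it is an assumption made for illustration: the suffix lists are synthetic placeholders (not real Assamese morphology), the two hidden states `KEEP` and `STRIP`, the observation scheme (the final letter of each word), and all probabilities are invented, and the names `strip_multi`, `viterbi`, and `stem_sentence` are hypothetical.

```python
import math

# Synthetic placeholder suffixes (assumption: NOT real Assamese morphology).
MULTI_SUFFIXES = ["zzo", "qa"]
SINGLE_SUFFIXES = {"x", "y"}

# Assumed hidden states: KEEP = final letter belongs to the root,
# STRIP = final letter is a single-letter suffix.
STATES = ("KEEP", "STRIP")

# Invented toy parameters; a real model would estimate them from tagged data.
START = {"KEEP": 0.6, "STRIP": 0.4}
TRANS = {"KEEP": {"KEEP": 0.5, "STRIP": 0.5},
         "STRIP": {"KEEP": 0.7, "STRIP": 0.3}}
EMIT = {"KEEP": {"x": 0.05, "y": 0.25, "other": 0.70},
        "STRIP": {"x": 0.59, "y": 0.40, "other": 0.01}}


def strip_multi(word, min_root=2):
    """Rule-based step: remove the longest matching multi-letter suffix."""
    for suf in sorted(MULTI_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_root:
            return word[:-len(suf)]
    return word


def viterbi(obs, start, trans, emit):
    """Standard Viterbi decoding in log space; returns the best state path."""
    score = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in STATES}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(trans[p][s]))
            row[s] = score[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def stem_sentence(words, start, trans, emit, min_root=2):
    """Apply the rule-based step, then let the HMM decide single-letter cases.

    Assumes every token in `words` is a non-empty string.
    """
    if not words:
        return []
    stems = [strip_multi(w, min_root) for w in words]
    # Observation per word: its final letter if it is a candidate suffix.
    obs = [w[-1] if w[-1] in SINGLE_SUFFIXES else "other" for w in stems]
    path = viterbi(obs, start, trans, emit)
    return [w[:-1] if s == "STRIP" and o != "other" and len(w) - 1 >= min_root else w
            for w, s, o in zip(stems, path, obs)]


print(stem_sentence(["talx", "mirzzo", "pony", "duk"], START, TRANS, EMIT))
print(stem_sentence(["talx", "pony"], START, TRANS, EMIT))
```

With these toy parameters the final `y` of `pony` is stripped in the first sentence but kept in the second, because the decoded state of the preceding word differs. This is the kind of context-dependent decision an HMM can make for ambiguous single-letter endings, which a fixed suffix list alone cannot.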
