Stemming Resource-Poor Indian Languages


Abstract

Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages, single-letter suffixes predominate, which creates problems during suffix stripping. As a result, we introduce an HMM-based hybrid approach to classify the mismatched last character. For each word, the stem is extracted by calculating the most probable path in four HMM states. At each step we measure the stemming accuracy for each language. Using the hybrid approach, we obtain 94% accuracy for Assamese and Bengali, and 87% and 82% for Bishnupriya Manipuri and Bodo, respectively. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
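The rule-based stage described above (strip a suffix, but protect frequent words with a dictionary) might look roughly like the following Python sketch. It is not the authors' implementation: the romanized suffix list, the frequent-word set, the minimum stem length and the longest-match-first policy are all assumptions made for illustration, and the HMM stage is not shown.

```python
# Toy, romanized suffix list. This is an assumption for illustration only,
# not the rule set used in the paper.
SUFFIXES = ("gulo", "der", "ra", "ke", "te", "r")

# Hypothetical dictionary of frequent words that should never be stripped.
FREQUENT_WORDS = {"amar", "upor"}

# Assumed lower bound on stem length, to limit over-stemming of short words.
MIN_STEM_LEN = 2


def stem(word, suffixes=SUFFIXES, frequent_words=FREQUENT_WORDS):
    """Return a stem for `word` using longest-match suffix stripping."""
    # Dictionary lookup first: frequent words are returned unchanged.
    if word in frequent_words:
        return word
    # Try longer suffixes before shorter ones, so that a single-letter
    # suffix such as "r" does not pre-empt a longer match such as "der".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    # No rule applied: the word is treated as its own stem.
    return word


if __name__ == "__main__":
    for w in ("chheleder", "boigulo", "amar", "ke"):
        print(w, "->", stem(w))
```

In this toy setup, calling `stem("amar", frequent_words=set())` strips the final "r". This is the kind of single-letter over-stemming that the dictionary is meant to catch, and that the abstract's HMM stage targets more generally.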
