Conference: International Conference on Big Data Analytics

Comparative Analysis of Rule-Based, Dictionary-Based and Hybrid Stemmers for Gujarati Language

Abstract

Gujarati is an Indo-Aryan language spoken widely by the people of the Indian state of Gujarat. It is actively used for communication in the Gujarat government's educational institutes and offices, in local industries and businesses, and in media such as newspapers, magazines, radio and television programs. In all these areas, Internet access is now a pressing requirement, and Internet use will increase if web content is provided in the regional language with the help of Natural Language Processing (NLP). In NLP, stemming plays a vital role in retrieving accurate content and producing effective results for web search queries: it identifies the root word from the morphological variants of a word. There are three typical approaches to stemming: rule-based, dictionary-based and hybrid. In this paper, we present a comparative empirical study of these three approaches for the Gujarati language, with the aim of evaluating the effectiveness of the different types of stemmers. First, we discuss the rule-based algorithm and present its evaluation with 152 different suffix-stripping rules. Next, we illustrate a stemming mechanism developed using a Gujarati dictionary that contains around 20,000 root words. Lastly, we discuss the hybrid approach, which combines the rule-based and dictionary-based approaches. Experimental results reveal that the hybrid approach produces more accurate stems than either the rule-based or the dictionary-based approach.
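The three approaches named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the four suffixes and two root words stand in for the paper's 152 rules and roughly 20,000-word dictionary, which are not reproduced here, and the function names (`rule_based_stem`, `dictionary_stem`, `hybrid_stem`) are hypothetical. The abstract does not say how the hybrid stemmer combines the two methods, so the order used below (validate the stripped form against the dictionary, then fall back to a dictionary prefix match) is only one plausible arrangement.

```python
# Illustrative data only: a tiny stand-in for the paper's 152 suffix rules
# and ~20,000-root dictionary, neither of which is reproduced here.
SUFFIXES = ["માં", "થી", "ને", "ો"]   # hypothetical sample of Gujarati suffixes
ROOTS = {"ઘર", "પુસ્તક"}               # hypothetical sample of root words


def rule_based_stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least
    min_stem_len code points of the word as the stem."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


def dictionary_stem(word, roots=ROOTS):
    """Return the longest dictionary root that is a prefix of the word,
    or None when no root matches."""
    for end in range(len(word), 0, -1):
        if word[:end] in roots:
            return word[:end]
    return None


def hybrid_stem(word, suffixes=SUFFIXES, roots=ROOTS):
    """Assumed combination: trust a rule-based stem only if the dictionary
    confirms it; otherwise try a dictionary prefix match; otherwise fall
    back to the unconfirmed rule-based stem."""
    if word in roots:
        return word
    candidate = rule_based_stem(word, suffixes)
    if candidate in roots:
        return candidate
    found = dictionary_stem(word, roots)
    return found if found is not None else candidate


if __name__ == "__main__":
    for w in ["ઘરમાં", "ઘરથી", "પુસ્તકો"]:
        print(w, "->", hybrid_stem(w))
```

Checking the stripped form against the dictionary is what lets a hybrid design reject over-stemmed output from the rules while still handling words the dictionary does not list, which is consistent with the accuracy advantage the abstract reports, though the paper's actual mechanism may differ.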
