Conference: International Conference on Frontiers of Intelligent Computing: Theory and Applications

A Rule-Based Approach to Identify Stop Words for Gujarati Language

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file- or dictionary-based: they rely on a hard-coded, static, nonstandardized, individually compiled list of stop words. These approaches are time-consuming and complex, because the file or dictionary must be prepared by collecting candidate stop words from a large vocabulary, a complex framework, and morphologically variant Gujarati documents. Other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training data set. The authors claim that this paper is the first to propose a dynamic approach that is independent of all of these factors: use of a file or dictionary, word length, word frequency, and training data set. The approach consists of 11 rules and targets the automatic, dynamic identification of a complete list of Gujarati stop words. The proposed algorithm was evaluated on nearly 600 Gujarati documents, divided into routine and domain-specific categories. The average accuracies of 98.10% and 94.08%, respectively, indicate that the approach is effective and promising for NLP tasks involving documents written in Gujarati.
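For context, the file- or dictionary-based baseline that the abstract contrasts itself with can be sketched roughly as follows. This is a minimal illustration only: the six-word list and the function name `remove_stop_words` are assumptions made for the example, and the paper's 11 rules are not given in the abstract, so they are not reproduced here.

```python
# Hypothetical, hand-picked sample of common Gujarati function words.
# This is NOT the paper's stop word list and NOT its rule-based method.
STATIC_STOP_WORDS = frozenset({"અને", "છે", "તે", "આ", "પણ", "માટે"})


def remove_stop_words(text, stop_words=STATIC_STOP_WORDS):
    """Drop whitespace-delimited tokens that appear in a fixed stop word list."""
    tokens = text.split()
    return [tok for tok in tokens if tok not in stop_words]


sentence = "આ પુસ્તક સારું છે અને સસ્તું પણ છે"
print(remove_stop_words(sentence))
```

A static list like this only removes exact matches, so any stop word or inflected variant missing from the list passes through unchanged. That is the kind of limitation the abstract attributes to hard-coded lists.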
