Conference: International Conference on Frontiers of Intelligent Computing: Theory and Applications

A Rule-Based Approach to Identify Stop Words for Gujarati Language

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file or dictionary based: they rely on a hard-coded, static, nonstandardized, and individually created list of stop words. These approaches are time-consuming and complex because the file or dictionary must be prepared by collecting possible stop words from a large vocabulary, a complex framework, and morphologically variant Gujarati documents. Other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training data set. For the first time, this paper proposes a dynamic approach that is independent of all of these factors: the use of a file or dictionary, word length, word frequency, and a training data set. The approach is based on 11 rules and focuses on the automatic and dynamic identification of a complete list of Gujarati stop words. The proposed algorithm was evaluated on nearly 600 Gujarati documents, divided into routine and domain-specific categories. The resulting average accuracies of 98.10% and 94.08%, respectively, show that the approach is effective and promising for NLP tasks involving documents written in Gujarati.
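For orientation, the file- or dictionary-based baseline that the abstract contrasts itself with can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-word stop word set, the function name `remove_stop_words`, and the whitespace tokenization are all hypothetical and are not taken from the paper. The paper's 11 rules are not given in the abstract and are not reproduced here.

```python
# Hypothetical, deliberately tiny static stop word list (assumption, not from the paper).
# A real file- or dictionary-based list would be far larger and hand-curated.
STATIC_STOP_WORDS = {"અને", "છે", "તે", "આ", "પણ"}


def remove_stop_words(text, stop_words=STATIC_STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not in `stop_words`."""
    return [token for token in text.split() if token not in stop_words]


if __name__ == "__main__":
    # "આ પુસ્તક સારું છે" is roughly "This book is good".
    print(remove_stop_words("આ પુસ્તક સારું છે"))
```

Because the lookup is an exact string match against a fixed set, any stop word or inflected variant missing from the list passes through unfiltered, which is the limitation the abstract attributes to static lists.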