A Rule-Based Approach to Identify Stop Words for Gujarati Language

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file- or dictionary-based: they rely on a hard-coded, static, nonstandardized, and individually created list of stop words. These approaches are time-consuming and complex because the file or dictionary must be prepared by collecting candidate stop words from a large vocabulary, within a complex framework, for morphologically variant Gujarati documents. The other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training data set. For the first time, this paper proposes a dynamic approach that is independent of all of these factors: use of a file or dictionary, word length, word frequency, and training data set. An approach based on 11 rules is presented for the automatic and dynamic identification of a complete list of Gujarati stop words. The proposed algorithm was evaluated on nearly 600 Gujarati documents, grouped into routine and domain-specific categories. The respective average accuracies of 98.10% and 94.08% show that the proposed approach is effective and promising for NLP tasks involving documents written in Gujarati.
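
To make the contrast in the abstract concrete, the following is a minimal sketch of the file- or dictionary-based baseline it describes, not the paper's 11-rule method, whose rules are not given here. The word list, the function name `remove_stop_words`, the sample sentence, and the whitespace tokenization are all assumptions made for illustration.

```python
# Minimal sketch of a static, dictionary-based stop word filter, i.e. the kind
# of baseline the abstract argues against. This is NOT the paper's method.

# Hypothetical, hand-picked list of common Gujarati function words (assumption).
STATIC_STOP_WORDS = frozenset({"અને", "છે", "તે", "માં", "પણ"})


def remove_stop_words(tokens, stop_words=STATIC_STOP_WORDS):
    """Return the tokens that do not appear in the fixed stop word list."""
    return [token for token in tokens if token not in stop_words]


# Whitespace tokenization is a simplification chosen for this example.
sentence = "રામ અને શ્યામ શાળામાં છે"
tokens = sentence.split()
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

The filter matches exact tokens only, so its coverage depends entirely on how completely the list was compiled in advance, which is the limitation the abstract attributes to static lists.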


Bibliographic details

Source
International Conference on Frontiers of Intelligent Computing: Theory and Applications | 2017 | xxiv, 809 pages | 10 pages in total
Authors
Rajnish M. Rakholia; Jatinderkumar R. Saini
Original format: PDF
Chinese Library Classification: TP301.4-532
Keywords
Gujarati; Natural Language Processing (NLP); Rule-based approach; Stop word

Indexed on 2022-08-21 12:15:54

Similar literature

1. Hierarchical and sequential processing of language: A response to: Ding, Melloni, Tian, and Poeppel (2017). Rule-based and word-level statistics-based processing of language: insights from neuroscience. Language, Cognition and Neuroscience. [J]. Frank Stefan L., Christiansen Morten H. Language, cognition and neuroscience. 2018, No. 9

2. A Novel Approach for Gujarati Handwritten Text Lines and Words Segmentation [J]. J. V. Nasriwala, B. C. Patel. National Journal of System and Information Technology. 2014, No. 1

3. Gujarati Language Speech Recognition System for Identifying Smartphone Operation Commands [J]. Jigisha K. Patel, Pritesh N. Patel, Paresh V. Virparia. National Journal of System and Information Technology. 2015, No. 2

4. A Rule-Based Approach to Identify Stop Words for Gujarati Language [C]. Rajnish M. Rakholia, Jatinderkumar R. Saini. International Conference on Frontiers of Intelligent Computing: Theory and Applications. 2017

5. AN ANALYSIS OF WORDS SELECTED BY KINDERGARTEN AND FIRST-GRADE CHILDREN FROM LANGUAGE EXPERIENCE STORIES IN ORDER TO TEST THREE BASIC ASSUMPTIONS OF THE LANGUAGE-EXPERIENCE APPROACH TO TEACHING BEGINNING READING. [D]. MUTHLEB, VERA EVELYN PATE. 1976

6. Rule-based and Word-level Statistics-based Processing of Language: Insights from Neuroscience [O]. Nai Ding, Lucia Melloni, Xing Tian

7. Rule-based Approach for Arabic Root Extraction: New Rules to Directly Extract Roots of Arabic Words [O]. Fatma Abu Hawas, Keith E. Emmert. 2016

8. Military Typesetting Equipment and Systems for Indo-Aryan and Dravidian Languages (Hindi, Marathi, Bengali, Punjabi, Gujarati, Malayalam, Tamil, and Telugu) (1961-1963) [R]. Nitenson, E. 1964
