A Rule-Based Approach to Identify Stop Words for Gujarati Language

Abstract

Stop word removal is an important step in many natural language processing (NLP) tasks. To date, no standardized, exhaustive, and dynamic stop word list exists for documents written in Gujarati, an Indian language spoken by nearly 66 million people worldwide. Most existing stop word removal approaches are file or dictionary based: they rely on a hard-coded, static, nonstandardized, and individually created list of stop words. These approaches are time consuming and complex, because the file or dictionary must be prepared by collecting possible stop words from a large vocabulary, a complex framework, and morphologically variant Gujarati documents. The other approaches proposed in the literature are also restricted by their dependence on word length, word frequency, and/or a training data set. This paper proposes, for the first time, a dynamic approach that is independent of all of these factors, namely the use of a file or dictionary, word length, word frequency, and a training data set. The approach is based on 11 rules and focuses on the automatic and dynamic identification of a complete list of Gujarati stop words. Extensive empirical evidence is presented by applying the proposed algorithm to nearly 600 Gujarati documents, divided into routine and domain-specific categories. The resulting average accuracies of 98.10% and 94.08%, respectively, show that the proposed approach is effective and promising for NLP tasks involving documents written in Gujarati.
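
For contrast with the rule-based method, the file or dictionary based baseline that the abstract criticizes can be sketched as follows. This is only an illustration under stated assumptions: the six-word list, the example sentence, and the names `STATIC_STOP_WORDS` and `remove_stop_words` are hypothetical and are not taken from the paper, and the paper's 11 rules are not reproduced here because the abstract does not state them.

```python
# Hypothetical, deliberately tiny static list (assumption, not from the paper).
# A real dictionary-based system would load a much larger hand-built file.
STATIC_STOP_WORDS = frozenset({"છે", "અને", "આ", "તે", "પણ", "માટે"})


def remove_stop_words(text, stop_words=STATIC_STOP_WORDS):
    """Return the whitespace-separated tokens of `text` that are not in `stop_words`.

    This is exact string matching only: an inflected or differently spelled
    form of a listed word is not removed.
    """
    return [token for token in text.split() if token not in stop_words]


# Hypothetical example sentence made up for this sketch.
sentence = "આ પુસ્તક અને તે પેન સારી છે"
content_tokens = remove_stop_words(sentence)
print(content_tokens)
```

Because the lookup is an exact match against a fixed set, every morphological variant has to be added to the list by hand, which is the maintenance cost the abstract attributes to static lists.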

Bibliographic details

Source
International Conference on Frontiers of Intelligent Computing: Theory and Applications | 2017 | xxiv, 809 pages | 10 pages in total
Authors
Rajnish M. Rakholia; Jatinderkumar R. Saini
Original format: PDF
Chinese Library Classification: TP301.4-532
Keywords
Gujarati; Natural Language Processing (NLP); Rule-based approach; Stop word;

Similar literature

1. Hierarchical and sequential processing of language: A response to: Ding, Melloni, Tian, and Poeppel (2017). Rule-based and word-level statistics-based processing of language: insights from neuroscience. Language, Cognition and Neuroscience. [J]. Frank Stefan L., Christiansen Morten H. Language, Cognition and Neuroscience. 2018, No. 9
2. A Novel Approach for Gujarati Handwritten Text Lines and Words Segmentation [J]. J. V. Nasriwala, B. C. Patel. National Journal of System and Information Technology. 2014, No. 1
3. Gujarati Language Speech Recognition System for Identifying Smartphone Operation Commands [J]. Jigisha K. Patel, Pritesh N. Patel, Paresh V. Virparia. National Journal of System and Information Technology. 2015, No. 2
4. A Rule-Based Approach to Identify Stop Words for Gujarati Language [C]. Rajnish M. Rakholia, Jatinderkumar R. Saini. International Conference on Frontiers of Intelligent Computing: Theory and Applications. 2017
5. AN ANALYSIS OF WORDS SELECTED BY KINDERGARTEN AND FIRST-GRADE CHILDREN FROM LANGUAGE EXPERIENCE STORIES IN ORDER TO TEST THREE BASIC ASSUMPTIONS OF THE LANGUAGE-EXPERIENCE APPROACH TO TEACHING BEGINNING READING. [D]. MUTHLEB, VERA EVELYN PATE. 1976
6. Rule-based and Word-level Statistics-based Processing of Language: Insights from Neuroscience [O]. Nai Ding, Lucia Melloni, Xing Tian
7. Rule-based Approach for Arabic Root Extraction: New Rules to Directly Extract Roots of Arabic Words [O]. Fatma Abu Hawas, Keith E. Emmert. 2016
8. Military Typesetting Equipment and Systems for Indo-Aryan and Dravidian Languages (Hindi, Marathi, Bengali, Punjabi, Gujarati, Malayalam, Tamil, and Telugu) (1961-1963) [R]. Nitenson, E. 1964
