Journal: Knowledge-Based Systems

A novel unsupervised corpus-based stemming technique using lexicon and corpus statistics

Abstract

Word stemming is a widely used mechanism in Natural Language Processing, Information Retrieval, and Language Modeling. Language-independent stemmers discover classes of morphologically related words from the ambient corpus without using any language-specific rules. In this article, we propose a fully unsupervised, language-independent text stemming technique that clusters morphologically related words from the corpus of the language using both lexical and co-occurrence features, such as lexical similarity, suffix knowledge, and co-occurrence similarity. The method applies to a wide range of inflectional languages, as it identifies morphological variants formed through different linguistic processes such as affixation, compounding, and conversion. The proposed approach has been tested in an Information Retrieval application for four languages (English, Marathi, Hungarian, and Bengali) using standard TREC, CLEF, and FIRE test collections. A significant improvement over word-based retrieval, five other corpus-based stemmers, and rule-based stemmers has been achieved in all the languages. Besides information retrieval, the proposed approach has also been tested on text classification and inflection removal tasks. Our algorithm outperformed the baseline methods in all test scenarios. Thus, we achieved the objective of developing a multipurpose stemming algorithm that can be used not only for information retrieval but also for non-traditional tasks such as text classification, sentiment analysis, and inflection removal. © 2019 Elsevier B.V. All rights reserved.
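The abstract names the ingredients of the approach (lexical similarity, suffix knowledge, co-occurrence similarity, clustering) but gives no formulas, so the following is only a minimal sketch of the general idea and not the authors' algorithm. Every specific here is an assumption made for illustration: the shared-prefix lexical measure, the Jaccard co-occurrence measure over documents, the thresholds, and the function names. Suffix knowledge is omitted entirely.

```python
from collections import defaultdict
from itertools import combinations


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexical_similarity(a: str, b: str) -> float:
    """Shared-prefix length over the longer word's length (assumed measure)."""
    return common_prefix_len(a, b) / max(len(a), len(b))


def cooccurrence_similarity(a: str, b: str, postings: dict) -> float:
    """Jaccard overlap of the sets of documents containing each word (assumed measure)."""
    union = postings[a] | postings[b]
    return len(postings[a] & postings[b]) / len(union) if union else 0.0


def cluster_variants(docs, min_prefix=3, lex_threshold=0.5, cooc_threshold=0.2):
    """Map each word to a stem label by clustering lexically and
    distributionally similar words. All thresholds are illustrative."""
    # Record which documents each word occurs in.
    postings = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in doc.lower().split():
            postings[word].add(doc_id)

    # Union-find structure over the vocabulary.
    parent = {w: w for w in postings}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    # Merge word pairs that pass both the lexical and co-occurrence tests.
    for a, b in combinations(sorted(postings), 2):
        if common_prefix_len(a, b) < min_prefix:
            continue
        if (lexical_similarity(a, b) >= lex_threshold
                and cooccurrence_similarity(a, b, postings) >= cooc_threshold):
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for w in postings:
        clusters[find(w)].append(w)

    # Label each cluster with its shortest member (ties broken alphabetically).
    return {
        w: min(members, key=lambda m: (len(m), m))
        for members in clusters.values()
        for w in members
    }


docs = [
    "connect connected connecting",
    "connect connection",
    "walk walked walking",
    "connected walk",
]
stems = cluster_variants(docs)
```

On this toy corpus the variants of "connect" and of "walk" end up in two separate clusters. Requiring both tests to pass is the point of combining lexical and corpus evidence: words that merely share a prefix are not merged unless they also occur in the same documents. The all-pairs loop is quadratic in vocabulary size, so it is suitable only for small examples.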