Journal: Expert Systems with Applications

Novel term weighting schemes for document representation based on ranking of terms and Fuzzy logic with semantic relationship of terms


Abstract

Weighting and normalization are the most important factors that can significantly affect text representation. This paper presents two novel term weighting schemes for representing text documents: (i) a scheme based on Term Frequency - Ranking of Term Frequency (TF-RTF), and (ii) a scheme based on Term Frequency - Ranking of Fuzzy logic with Semantic relationship of Terms (TF-RFST). In TF-RTF, the rank of each term in a document gives its priority within that document, and these priorities are used for document representation. In TF-RFST, each term is represented by its own frequency together with the frequency of its semantically related terms. Hence, the rank of each term is based on the combined frequencies of the term and its semantically related terms under a specific weighting scheme. With appropriate weighting schemes such as TF-RTF and TF-RFST, the proposed methods provide better clustering performance in terms of accuracy, entropy, recall and F-Measure than previously suggested methods, such as word count, Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Corpus Frequency (TF-ICF), Multi Aspect TF (MATF), BM25 and BM25F. Experiments carried out on the Reuters-8, Reuters-52 and WebKB data sets with the K-means and K-means++ clustering algorithms demonstrate the effectiveness of the proposed term weighting schemes. (C) 2019 Elsevier Ltd. All rights reserved.
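
The abstract does not give the weighting formulas, so the following Python sketch only illustrates the general idea of rank-based term weighting; it is not the paper's method. The function names, the dense-ranking rule, the `tf * rank_score` combination, and the `related` lookup table are all assumptions made for illustration, and the fuzzy-logic component of TF-RFST is omitted entirely.

```python
from collections import Counter


def rank_scores(values):
    """Map each key to a score in (0, 1] by dense rank; the largest value gets 1.0.

    Hypothetical scoring rule, not taken from the paper.
    """
    distinct = sorted(set(values.values()), reverse=True)
    n = len(distinct)
    position = {v: i for i, v in enumerate(distinct)}
    return {k: (n - position[v]) / n for k, v in values.items()}


def tf_rank_weights(tokens):
    """TF-RTF-like sketch: scale each term frequency by the score of its frequency rank."""
    tf = Counter(tokens)
    scores = rank_scores(tf)
    return {t: tf[t] * scores[t] for t in tf}


def tf_semantic_rank_weights(tokens, related):
    """TF-RFST-like sketch: rank terms by their own frequency plus the
    frequency of their related terms, then scale the raw frequency.

    `related` is a hypothetical dict mapping a term to its related terms.
    """
    tf = Counter(tokens)
    combined = {
        t: tf[t] + sum(tf[r] for r in related.get(t, ()) if r != t)
        for t in tf
    }
    scores = rank_scores(combined)
    return {t: tf[t] * scores[t] for t in tf}


doc = ["oil", "oil", "oil", "price", "price", "crude"]
plain = tf_rank_weights(doc)
semantic = tf_semantic_rank_weights(doc, {"crude": ["oil"], "oil": ["crude"]})
```

In this toy document, "crude" occurs once and ranks last under the plain scheme, but it shares the top rank with "oil" once the frequency of its related term is added. The vectors produced this way could then be passed to a clustering algorithm such as K-means, as in the experiments the abstract describes.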
