Journal: Concurrency, practice and experience

An improved term weighting scheme for text classification



Abstract

Text representation is a necessary first step in text classification (TC), and achieving high TC performance requires an information-rich term weighting scheme. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting scheme to date, but it suffers from two deficiencies. First, the global weighting factor IDF approaches infinity if a term does not occur in any text of the corpus. Second, IDF equals zero if a term appears in every text. To offset these drawbacks, we first conduct an in-depth analysis of current term weighting schemes and then propose an improved scheme called term frequency-inverse exponential frequency (TF-IEF), together with several variants. The proposed method uses a new global weighting factor, IEF, in place of the log-like IDF to characterize a term's global weight in the corpus, which greatly reduces the influence of features (terms) with a high local weighting factor TF on the term weights. As a result, more representative features can be generated. We carried out a series of experiments on two commonly used data sets (corpora), using Naive Bayes and support vector machine classifiers, to validate the performance of the proposed schemes. The experimental results show that the proposed term weighting schemes outperform the compared schemes.
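The two TF-IDF deficiencies named in the abstract can be illustrated with a minimal sketch. It assumes the common unsmoothed definition IDF = log(N / df), where N is the number of texts and df is the number of texts containing the term; the abstract itself does not state the exact formula. The function names `idf` and `tf_idf` and the toy corpus are hypothetical, and the sketch does not implement the proposed IEF factor, whose definition is not given here.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Unsmoothed IDF, log(N / df). Assumed form, not taken from the paper."""
    if doc_freq == 0:
        # Deficiency 1: the factor is unbounded when the term occurs in no text.
        return math.inf
    return math.log(num_docs / doc_freq)


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Raw term count in `doc` multiplied by the unsmoothed IDF over `corpus`."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0  # avoid 0 * inf for terms absent from the whole corpus
    df = sum(1 for d in corpus if term in d)
    return tf * idf(len(corpus), df)


# Hypothetical toy corpus of three tokenized texts.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "cat"],
]

print(idf(len(corpus), 0))                  # term in no text: infinite weight
print(idf(len(corpus), len(corpus)))        # term in every text: weight 0.0
print(tf_idf("the", corpus[0], corpus))     # "the" carries no weight at all
print(tf_idf("cat", corpus[2], corpus))     # 2 * log(3 / 2)
```

In this sketch a term that appears in every text always receives a weight of zero regardless of its TF, and an unseen term has an unbounded IDF; these are the two boundary cases the abstract says motivate a different global weighting factor.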
