Journal: 《数字图书馆论坛》 (Digital Library Forum)

Text Feature Extraction from Agricultural Science and Technology Literature Based on an Improved TF-IDF-CHI Algorithm

             

Abstract

This paper addresses the shortcomings of the traditional TF-IDF method when it is applied to literature from closely related agricultural research fields, where text features overlap heavily, and verifies the improved method through text classification experiments. The improved method, ImpTF-IDF-CHI, reconstructs the feature-word weighting function by introducing chi-square test values and a term-frequency correction factor. Feature words are extracted with four methods: ImpTF-IDF-CHI, document frequency, information gain, and TF-IDF. The extracted feature words are then used in naive Bayes text classification, and the methods are compared on the classification results. Across 58 groups of feature extraction and text classification experiments covering the four methods, ImpTF-IDF-CHI gave the best results: naive Bayes classification reached an average accuracy of 94% and an F1 value of 0.844. The experiments show that, for feature extraction from texts in closely related agricultural research fields, ImpTF-IDF-CHI offers high accuracy, good stability, and strongly representative subject terms. The method can be applied to text classification, feature expression, and topic extraction for this kind of literature.
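The idea of weighting feature words by TF-IDF together with a chi-square value can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual weighting function or its correction factor, so everything here is an assumption for illustration only: the function names `chi_square` and `tfidf_chi_weights` are hypothetical, the weight is taken as a plain product tf × idf × χ², the IDF uses a common smoothed form, and the correction factor is left out entirely.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A zero denominator means the term (or class) does not vary, so it
    # carries no class information; score it as 0.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tfidf_chi_weights(docs, labels, target):
    """Score terms for class `target` as tf * idf * chi2.

    Assumed combination for illustration, not the paper's formula.
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df = Counter()      # document frequency over the whole corpus
    df_in = Counter()   # document frequency inside the target class
    tf = Counter()      # raw term counts inside the target class
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == target:
                df_in[t] += 1
        if y == target:
            tf.update(tokens)

    total = sum(tf.values())
    weights = {}
    for t, count in tf.items():
        a = df_in[t]
        b = df[t] - a
        c = n_target - a
        d = n_docs - n_target - b
        idf = math.log((1 + n_docs) / (1 + df[t])) + 1  # smoothed IDF
        weights[t] = (count / total) * idf * chi_square(a, b, c, d)
    return weights


# Toy corpus (made up): two "rice" documents and two "wheat" documents.
docs = [
    ["rice", "yield", "soil"],
    ["rice", "pest", "soil"],
    ["wheat", "yield", "soil"],
    ["wheat", "pest", "soil"],
]
labels = ["rice", "rice", "wheat", "wheat"]
weights = tfidf_chi_weights(docs, labels, "rice")
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, round(w, 3))
```

In this toy corpus, terms shared evenly by both classes (such as "soil") receive a chi-square value of zero and drop out, which is the kind of behavior that helps when closely related fields share most of their vocabulary. TF-IDF alone would still give such terms a nonzero weight.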
