An Improved Text Representation Algorithm for the Vector Space Model
Journal of Chongqing University of Technology (Natural Science)




Text representation is the process of converting human-readable text into a data structure that a computer can process, and it is a fundamental problem in text information processing. In the Vector Space Model (VSM), the tf-idf algorithm for text representation considers only the relationship between a term feature and a document; it ignores the relationship between a term and a class. To address this, the paper introduces the chi-square distribution from mathematical statistics and uses it to improve tf-idf, giving a new algorithm, tf-idf-cθ. The algorithm takes a term's chi-square value c as one factor of the text representation, using c to measure how differently the term is distributed across text classes, and it also introduces a part-of-speech factor θ; together these yield a text representation based on the improved VSM. Text classification experiments on short texts were run with the algorithm before and after the improvement. The results show that the improved algorithm performs better and partly resolves the problem of relevance between term features and classes.
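The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how a tf-idf-cθ style weight might be computed. It assumes the four factors are simply multiplied, that c is the largest chi-square statistic of a term over all classes (computed from document-level presence counts), and that θ comes from a hypothetical `pos_weight` mapping from term to part-of-speech weight. None of these choices is confirmed by the source, and the function names are illustrative.

```python
import math
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: docs in the class containing the term
    n10: docs outside the class containing the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def tf_idf_c_theta(docs, labels, pos_weight=None):
    """Return one {term: weight} dict per document.

    docs: list of token lists; labels: class label of each document.
    pos_weight: hypothetical mapping term -> theta (1.0 when missing).
    Assumption: weight = tf * idf * c * theta.
    """
    pos_weight = pos_weight or {}
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    class_size = Counter(labels)
    df = Counter(t for s in doc_sets for t in s)

    # Assumption: c is the maximum chi-square value of the term over classes.
    c = {}
    for t in vocab:
        best = 0.0
        for k, size in class_size.items():
            n11 = sum(1 for s, y in zip(doc_sets, labels) if t in s and y == k)
            n10 = df[t] - n11
            n01 = size - n11
            n00 = n_docs - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        c[t] = best

    vectors = []
    for d in docs:
        vec = {}
        for t, freq in Counter(d).items():
            tf = freq / len(d)
            idf = math.log(n_docs / df[t])
            vec[t] = tf * idf * c[t] * pos_weight.get(t, 1.0)
        vectors.append(vec)
    return vectors
```

In this sketch, a term that occurs only in documents of one class receives a larger c, and therefore a larger weight, than a term spread across classes; plain tf-idf cannot make that distinction because it never looks at the labels.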


