Journal: 《信息网络安全》

Research on a Short Text Representation Method Based on Sentential Semantic Components

         

Abstract

With the rapid development of the mobile Internet and information technology, the volume of short texts such as comments and microblog posts has grown explosively. Short texts contain little data and their features are sparse, so an effective short text representation method is needed to improve applications such as text classification, text clustering, hot event detection, and public opinion analysis. To address this feature sparseness, this paper proposes a short text representation method based on sentential semantic components. The method takes the semantic information of the short text into account: while keeping the dimension of the feature space unchanged, it combines sentential semantic components with a topic model to obtain semantically related words, builds topic selection rules from the topic and comment elements of the sentential semantic structure model, and expands the short text with the related words selected by those rules. This reduces the number of zero-valued features in the short text representation vectors. Short text classification experiments were carried out on three categories of the Sogou text classification corpus, with model parameters chosen by 5-fold cross-validation. The results show that the method reaches a short text classification accuracy of 0.7958, which is better than the short text representation methods it was compared with. Enriching short texts with semantically related words therefore effectively mitigates feature sparseness and can improve the performance of short text classification and similar applications.
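The core idea of the abstract, expanding a short text with related words while keeping the feature space fixed, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the paper derives related words from sentential semantic components and a topic model and filters them with topic selection rules, whereas the `RELATED` dictionary, the `VOCAB` list, and all function names below are hypothetical stand-ins chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def bow_vector(tokens: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """Term-count vector over a fixed vocabulary (the feature space)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def expand_tokens(
    tokens: Sequence[str],
    related: Dict[str, List[str]],
    vocab: Sequence[str],
) -> List[str]:
    """Append related words that are in the vocabulary and not yet present.

    Only in-vocabulary words are added, so the vector dimension never changes.
    """
    in_vocab = set(vocab)
    seen = set(tokens)
    expanded = list(tokens)
    for t in tokens:
        for w in related.get(t, []):
            if w in in_vocab and w not in seen:
                expanded.append(w)
                seen.add(w)
    return expanded


def zero_count(vec: Sequence[int]) -> int:
    """Number of zero-valued features in a vector."""
    return sum(1 for v in vec if v == 0)


# Hypothetical vocabulary and related-word table (assumptions, not from the paper).
VOCAB = ["phone", "battery", "screen", "charger", "camera", "price"]
RELATED = {"battery": ["charger", "phone"], "screen": ["camera"]}

tokens = ["battery", "screen"]
before = bow_vector(tokens, VOCAB)
after = bow_vector(expand_tokens(tokens, RELATED, VOCAB), VOCAB)

print(len(before) == len(after))              # dimension is unchanged
print(zero_count(before), zero_count(after))  # fewer zero-valued features
```

In this toy setting the expanded vector has the same length as the original but fewer zero entries, which is the sparseness reduction the abstract describes. How well this helps in practice depends on how the related words are selected, which is the part the paper addresses with its topic selection rules.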
