IEEE International Conference on Power, Intelligent Computing and Systems

Semantic Information Detection of Webpage Based on Word Vector and Infomap



Abstract

For Chinese web pages, we use regular expressions and the Viterbi algorithm to filter Chinese text and segment it into words, then use the ngram2vec algorithm to obtain the word vector set of the web page and to pre-train the word vector set of Baidu Encyclopedia. The Baidu Encyclopedia word vectors are clustered with the Infomap algorithm and the resulting clusters are tagged with types; a neural network is then trained on the training data set and the Baidu Encyclopedia corpus so that the type of an unknown web page can be determined through the network, thereby detecting its semantic information. The algorithm has few hyperparameters and high computational efficiency. Experiments show that the trained neural network model reaches an accuracy of 96.73% and can quickly and accurately identify the type of a web page.
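The abstract does not include code; the sketch below is only a minimal illustration of the described pipeline under several assumptions: jieba stands in for the paper's regex-plus-Viterbi segmentation, gensim-format pre-trained vectors stand in for the ngram2vec output (the file name baike_vectors.txt is a placeholder), and word clustering uses the infomap Python package over an assumed k-nearest-neighbour similarity graph whose neighbourhood size k is not taken from the paper.

import re

import jieba                              # HMM/Viterbi-based Chinese segmenter, a stand-in for the paper's segmenter
from gensim.models import KeyedVectors    # loads pre-trained vectors (e.g. ngram2vec output in word2vec text format)
from infomap import Infomap               # Python bindings of the Infomap community-detection tool


def extract_chinese_words(html_text):
    """Keep only Chinese characters via a regular expression, then segment into words."""
    chinese_only = "".join(re.findall(r"[\u4e00-\u9fff]+", html_text))
    # Dropping single-character tokens is a common heuristic, not a detail from the paper.
    return [w for w in jieba.cut(chinese_only) if len(w) > 1]


# Hypothetical path: pre-trained Baidu Encyclopedia vectors exported in word2vec text format.
vectors = KeyedVectors.load_word2vec_format("baike_vectors.txt")


def cluster_page_words(words, k=10):
    """Cluster the page's word vectors with Infomap over a k-nearest-neighbour similarity graph."""
    words = [w for w in words if w in vectors]
    ids = {w: i for i, w in enumerate(words)}
    im = Infomap("--two-level --silent")
    for w in words:
        # Link each word to its k most similar vocabulary neighbours that also
        # appear on the page, weighted by cosine similarity (an assumed graph construction).
        for neighbour, sim in vectors.most_similar(w, topn=k):
            if neighbour in ids and sim > 0:
                im.add_link(ids[w], ids[neighbour], sim)
    im.run()
    modules = im.get_modules()            # maps node id -> Infomap module (cluster) id
    return {w: modules.get(ids[w]) for w in words}


if __name__ == "__main__":
    page_words = extract_chinese_words("<html>示例网页正文</html>")
    print(cluster_page_words(page_words))

The paper's final step, training a neural network on the tagged clusters and the Baidu Encyclopedia corpus to classify unknown pages, would sit on top of the cluster labels returned by cluster_page_words; that classifier is not sketched here.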
