
Chinese Web Content Extraction Based on Naive Bayes Model


Abstract

As web content extraction becomes increasingly difficult, this paper proposes a method that uses a Naive Bayes model trained on the attribute feature values of web page blocks. The method first denoises the web page, represents it as a DOM tree, and divides the page into blocks; it then uses the Naive Bayes model to compute a probability for each block from its statistical features. Finally, it extracts the theme blocks and combines them into the content of the web page. Tests show that the algorithm extracts web page content accurately, with an average accuracy of 96.2%. The method has been adopted to extract content for the off-portal search of the Hunan Farmer Training Website, where it performs efficiently.
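The block-classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not list the statistical features or the model variant, so everything below is an assumption made for illustration: the three features (log text length, link density, log punctuation count), the Gaussian form of Naive Bayes, the variance-smoothing constant, and the names Block, features, GaussianNaiveBayes, and extract_content are all hypothetical. The denoising and DOM-based block segmentation are assumed to have already produced the blocks.

```python
import math
from collections import defaultdict
from dataclasses import dataclass

PUNCTUATION = ",.!?;:,。!?;:"


@dataclass
class Block:
    text: str        # visible text of one page block
    link_chars: int  # characters of that text that sit inside links


def features(block):
    """Assumed statistical features of a block (not taken from the paper)."""
    n = len(block.text)
    punct = sum(1 for ch in block.text if ch in PUNCTUATION)
    link_density = block.link_chars / n if n else 1.0
    return [math.log1p(n), link_density, math.log1p(punct)]


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, rows, labels):
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        self.stats = {}
        for label, group in by_class.items():
            prior = len(group) / len(rows)
            columns = list(zip(*group))
            means = [sum(col) / len(col) for col in columns]
            # 1e-3 is an assumed smoothing term that avoids zero variance.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + 1e-3
                for col, m in zip(columns, means)
            ]
            self.stats[label] = (prior, means, variances)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior of each class for one feature row."""
        scores = {}
        for label, (prior, means, variances) in self.stats.items():
            score = math.log(prior)
            for v, m, var in zip(row, means, variances):
                score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


def extract_content(blocks, model):
    """Join the text of every block classified as 'content'."""
    return "\n".join(b.text for b in blocks if model.predict(features(b)) == "content")


# Toy training data, invented for this sketch.
training = [
    (Block("The training course covers rice planting, pest control, and irrigation. " * 10, 20), "content"),
    (Block("Farmers reported higher yields after the program; details follow. " * 15, 10), "content"),
    (Block("Registration opens in March. Bring your ID card, please! " * 8, 30), "content"),
    (Block("Home | About | Contact | Login", 24), "noise"),
    (Block("Related links: News Policy Market Weather Forum", 33), "noise"),
    (Block("Next page Previous page Top", 23), "noise"),
]
model = GaussianNaiveBayes().fit(
    [features(block) for block, _ in training],
    [label for _, label in training],
)

nav = Block("Sitemap | Help | Feedback", 19)
article = Block("The method removes noise, builds a DOM tree, and scores each block. " * 12, 15)
extracted = extract_content([nav, article], model)
```

In this sketch the navigation block is dropped and only the article block survives in extracted. A real system would need labeled blocks from actual pages and would likely use a richer feature set than the three assumed here.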
