Journal: Computer Engineering and Design (计算机工程与设计)

Web Page Main-Text Extraction Based on an Improved Content Analysis Algorithm

     

Abstract

The content analysis algorithm, namely the Readability algorithm, tends to lose parts of the main text, anchor text, and structured data (tables, lists) during main-text extraction. To address these shortcomings, an improved web page main-text extraction algorithm is proposed. Based on the structural characteristics of web page main text, the algorithm extends the original by evaluating the text characteristics of non-p tag nodes; it introduces the relative distance between nodes to filter out web page noise that has strong text characteristics; and it redefines the pruning scope to avoid over-pruning, largely alleviating the loss of information inside the main text that the Readability algorithm suffers from. Main-text extraction experiments were carried out on major Chinese blog, news, popular science, and professional websites. The results show that the proposed algorithm outperforms the Readability algorithm, with main-text extraction accuracy above 95%.
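The abstract names the ingredients of the method but gives no formulas, so the following Python sketch is only an illustration of the general idea, not the paper's algorithm. Everything specific in it is an assumption: the set of non-p candidate tags, the `score` function (text length discounted by link density), the 0.5 link-density cutoff, and the definition of relative distance as the gap in document order from the best-scoring node divided by the node count. The redefined pruning step is not modelled. The sketch parses with `xml.etree.ElementTree`, which requires well-formed markup; real pages would need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Tags scored as text containers. The tags beyond <p> are an assumed
# stand-in for the "non-p tag nodes" mentioned in the abstract.
CANDIDATE_TAGS = {"p", "div", "td", "li", "pre", "blockquote"}

SAMPLE = """<html><body>
<div><a href="/">Home</a> <a href="/news">News</a></div>
<p>The improved algorithm first scores paragraph text, as the original <a href="/r">Readability</a> algorithm does.</p>
<table><tr><td>Tables inside the article body are kept.</td></tr></table>
<ul><li>List items are kept as well.</li></ul>
<div><a href="/a">Related one</a> <a href="/b">Related two</a> <a href="/c">Related three</a></div>
<div>Copyright notice in plain text.</div>
</body></html>"""


def node_text(node):
    """All text under a node, with surrounding whitespace stripped."""
    return "".join(node.itertext()).strip()


def link_density(node):
    """Share of a node's text that sits inside <a> elements (0.0 to 1.0)."""
    total = len(node_text(node))
    if total == 0:
        return 0.0
    return sum(len(node_text(a)) for a in node.iter("a")) / total


def score(node):
    """Hypothetical text score: long text with few links scores high."""
    return len(node_text(node)) * (1.0 - link_density(node))


def is_leaf_candidate(node):
    """True for a candidate tag that contains no other candidate tag."""
    if node.tag not in CANDIDATE_TAGS:
        return False
    return not any(c.tag in CANDIDATE_TAGS for c in node.iter() if c is not node)


def extract(html, max_link_density=0.5, max_distance=0.5):
    """Return the text blocks judged to belong to the main text."""
    nodes = list(ET.fromstring(html).iter())  # document order
    candidates = [
        (i, n) for i, n in enumerate(nodes)
        if is_leaf_candidate(n) and node_text(n)
        and link_density(n) < max_link_density
    ]
    if not candidates:
        return []
    best_index, _ = max(candidates, key=lambda c: score(c[1]))
    kept = []
    for i, n in candidates:
        # Assumed relative distance: gap in document order from the
        # best-scoring node, normalised by the total node count.
        if abs(i - best_index) / len(nodes) <= max_distance:
            kept.append(node_text(n))
    return kept


if __name__ == "__main__":
    for block in extract(SAMPLE):
        print(block)
```

On the sample page, the navigation and related-link blocks are dropped because most of their text is link text, while the footer is plain text and is dropped only because it lies far from the best-scoring paragraph. This is the kind of noise with strong text characteristics that the abstract says a distance measure is meant to filter. The table cell, the list item, and the anchor text inside the paragraph are retained.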
