System and method for recognizing non-body text in webpage

Abstract

The invention discloses a system and method for recognizing non-body text in a webpage, and relates to the field of main body text extraction. The system comprises: a webpage grabber configured to fetch the data of all webpages of a target website; a DOM tree construction unit configured to construct a DOM tree for each webpage of the target website; a DOM tree analysis unit configured to locate the unit text sections in each webpage according to its DOM tree; a text statistics unit configured to count the number of occurrences of each unit text section across all webpages of the target website; and a text recognition unit configured to recognize a unit text section as non-body text when its number of occurrences is greater than a predetermined threshold. The system and method overcome the lag in recognizing non-body text that affects prior-art methods, and achieve high recognition accuracy.
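
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the pages have already been fetched as HTML strings, it treats each whitespace-normalized text node as a "unit text section" instead of building a full DOM tree, and it counts a section once per page. The names `TextSegmentCollector`, `unit_text_segments`, and `find_non_body_text` are hypothetical and do not come from the patent.

```python
from collections import Counter
from html.parser import HTMLParser


class TextSegmentCollector(HTMLParser):
    """Collects the text nodes of a page, skipping <script> and <style>.

    Assumption: one text node stands in for one "unit text section".
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace so the same text matches across pages.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.segments.append(text)


def unit_text_segments(html):
    """Return the unit text sections found in one page."""
    parser = TextSegmentCollector()
    parser.feed(html)
    parser.close()
    return parser.segments


def find_non_body_text(pages, threshold):
    """Return the sections that occur on more than `threshold` pages.

    Assumption: a section is counted once per page, so the count is the
    number of pages of the site that contain it.
    """
    counts = Counter()
    for html in pages:
        counts.update(set(unit_text_segments(html)))
    return {text for text, n in counts.items() if n > threshold}
```

Given three pages that share a navigation bar and a copyright line but differ in their article paragraphs, calling `find_non_body_text(pages, threshold=2)` would flag only the two shared sections, since each article paragraph occurs on a single page.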

Bibliographic data

  • Publication number: US10042827B2

  • Publication date: 2018-08-07

    Original format: PDF

  • Applicant/Assignee: BEIJING QIHOO TECHNOLOGY COMPANY LIMITED

    Application number: US201314411013

  • Inventor: ZHIGANG WANG

    Filing date: 2013-06-09

  • Classification: G06F17/20; G06F17/22; G06F17/27; G06F17/30

  • Country: US

  • Added to database: 2022-08-21 13:02:58
