
WEB PAGE CLASSIFICATION BASED ON NOISE REMOVAL


Abstract

Systems and methods are provided for improving the accuracy of web content classification by removing perceived noise. The system receives the Uniform Resource Locator (URL) of a web page to be classified and parses the page to construct a tree containing a list of tags. Unwanted tags are removed from the list, yielding a tree that contains only the desired tags of the web page. A list of hyperlinks is then obtained by processing the tree of desired tags; this list can include both unwanted (undesired or invalid) hyperlinks and valid hyperlinks. The unwanted hyperlinks can then be removed from the list, each valid hyperlink can be categorized against a list of categories, and a final category for the web page is determined by a vector analysis of the categories assigned to the valid hyperlinks.
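
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and it rests on several assumptions that the abstract does not specify: the set of "noise" tags (`NOISE_TAGS`), the validity rule for hyperlinks (absolute `http`/`https` URLs only), the hypothetical lookup table `HOST_CATEGORY` standing in for a real URL categorization service, and the reading of "vector analysis" as a per-category count vector whose largest component wins. For brevity the sketch tracks tag nesting depth while parsing instead of building an explicit tree.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the abstract does not say which tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}

# Hypothetical category list and host lookup, for illustration only.
CATEGORIES = ["news", "shopping", "sports"]
HOST_CATEGORY = {
    "news.example": "news",
    "shop.example": "shopping",
    "scores.example": "sports",
}


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags that are not nested inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # > 0 while inside at least one noise tag
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag == "a" and self.noise_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1


def valid_links(base_url, hrefs):
    """Keeps only links that resolve to absolute http(s) URLs (assumed validity rule)."""
    kept = []
    for href in hrefs:
        if href.startswith("#"):  # same-page anchors carry no category signal
            continue
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc:
            kept.append(absolute)
    return kept


def classify_page(base_url, html):
    """Returns (final_category, count_vector) for the page, or (None, vector) if no link is categorized."""
    parser = LinkExtractor()
    parser.feed(html)
    vector = [0] * len(CATEGORIES)
    for link in valid_links(base_url, parser.links):
        category = HOST_CATEGORY.get(urlparse(link).hostname)
        if category is not None:
            vector[CATEGORIES.index(category)] += 1
    if not any(vector):
        return None, vector
    return CATEGORIES[vector.index(max(vector))], vector


SAMPLE_HTML = """
<html><body>
<nav><a href="https://shop.example/deals">Deals</a></nav>
<p><a href="https://news.example/world">World</a>
<a href="https://news.example/politics">Politics</a>
<a href="https://scores.example/live">Live scores</a>
<a href="javascript:void(0)">Menu</a>
<a href="mailto:editor@news.example">Contact</a></p>
<footer><a href="https://shop.example/ads">Ads</a></footer>
</body></html>
"""

result = classify_page("https://news.example/", SAMPLE_HTML)
print(result)
```

In this sample the shopping links inside `<nav>` and `<footer>` are discarded as noise and the `javascript:` and `mailto:` links are discarded as invalid, so only the three content links contribute to the category vector.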

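Bibliographic data

  • Publication number: US2018025012A1
  • Publication date: 2018-01-25
  • Original document format: PDF
  • Applicant/patentee: FORTINET INC.
  • Application number: US201615214245
  • Inventors: XIPING CAO; YE MA
  • Filing date: 2016-07-19
  • Classification: G06F17/30; H04L29/08
  • Country: US
  • Date added to database: 2022-08-21 13:01:30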
