METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE

Abstract

Techniques are described for normalizing dynamic URLs using a hierarchical organization of a web site. Given the web pages associated with a web site, an information extraction method generates data structures that represent the content or structure of each page. These data structures are appended to the corresponding dynamic URLs. The modified URLs are then tokenized, and the resulting tokens are clustered to create a hierarchical organization. Nodes of the hierarchical organization may be merged based upon occurrence or patterns of content and structure. The merged hierarchical organization may then be pruned to remove irrelevant information and to reduce its memory footprint. When a new dynamic URL is received, it is matched against the hierarchical organization: important parameters are taken into account, and irrelevant information may be removed. Based upon this match, a normalized URL is returned.
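The flow described in the abstract can be illustrated with a small Python sketch. It is not the patented method: the hierarchy is reduced to one node per (host, path), the extracted "data structure" is assumed to be a single content signature string per page, and the merging and pruning steps are omitted. All function names, the example.com URLs, and the rule used to decide which parameters are irrelevant are hypothetical choices made for illustration.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tokenize(url):
    """Split a URL into scheme, host, path segments, and sorted query pairs."""
    parts = urlsplit(url)
    path_tokens = [seg for seg in parts.path.split("/") if seg]
    query = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return parts.scheme, parts.netloc, path_tokens, query


def learn_irrelevant_params(pages):
    """Learn, per (host, path) node, which query parameters do not affect content.

    pages: iterable of (url, content_signature) pairs (assumed input format).
    Assumed rule: a parameter is irrelevant if, whenever it is the only thing
    that varies between crawled URLs, the content signature stays the same.
    Repeated query keys are collapsed to their last value for simplicity.
    """
    by_node = defaultdict(list)
    for url, signature in pages:
        _, host, path, query = tokenize(url)
        by_node[(host, tuple(path))].append((dict(query), signature))

    irrelevant = {}
    for node, entries in by_node.items():
        keys = {k for params, _ in entries for k in params}
        drop = set()
        for key in keys:
            # Group pages that agree on every parameter except `key`.
            groups = defaultdict(lambda: (set(), set()))
            for params, signature in entries:
                rest = tuple(sorted((k, v) for k, v in params.items() if k != key))
                values, signatures = groups[rest]
                values.add(params.get(key))
                signatures.add(signature)
            # Only groups in which `key` actually varied count as evidence.
            varied = [sigs for vals, sigs in groups.values() if len(vals) > 1]
            if varied and all(len(sigs) == 1 for sigs in varied):
                drop.add(key)
        irrelevant[node] = drop
    return irrelevant


def normalize(url, irrelevant):
    """Match a new URL to its node and drop the parameters learned as irrelevant."""
    scheme, host, path, query = tokenize(url)
    drop = irrelevant.get((host, tuple(path)), set())
    kept = [(k, v) for k, v in query if k not in drop]
    return urlunsplit((scheme, host, "/" + "/".join(path), urlencode(kept), ""))


# Toy crawl: `sid` never changes the page content, `id` does.
pages = [
    ("http://example.com/item?id=1&sid=aaa", "A"),
    ("http://example.com/item?id=1&sid=bbb", "A"),
    ("http://example.com/item?id=2&sid=aaa", "B"),
    ("http://example.com/item?id=2&sid=ccc", "B"),
]
rules = learn_irrelevant_params(pages)
normalized = normalize("http://example.com/item?sid=zzz&id=2", rules)
print(normalized)  # http://example.com/item?id=2
```

The sketch is deliberately conservative: a parameter that was never observed varying on its own is kept, since the crawl gives no evidence either way. A fuller version would nest nodes by path segment so that sibling nodes with the same parameter behaviour could be merged and pruned.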
