De-duping URLs via Rewrite Rules

Abstract

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with similar text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
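
To make the idea of learning rewrite rules from equivalence classes more concrete, here is a minimal toy sketch in Python. It is not the algorithm from the paper, whose rule framework the abstract does not spell out. It assumes one very narrow rule family ("drop query parameter p on host h"), and the function names `learn_drop_rules` and `canonicalize`, as well as the example URLs, are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url, drop_rules):
    """Apply 'drop parameter' rules and sort the remaining query parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query)
        if (parts.netloc, k) not in drop_rules
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def learn_drop_rules(classes):
    """Toy learner (an assumption, not the paper's method).

    classes: list of lists of URLs; each inner list is one equivalence class
    of URLs with similar content. Returns a set of (host, param) pairs.
    """
    # Step 1: a parameter is a candidate if it takes more than one value
    # inside a single class, i.e. it changes without changing the content.
    candidates = set()
    for urls in classes:
        seen = defaultdict(set)  # (host, param) -> values seen in this class
        for url in urls:
            parts = urlsplit(url)
            for k, v in parse_qsl(parts.query):
                seen[(parts.netloc, k)].add(v)
        candidates |= {key for key, values in seen.items() if len(values) > 1}

    # Step 2: keep a candidate only if dropping it never maps URLs from two
    # different classes to the same canonical form. Each rule is checked on
    # its own; joint safety of several rules is not verified in this sketch.
    rules = set()
    for cand in candidates:
        owner = {}  # canonical URL -> index of the class that produced it
        safe = True
        for idx, urls in enumerate(classes):
            for url in urls:
                if owner.setdefault(canonicalize(url, {cand}), idx) != idx:
                    safe = False
        if safe:
            rules.add(cand)
    return rules


training = [
    ["http://example.com/item?id=1&sid=aaa", "http://example.com/item?id=1&sid=bbb"],
    ["http://example.com/item?id=2&sid=ccc", "http://example.com/item?id=2&sid=ddd"],
]
rules = learn_drop_rules(training)

# A URL never seen before is canonicalized without fetching its content.
print(canonicalize("http://example.com/item?sid=zzz&id=3", rules))
# prints http://example.com/item?id=3
```

In this toy setting the learned rule says that `sid` on `example.com` does not affect content, so two newly crawled URLs that differ only in `sid` would collapse to one canonical form. A real system would need a richer rule language (path rewrites, host aliases, and so on) and some tolerance for noisy training classes.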