De-duping URLs via Rewrite Rules


Abstract

A large fraction of the URLs on the web contain duplicate (or near-duplicate) content. De-duping URLs is an extremely important problem for search engines, since all the principal functions of a search engine, including crawling, indexing, ranking, and presentation, are adversely impacted by the presence of duplicate URLs. Traditionally, the de-duping problem has been addressed by fetching and examining the content of the URL; our approach here is different. Given a set of URLs partitioned into equivalence classes based on the content (URLs in the same equivalence class have similar content), we address the problem of mining this set and learning URL rewrite rules that transform all URLs of an equivalence class to the same canonical form. These rewrite rules can then be applied to eliminate duplicates among URLs that are encountered for the first time during crawling, even without fetching their content.

In order to express such transformation rules, we propose a simple framework that is general enough to capture the most common URL rewrite patterns occurring on the web; in particular, it encapsulates the DUST (Different URLs with Similar Text) framework [5]. We provide an efficient algorithm for mining and learning URL rewrite rules and show that under mild assumptions, it is complete, i.e., our algorithm learns every URL rewrite rule that is correct, for an appropriate notion of correctness. We demonstrate the expressiveness of our framework and the effectiveness of our algorithm by performing a variety of extensive large-scale experiments.
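To make the idea of applying rewrite rules concrete, here is a minimal Python sketch of the application step only: mapping URLs to a canonical form and grouping duplicates without fetching any content. It is not the mining algorithm or the rule framework described in the abstract. The rules, the `sid` parameter name, the `example.com` URLs, and the function names `canonicalize` and `dedupe` are all hypothetical and hand-written for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, hand-written rewrite rules (assumptions for illustration;
# the paper learns its rules from equivalence classes of URLs instead).
# Each rule is a (compiled pattern, replacement) pair, applied in order.
REWRITE_RULES = [
    # Drop a session-id query parameter (assumed to be named "sid").
    (re.compile(r"([?&])sid=[^&]*&?"), r"\1"),
    # Strip a dangling "?" or "&" left behind by the previous rule.
    (re.compile(r"[?&]$"), ""),
    # Treat ".../index.html" and the bare directory as the same page.
    (re.compile(r"/index\.html$"), "/"),
]


def canonicalize(url: str) -> str:
    """Apply every rewrite rule in order and return the resulting URL."""
    for pattern, replacement in REWRITE_RULES:
        url = pattern.sub(replacement, url)
    return url


def dedupe(urls):
    """Group URLs by canonical form; each group is a set of presumed duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    seen = [
        "http://example.com/a?sid=123",
        "http://example.com/a?sid=456",
        "http://example.com/docs/index.html",
        "http://example.com/docs/",
    ]
    for canonical, members in dedupe(seen).items():
        print(canonical, "<-", members)
```

In this sketch a crawler would fetch only one URL per canonical form. Whether a given rule is safe to apply is exactly the correctness question the abstract raises, and the sketch does not address it.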


Bibliographic record

Source
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008) | 2008 | pp. 168-175 | 8 pages
Authors
Anirban Dasgupta; Ravi Kumar; Amit Sasturkar
Original format
PDF
Chinese Library Classification
Information and knowledge dissemination
Keywords
URL normalization; de-duping; rewrite rules


