Journal: Knowledge and Data Engineering, IEEE Transactions on

Removing DUST Using Multiple Alignment of Sequences


Abstract

A large number of URLs collected by web crawlers correspond to pages with duplicate or near-duplicate contents. To crawl, store, and use such duplicated data implies a waste of resources, the building of low quality rankings, and poor user experiences. To deal with this problem, several studies have been proposed to detect and remove duplicate documents without fetching their contents. To accomplish this, the proposed methods learn normalization rules to transform all duplicate URLs into the same canonical form. A challenging aspect of this strategy is deriving a set of general and precise rules. In this work, we present DUSTER, a new approach to derive quality rules that take advantage of a multi-sequence alignment strategy. We demonstrate that a full multi-sequence alignment of URLs with duplicated content, before the generation of the rules, can lead to the deployment of very effective rules. By evaluating our method, we observed it achieved larger reductions in the number of duplicate URLs than our best baseline, with gains of 82 and 140.74 percent in two different web collections.
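
The idea of aligning a cluster of duplicate URLs before deriving a rule can be illustrated with a minimal sketch. This is not the DUSTER algorithm itself, whose details are not given in the abstract. It assumes URLs are split into alphanumeric and punctuation tokens, uses Python's difflib.SequenceMatcher as a stand-in for a proper pairwise aligner, and folds the URLs into a consensus one at a time (a simple progressive alignment). The names tokenize, merge, consensus_pattern, to_regex, and normalize are hypothetical, as is the choice to drop the varying parts when building the canonical form.

```python
import re
from difflib import SequenceMatcher

# Sentinel marking a position where the aligned URLs disagree (assumption).
WILDCARD = None


def tokenize(url):
    """Split a URL into alphanumeric runs and single punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def merge(consensus, tokens):
    """Align two token sequences and keep only the tokens they share.

    Any region where the sequences differ (substitution, insertion,
    or deletion) collapses into a single WILDCARD.
    """
    matcher = SequenceMatcher(a=consensus, b=tokens, autojunk=False)
    merged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(consensus[i1:i2])
        elif not merged or merged[-1] is not WILDCARD:
            merged.append(WILDCARD)
    return merged


def consensus_pattern(urls):
    """Fold a non-empty cluster of duplicate URLs into one token pattern."""
    token_lists = [tokenize(u) for u in urls]
    consensus = token_lists[0]
    for tokens in token_lists[1:]:
        consensus = merge(consensus, tokens)
    return consensus


def to_regex(pattern):
    """Compile the pattern; each WILDCARD matches any run of characters."""
    parts = [".*" if tok is WILDCARD else re.escape(tok) for tok in pattern]
    return re.compile("".join(parts))


def normalize(url, pattern):
    """Map a URL covered by the rule to a canonical form (wildcards dropped).

    URLs that the rule does not cover are returned unchanged.
    """
    if to_regex(pattern).fullmatch(url):
        return "".join(tok for tok in pattern if tok is not WILDCARD)
    return url
```

Given a cluster such as http://example.com/story?id=1&sid=abc and http://example.com/story?id=1&sid=xyz, the consensus keeps every shared token and replaces the differing session value with a wildcard, so both URLs normalize to the same string. A real system would also need to check that a rule is precise, that is, that it does not merge URLs whose contents differ.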
