Journal: ACM Transactions on Information Systems

Clustering-Based Incremental Web Crawling


Abstract

When crawling resources (for example, the number of machines and the available crawl time) are limited, a crawler has to decide an optimal order in which to crawl and recrawl Web pages. Ideally, crawlers should request only those Web pages that have changed since the last crawl; in practice, a crawler may not know whether a Web page has changed before downloading it. In this article, we identify features of Web pages that are correlated with their change frequency. We design a crawling algorithm that clusters Web pages based on features that correlate with their change frequencies obtained by examining past history. The crawler downloads a sample of Web pages from each cluster, and depending upon whether a significant number of these Web pages have changed in the last crawl cycle, it decides whether to recrawl the entire cluster. To evaluate the performance of our incremental crawler, we develop an evaluation framework that measures which crawling policy results in the best search results for the end user. We run experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 Web sites. The results demonstrate that the clustering-based sampling algorithm effectively clusters the pages with similar change patterns, and that our clustering-based crawling algorithm outperforms existing algorithms in that it can improve the quality of the user experience for those who query the search engine.
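The sample-then-decide step described in the abstract can be illustrated with a minimal sketch. It assumes the pages have already been grouped into clusters and that some change check is available. The function name, the `sample_size` and `change_threshold` parameters, and the `has_changed` callback are hypothetical and chosen for illustration only. The abstract does not state the sample size, the test for a "significant number" of changes, or the clustering features, so a simple fixed fraction stands in for that test here.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence


def select_clusters_to_recrawl(
    clusters: Dict[Hashable, Sequence[str]],
    has_changed: Callable[[str], bool],
    sample_size: int = 5,           # assumed value, not from the paper
    change_threshold: float = 0.5,  # assumed stand-in for "significant number"
    seed: int = 0,
) -> List[Hashable]:
    """Return ids of clusters whose sampled pages changed often enough."""
    rng = random.Random(seed)
    selected: List[Hashable] = []
    for cluster_id, urls in clusters.items():
        if not urls:
            continue  # nothing to sample from an empty cluster
        # Download (here: check) only a small random sample of the cluster.
        sample = rng.sample(list(urls), min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        # Recrawl the whole cluster only if enough sampled pages changed.
        if changed / len(sample) >= change_threshold:
            selected.append(cluster_id)
    return selected


if __name__ == "__main__":
    # Toy data: every "news" page changed, no "archive" page changed.
    toy_clusters = {
        "news": ["n1", "n2", "n3", "n4"],
        "archive": ["a1", "a2", "a3", "a4"],
    }
    changed_pages = {"n1", "n2", "n3", "n4"}
    print(select_clusters_to_recrawl(
        toy_clusters, lambda url: url in changed_pages, sample_size=3
    ))  # prints ['news']
```

In a real crawler, `has_changed` would download the page and compare it with the stored copy (for instance by checksum), so the cost of the decision is bounded by the sample size rather than by the cluster size.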
