
The Discoverability of the Web



Abstract

Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider.

We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
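The maximum cover formulation and the overhead metric can be sketched briefly. In the following Python sketch (an illustration, not the paper's implementation), each candidate page is modeled as a set of new-content URLs it links to, a greedy heuristic repeatedly fetches the page covering the most not-yet-discovered URLs, and overhead is computed as fetches per newly discovered page, per the abstract's definition. The function and variable names are hypothetical.

```python
def greedy_discover(page_links, budget):
    """Greedy maximum-cover heuristic.

    page_links: dict mapping a candidate page to the set of new-content
    URLs it links to (assumed known, i.e. the perfect-foreknowledge case).
    budget: maximum number of pages to fetch.
    Returns (fetched_pages, discovered_urls).
    """
    page_links = dict(page_links)  # work on a copy
    discovered = set()
    fetched = []
    for _ in range(budget):
        # Pick the page whose links add the most undiscovered URLs.
        best = max(page_links,
                   key=lambda p: len(page_links[p] - discovered),
                   default=None)
        if best is None or not (page_links[best] - discovered):
            break  # nothing left to gain
        discovered |= page_links[best]
        fetched.append(best)
        del page_links[best]
    return fetched, discovered


def overhead(num_fetches, num_new_pages):
    """Average number of fetches per newly discovered page."""
    return num_fetches / num_new_pages if num_new_pages else float("inf")
```

For example, with `{"hub": {1, 2, 3}, "b": {3, 4}, "c": {5}}` the greedy pass fetches `"hub"` first, since it covers the most undiscovered URLs; a hub page covering many new pages is exactly what makes sub-100% overhead possible.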
