
The Discoverability of the Web



Abstract

Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider.

We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.
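The maximum cover formulation and the overhead metric can be sketched briefly. In the following Python sketch (an illustration, not the paper's implementation), each candidate page is modeled as a set of new-content URLs it links to, a greedy heuristic repeatedly fetches the page covering the most not-yet-discovered URLs, and overhead is computed as fetches per newly discovered page, per the abstract's definition. The function and variable names are hypothetical.

```python
def greedy_discover(page_links, budget):
    """Greedy maximum-cover heuristic.

    page_links: dict mapping a candidate page to the set of new-content
    URLs it links to (assumed known, i.e. the perfect-foreknowledge case).
    budget: maximum number of pages to fetch.
    Returns (fetched_pages, discovered_urls).
    """
    page_links = dict(page_links)  # work on a copy
    discovered = set()
    fetched = []
    for _ in range(budget):
        # Pick the page whose links add the most undiscovered URLs.
        best = max(page_links,
                   key=lambda p: len(page_links[p] - discovered),
                   default=None)
        if best is None or not (page_links[best] - discovered):
            break  # nothing left to gain
        discovered |= page_links[best]
        fetched.append(best)
        del page_links[best]
    return fetched, discovered


def overhead(num_fetches, num_new_pages):
    """Average number of fetches per newly discovered page."""
    return num_fetches / num_new_pages if num_new_pages else float("inf")
```

For example, with `{"hub": {1, 2, 3}, "b": {3, 4}, "c": {5}}` the greedy pass fetches `"hub"` first, since it covers the most undiscovered URLs; a hub page covering many new pages is exactly what makes sub-100% overhead possible.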
