Journal: International Journal on Digital Libraries

Lost but not forgotten: finding pages on the unarchived web



Abstract

Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Because of limits on crawling depth and crawling frequency, and because of restrictive selection policies, large parts of the web are unarchived and therefore lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page-level and host-level representations of unarchived content. Our main findings are the following. First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a web archive. Second, the link and anchor text evidence has a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations further improving retrieval effectiveness for websites.
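
The idea of describing an unarchived page through the links and anchor text that point to it can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_representations`, the `(source_url, target_url, anchor_text)` record layout, the whitespace tokenisation, and the toy URLs are all assumptions made for illustration. The sketch collects inlink counts and anchor-text terms for every link target that is absent from the archive, both per page and aggregated per host.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def build_representations(links, archived_urls):
    """Aggregate inlink anchor text for link targets missing from the archive.

    links: iterable of (source_url, target_url, anchor_text) tuples
           (hypothetical record layout, not taken from the paper)
    archived_urls: set of URLs that the archive actually holds
    Returns (page_repr, host_repr): dicts mapping a URL or a host name to a
    dict with an inlink count and a Counter of anchor-text terms.
    """
    def new_entry():
        return {"inlinks": 0, "terms": Counter()}

    page_repr = defaultdict(new_entry)
    host_repr = defaultdict(new_entry)

    for _source, target, anchor in links:
        if target in archived_urls:
            continue  # only unarchived targets are of interest
        host = urlsplit(target).hostname
        if host is None:
            continue  # skip targets without a host (e.g. mailto: links)
        # Naive lowercase whitespace tokenisation (an assumption).
        terms = anchor.lower().split()
        # Credit the same evidence to the page and to its host.
        for entry in (page_repr[target], host_repr[host]):
            entry["inlinks"] += 1
            entry["terms"].update(terms)

    return dict(page_repr), dict(host_repr)


# Toy, made-up link records for illustration only.
links = [
    ("http://a.example/index.html", "http://lost.example/", "Lost Museum"),
    ("http://b.example/links.html", "http://lost.example/", "museum home page"),
    ("http://a.example/index.html", "http://lost.example/visit.html", "opening hours"),
    ("http://b.example/links.html", "http://a.example/index.html", "site A"),
]
archived = {"http://a.example/index.html", "http://b.example/links.html"}

pages, hosts = build_representations(links, archived)
```

In this toy input the home page of the unarchived host receives more links and more anchor terms than the deeper page, and the host-level entry pools the evidence of both, which mirrors the skew and the benefit of host-level aggregation described in the abstract.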
