Lost but not forgotten: finding pages on the unarchived web

Hugo C. Huurdeman; Jaap Kamps; Thaer Samar; Arjen P. de Vries; Anat Ben-David; Richard A. Rogers

Abstract

Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page-level and host-level representations of unarchived content. Our main findings are the following. First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a web archive. Second, the link and anchor text evidence has a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
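The idea of describing unarchived pages through link evidence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input format (tuples of source URL, target URL, and anchor text), the function names `build_page_representations` and `aggregate_by_host`, and the whitespace tokenization are all assumptions made for illustration. The sketch collects, for every link target that is missing from the archive, the number of inlinks and the anchor-text terms, and then pools that evidence per host.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def build_page_representations(links, archived_urls):
    """Describe unarchived pages using links found in archived pages.

    links: iterable of (source_url, target_url, anchor_text) tuples
           (a hypothetical input format, assumed for this sketch).
    archived_urls: set of URLs that are present in the archive.
    """
    pages = defaultdict(lambda: {"inlinks": 0, "terms": Counter()})
    for _source, target, anchor in links:
        if target in archived_urls:
            continue  # the target was crawled, so it needs no reconstruction
        rep = pages[target]
        rep["inlinks"] += 1
        # Naive tokenization: lowercase and split on whitespace.
        rep["terms"].update(anchor.lower().split())
    return dict(pages)


def aggregate_by_host(page_reps):
    """Pool page-level evidence into one representation per host."""
    hosts = defaultdict(lambda: {"inlinks": 0, "terms": Counter(), "pages": 0})
    for url, rep in page_reps.items():
        host = urlsplit(url).hostname or ""
        entry = hosts[host]
        entry["inlinks"] += rep["inlinks"]
        entry["terms"].update(rep["terms"])
        entry["pages"] += 1
    return dict(hosts)
```

In this sketch a page that is linked only once ends up with a very sparse description, while pooling by host merges the anchor terms of all pages on that host, which mirrors the abstract's observation that host-level representations are richer than page-level ones.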


Bibliographic record

Source: International Journal on Digital Libraries, 2015, issue 4, pp. 247-265 (19 pages)
Original format: PDF
Language: English
Keywords: Web archives; Web archiving; Web crawlers; Anchor text; Link evidence; Information retrieval

