International Conference on Database and Expert Systems Applications (DEXA 2011)

Improving the Quality of Web Archives through the Importance of Changes*

Abstract

Due to the growing importance of the Web, several archiving institutes (national libraries, Internet Archive, etc.) are harvesting sites to preserve (a part of) the Web for future generations. A major issue encountered by archivists is preserving the quality of web archives. One way of assessing the quality of an archive is to quantify its completeness and the coherence of its page versions. Because of the large number of pages to be captured and the limited resources (storage space, bandwidth, etc.), it is impossible to have a complete archive (one containing all the versions of all the pages). It is also impossible to ensure the coherence of all captured versions, because pages change very frequently during the crawl of a site. Nonetheless, it is possible to maximize the quality of archives by adjusting the web crawler's strategy. Our idea is (i) to improve the completeness of the archive by downloading the most important versions and (ii) to keep the most important versions as coherent as possible. Moreover, we introduce a pattern model which describes how the importance of page changes behaves over time. Based on patterns, we propose a crawl strategy to improve both the completeness and the coherence of web archives. Experiments based on real patterns show the usefulness and the effectiveness of our approach.
