International Conference on Theory and Practice of Digital Libraries
Identifying 'Soft 404' Error Pages: Analyzing the Lexical Signatures of Documents in Distributed Collections

Abstract

Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers, who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, to be friendlier to human readers, many servers instead respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers respond in this manner. To help collection managers identify these "friendly" or "soft" 404s, we developed two methods that use a Naive Bayes classifier based on known valid responses and known 404 responses. The classifier was able to predict soft 404 pages with a precision of 99% and a recall of 92%. We also elaborate on the results obtained from our study and detail the lessons learned.
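
To make the classification idea concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier trained on known valid responses and known 404 responses. It is an illustration only, not the authors' implementation: the class name `NaiveBayes`, the `tokenize` helper, the label strings, and the toy training texts are all assumptions, and the abstract does not specify how the two methods or the lexical signatures are actually constructed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # every token seen during training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration, not from the study).
valid_pages = [
    "Welcome to the digital library collection of historical maps",
    "Browse articles and download the full text of this report",
]
error_pages = [
    "Sorry, the page you requested could not be found",
    "Error: this document is no longer available on this server",
]

model = NaiveBayes().fit(
    valid_pages + error_pages,
    ["valid"] * len(valid_pages) + ["404"] * len(error_pages),
)

# Body text of a response that arrived with a 200 OK status code.
print(model.predict("The page you requested is no longer available"))
```

In a setting like the one described, a response that returns 200 OK but is classified as "404" from its body text would be flagged as a candidate soft 404 for the collection manager to review.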
