Application of structured document parsing to focused web crawling

Abstract

The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
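
The idea in the abstract can be illustrated with a minimal Python sketch. This is not the interface described in the paper: the element weights, the term-counting score, and the names `parse_document` and `DownloadScheduler` are all hypothetical, chosen only to show a structure-aware parser handing scored links to a scheduler that knows nothing about HTML.

```python
import heapq
from html.parser import HTMLParser

# Hypothetical weights: how much the text of each element counts toward relevance.
ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.0}


class StructureAwareParser(HTMLParser):
    """Collects link anchor text and the text of other weighted elements."""

    def __init__(self):
        super().__init__()
        self.links = []         # (href, anchor_text) pairs
        self.element_text = []  # (tag, text) pairs for weighted non-link elements
        self._stack = []        # currently open weighted elements: [tag, href, parts]

    def handle_starttag(self, tag, attrs):
        if tag in ELEMENT_WEIGHTS:
            self._stack.append([tag, dict(attrs).get("href"), []])

    def handle_data(self, data):
        # Text belongs to every weighted element that is currently open.
        for entry in self._stack:
            entry[2].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            tag, href, parts = self._stack.pop()
            text = " ".join("".join(parts).split())
            if tag == "a" and href:
                self.links.append((href, text))
            elif tag != "a":
                self.element_text.append((tag, text))


def term_score(text, topic_terms):
    """Counts words in the text that are topic terms (a deliberately crude measure)."""
    return sum(1 for word in text.lower().split() if word in topic_terms)


def parse_document(html, topic_terms):
    """Parser side of the interface: returns (page_score, [(link_score, href), ...])."""
    parser = StructureAwareParser()
    parser.feed(html)
    parser.close()
    page_score = sum(
        ELEMENT_WEIGHTS[tag] * term_score(text, topic_terms)
        for tag, text in parser.element_text
    )
    links = [
        (page_score + ELEMENT_WEIGHTS["a"] * term_score(text, topic_terms), href)
        for href, text in parser.links
    ]
    return page_score, links


class DownloadScheduler:
    """Scheduler side: a priority queue of URLs, highest score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores keep insertion order

    def offer(self, score, url):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, self._counter, url))
            self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

In this sketch the only thing crossing the boundary between the two components is a list of `(score, url)` pairs, so either side could be replaced without touching the other. A real robot would also need URL resolution, fetching, and politeness rules, none of which are shown here.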

Bibliographic details

  • Source
    Computer Standards & Interfaces | 2011, Issue 3 | pp. 325-331 | 7 pages
  • Authors

    Ahmed Patel; Nikita Schmidt;

  • Author affiliations

    School of Computer Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (The National University of Malaysia), 43600 UKM Bangi, Selangor Darul Ehsan, Malaysia

  • Indexed in: Science Citation Index (SCI, USA); Engineering Index (EI, USA)
  • Original format: PDF
  • Language of text: English
  • Keywords

    Information structure; Structural element; Attribute; Topic-specific; Focused web crawler; Robot; Spider;
