Conference: International Conference on Management and Service Science

Research of Self-adaptive Web Page Parser based on Templates and Rules

Abstract

Web page parsing has attracted considerable attention in recent years. How to eliminate human intervention and generate rules for extracting subject information from large numbers of web pages, as quickly and accurately as possible, has become an important research problem in this field. This paper proposes a framework for a self-adaptive web page parser based on templates and rules. The framework first applies a noise-filtering algorithm to remove irrelevant and invalid nodes, then combines page templates with heuristic rules to generate extraction rules. An automatic detection mechanism also lets it adjust the extraction rules dynamically in response to external factors. Parsers generated with this framework are more self-adaptive, generate extraction rules more effectively, and locate and extract subject information more accurately. Experimental results demonstrate the effectiveness of the parser.
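
The abstract gives no implementation details, so the following is only a minimal sketch of how the three stages it names (noise filtering, template rules with a heuristic fallback, and automatic rule adjustment) might fit together. Every name here (NOISE_TAGS, filter_noise, apply_template, heuristic_node, extract), the (tag, class) rule format, and the paragraph-length heuristic are assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}  # assumed noise set
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}    # tags with no end tag

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""

    def full_text(self):
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p.strip() for p in parts if p.strip())

class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree using only the standard library."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur  # walk up to the matching open tag; ignore stray end tags
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.text += data.strip() + " "

def iter_nodes(node):
    for child in node.children:
        yield child
        yield from iter_nodes(child)

def filter_noise(node):
    """Stage 1: drop assumed-noise tags and nodes that carry no text."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        filter_noise(child)
        if child.full_text():
            kept.append(child)
    node.children = kept

def apply_template(root, rule):
    """Stage 2a: a template rule is an assumed (tag, css_class) pair."""
    tag, cls = rule
    for n in iter_nodes(root):
        classes = (n.attrs.get("class") or "").split()
        if n.tag == tag and (cls is None or cls in classes):
            return n
    return None

def heuristic_node(root):
    """Stage 2b: pick the node whose direct <p> children hold the most text."""
    def score(n):
        return sum(len(c.full_text()) for c in n.children if c.tag == "p")
    best = max(iter_nodes(root), key=score, default=None)
    return best if best is not None and score(best) > 0 else None

def extract(html, rule=None):
    """Stage 3: if the stored rule no longer matches, regenerate it."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    filter_noise(builder.root)
    node = apply_template(builder.root, rule) if rule else None
    if node is None:  # missing or stale rule: fall back to the heuristic
        node = heuristic_node(builder.root)
        if node is None:
            return "", rule
        classes = (node.attrs.get("class") or "").split()
        rule = (node.tag, classes[0] if classes else None)
    return node.full_text(), rule

page = ('<html><body><nav><p>Home News</p></nav>'
        '<div class="post"><p>First paragraph of the story.</p>'
        '<p>Second paragraph.</p></div>'
        '<script>var x=1;</script></body></html>')
text, rule = extract(page, ("div", "article"))  # stale rule triggers the fallback
```

In this sketch the stale rule ("div", "article") matches nothing, so extract falls back to the heuristic and returns an updated rule that can be stored and reused for later pages from the same site. A real system would need a richer rule language (for example full paths) and a better signal for detecting that a template has changed.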
