
Applying Pattern Mining to Web Information Extraction


Abstract

Information extraction (IE) from semi-structured Web documents is a critical issue for information integration systems on the Internet. Previous work on wrapper induction, such as WIEN, Stalker, and Softmealy, aims to solve this problem by applying machine learning to generate extractors automatically. However, this approach still requires human intervention to provide training examples. In this paper, we propose a novel approach to IE based on repeated pattern mining and multiple pattern alignment. Repeated patterns are discovered using a data structure called a PAT tree. Incomplete patterns are then revised by pattern alignment so that they cover all pattern instances. This new approach to IE requires neither human effort nor content-dependent heuristics. Experimental results show that the constructed extraction rules achieve 97 percent extraction over fourteen popular search engines.
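
The repeated-pattern step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the page is first encoded as a sequence of tag tokens, and it replaces the PAT tree with naive n-gram counting, which finds the same kind of repeats on small inputs but far less efficiently. The function names `tag_tokens` and `repeated_patterns`, and the `TEXT` placeholder, are hypothetical.

```python
import re
from collections import defaultdict


def tag_tokens(html):
    """Encode HTML as a list of tag tokens, with TEXT standing in for any text run."""
    tokens = []
    for m in re.finditer(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)", html):
        if m.group(2):
            # Tag: keep only the (upper-cased) name, drop attributes.
            tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        elif m.group(3).strip():
            # Non-empty text between tags.
            tokens.append("TEXT")
    return tokens


def repeated_patterns(tokens, min_len=3, min_count=2):
    """Map each token n-gram to its non-overlapping start positions.

    Naive stand-in for PAT-tree lookup (an assumption of this sketch):
    every n-gram of length >= min_len is counted directly.
    """
    found = {}
    for n in range(min_len, len(tokens) // min_count + 1):
        positions = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            positions[tuple(tokens[i:i + n])].append(i)
        for pattern, starts in positions.items():
            kept = []
            for s in starts:
                # Skip occurrences that overlap the previously kept one.
                if not kept or s >= kept[-1] + n:
                    kept.append(s)
            if len(kept) >= min_count:
                found[pattern] = kept
    return found


html = ("<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li>"
        "<li><a href='z'>C</a></li></ul>")
tokens = tag_tokens(html)
record = ("<LI>", "<A>", "TEXT", "</A>", "</LI>")
print(repeated_patterns(tokens)[record])  # [1, 6, 11]
```

On this toy list page, the five-token pattern for one list item recurs three times, which is the kind of regularity an extraction rule could be built from. Revising incomplete patterns by multiple alignment, as the abstract describes, is not covered by this sketch.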
