Journal: World Wide Web

Extracting Web Data Using Instance-Based Learning

Abstract

This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction and automated methods. We propose an instance-based learning method that performs extraction by comparing each new instance to be extracted with labeled instances. The key advantage of the method is that, unlike wrapper induction, it does not require an initial set of labeled pages from which to learn extraction rules. Instead, the algorithm can start extraction from a single labeled instance, and a new instance needs labeling only when it cannot be extracted. This avoids unnecessary page labeling and solves a major problem with inductive learning (or wrapper induction), namely that the set of labeled instances may not be representative of all other instances. The instance-based approach is natural because structured data on the Web usually follow fixed templates, so pages that share a template can usually be extracted from a single labeled page instance of that template. A novel technique is proposed to match a new instance against a manually labeled instance and, in the process, to extract the required data items from the new instance. The technique is also very efficient. Experimental results on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method, which also significantly outperforms existing state-of-the-art systems.
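
The label-on-demand loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's matching technique, which the abstract does not specify. As a stand-in, it assumes that a data item can be located by the exact tokens immediately before and after it in a labeled page. All names here (`tokenize`, `find_seq`, `LabeledInstance`, `extract_all`, `label_fn`) are hypothetical and chosen for illustration only.

```python
import re

# A token is either a whole HTML tag or a run of non-space text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split a page into HTML tags and whitespace-delimited words."""
    return TOKEN_RE.findall(html)


def find_seq(tokens, seq, start=0):
    """Index of the first occurrence of seq in tokens at or after start, else -1."""
    n = len(seq)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == seq:
            return i
    return -1


class LabeledInstance:
    """Stores the token context around one manually labeled data item.

    Assumption: exact prefix/suffix token matching stands in for the
    paper's (unspecified here) instance-matching technique.
    """

    def __init__(self, html, item_text, context=2):
        tokens = tokenize(html)
        item = tokenize(item_text)
        i = find_seq(tokens, item)
        if i < 0:
            raise ValueError("labeled item not found in page")
        self.prefix = tokens[max(0, i - context):i]
        self.suffix = tokens[i + len(item):i + len(item) + context]

    def extract(self, html):
        """Return the item text from a new page, or None if the page does not match."""
        tokens = tokenize(html)
        p = find_seq(tokens, self.prefix)
        if p < 0:
            return None
        begin = p + len(self.prefix)
        # An empty suffix means the labeled item ended the page.
        end = find_seq(tokens, self.suffix, begin) if self.suffix else len(tokens)
        if end < 0:
            return None
        return " ".join(tokens[begin:end])


def extract_all(pages, label_fn):
    """Extract one item per page, asking label_fn only when no stored instance matches."""
    instances, results, n_labeled = [], [], 0
    for html in pages:
        value = None
        for inst in instances:
            value = inst.extract(html)
            if value is not None:
                break
        if value is None:
            # No labeled instance covers this page: request a label and keep it.
            value = label_fn(html)
            instances.append(LabeledInstance(html, value))
            n_labeled += 1
        results.append(value)
    return results, n_labeled
```

In this sketch, `label_fn` plays the role of the human labeler. It is called once per unseen template rather than once per page, which is the labeling saving the abstract attributes to the instance-based approach.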
