Journal: Expert Systems with Applications

Predicate enrichment of aligned XPaths for wrapper induction

Abstract

Extracting data from various semi-structured sources is a topic that has received a lot of attention. Wrapper induction specifically has been studied extensively, where users annotate a couple of data sources with examples of the data they want, after which a procedure (wrapper) is constructed that can optimally extract similar data as well. In this paper a novel wrapper induction approach is proposed, exploiting the premise of the general applicability of the XPath query language, studied specifically within the context of web pages. After a user annotates a limited set of web pages with the required data, a generalised XPath is constructed that is capable of extracting the examples and, optimally, similar data as well. This generalised baseline XPath is then enriched with predicates, based on context and structure of the data sources, to optimise the precision/recall balance of the data extraction capability of the wrapper. Six variations of such limiting predicates are introduced and investigated. In this paper, it is shown that the baseline approach often generalises the samples too much, leading to a decreased precision. Enriching the baseline wrapper by the addition of predicates limits the generalisation power of the queries in an intelligent manner. Experimental results show that there is a significant improvement in the overall precision of the generalised query, without an excessive loss in recall. Documented tests and real world experience with a large amount of data show that the technique is flexible, easily understood and applicable in a broad range of applications. It is not only of interest in the fields of web information retrieval, but can also be used in the contexts of, e.g., reverse engineering of databases, ontology expansion and deep web data mining, as both simple lists of data and complex structures can be extracted. (C) 2016 Elsevier Ltd. All rights reserved.
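
The idea of generalising annotated examples into one XPath and then restricting it with a predicate can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not describe the alignment procedure or name the six predicate types, so the step-wise alignment, the shared-attribute predicate, the sample page, and the helper names `generalise`, `attribute_predicate`, and `score` are all assumptions made for illustration. It uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page: three list items, one with an extra "author" span.
PAGE = """<html><body><div><ul>
  <li><span class="title">Book A</span><span class="price">10</span></li>
  <li><span class="title">Book B</span><span class="author">Ann</span><span class="price">12</span></li>
  <li><span class="title">Book C</span><span class="price">15</span></li>
</ul></div></body></html>"""


def generalise(xpaths):
    """Align example XPaths step by step (assumed strategy, not the paper's).

    A step shared by all examples is kept; where the steps differ only in
    their positional index, the index is dropped.
    """
    split = [p.split("/") for p in xpaths]
    if len({len(s) for s in split}) != 1:
        raise ValueError("this sketch only aligns paths of equal depth")
    steps_out = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            steps_out.append(steps[0])
            continue
        names = {re.sub(r"\[\d+\]$", "", s) for s in steps}
        if len(names) != 1:
            raise ValueError("steps differ in tag name; not handled here")
        steps_out.append(names.pop())
    return "/".join(steps_out)


def attribute_predicate(root, example_xpaths):
    """Build a predicate from attributes shared by all annotated nodes."""
    nodes = [root.find(p) for p in example_xpaths]
    shared = set(nodes[0].attrib.items())
    for node in nodes[1:]:
        shared &= set(node.attrib.items())
    return "".join(f"[@{k}='{v}']" for k, v in sorted(shared))


def score(root, xpath, gold):
    """Return (precision, recall) of the text extracted by xpath against gold."""
    extracted = [e.text for e in root.findall(xpath)]
    hits = sum(1 for text in extracted if text in gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


root = ET.fromstring(PAGE)

# Two annotated examples: the prices of Book A and Book B.
examples = ["./body/div/ul/li[1]/span[2]", "./body/div/ul/li[2]/span[3]"]
gold = {"10", "12", "15"}  # all prices the user actually wants

baseline = generalise(examples)
enriched = baseline + attribute_predicate(root, examples)

print(baseline, score(root, baseline, gold))
print(enriched, score(root, enriched, gold))
```

In this toy page the baseline query matches every `span` in the list, so three of its seven results are prices; the enriched query keeps only the spans whose `class` matches the annotated examples. Real pages will rarely be this clean, and a predicate that is too specific can cost recall, which is the trade-off the abstract refers to.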