Predicate enrichment of aligned XPaths for wrapper induction

Nielandt Joachim; Bronselaer Antoon; de Tre Guy

Abstract

Extracting data from various semi-structured sources is a topic that has received a lot of attention. Wrapper induction specifically has been studied extensively, where users annotate a couple of data sources with examples of the data they want, after which a procedure (wrapper) is constructed that can optimally extract similar data as well. In this paper a novel wrapper induction approach is proposed, exploiting the premise of the general applicability of the XPath query language, studied specifically within the context of web pages. After a user annotates a limited set of web pages with the required data, a generalised XPath is constructed that is capable of extracting the examples and, optimally, similar data as well. This generalised baseline XPath is then enriched with predicates, based on context and structure of the data sources, to optimise the precision/recall balance of the data extraction capability of the wrapper. Six variations of such limiting predicates are introduced and investigated. In this paper, it is shown that the baseline approach often generalises the samples too much, leading to a decreased precision. Enriching the baseline wrapper by the addition of predicates limits the generalisation power of the queries in an intelligent manner. Experimental results show that there is a significant improvement in the overall precision of the generalised query, without an excessive loss in recall. Documented tests and real world experience with a large amount of data show that the technique is flexible, easily understood and applicable in a broad range of applications. It is not only of interest in the fields of web information retrieval, but can also be used in the contexts of, e.g., reverse engineering of databases, ontology expansion and deep web data mining, as both simple lists of data and complex structures can be extracted. (C) 2016 Elsevier Ltd. All rights reserved.
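The two stages described in the abstract (generalising annotated examples into a baseline XPath, then narrowing it with predicates) can be illustrated with a minimal sketch. It is not the authors' algorithm: the page, the example paths, the helper names `generalise`, `enrich` and `precision_recall`, and the choice of a shared-attribute predicate are all assumptions made for illustration, and the abstract does not say whether such a predicate is one of the six variations studied. It uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page: one navigation list plus two result lists.
PAGE = """<html><body>
<div class="nav"><ul><li>Home</li><li>About</li></ul></div>
<div class="results"><ul><li>Alpha</li><li>Beta</li></ul></div>
<div class="results"><ul><li>Gamma</li><li>Delta</li></ul></div>
</body></html>"""

ROOT = ET.fromstring(PAGE)
# Two user-annotated examples (Alpha and Delta), relative to <html>.
EXAMPLES = ["./body/div[2]/ul/li[1]", "./body/div[3]/ul/li[2]"]
# Items the user actually wants (assumed ground truth for this sketch).
WANTED = {"Alpha", "Beta", "Gamma", "Delta"}


def generalise(xpaths):
    """Align equal-length paths step by step; drop indices that disagree."""
    split = [p.split("/") for p in xpaths]
    if len({len(s) for s in split}) != 1:
        raise ValueError("this sketch only aligns paths of equal length")
    steps = []
    for column in zip(*split):
        if len(set(column)) == 1:
            steps.append(column[0])  # all examples agree on this step
            continue
        tags = {re.sub(r"\[\d+\]$", "", step) for step in column}
        if len(tags) != 1:
            raise ValueError("examples disagree on a tag name")
        steps.append(tags.pop())  # same tag, different position
    return "/".join(steps)


def enrich(root, examples, generalised):
    """Add [@attr='value'] for attributes shared by all example ancestors."""
    example_steps = [p.split("/") for p in examples]
    steps = generalised.split("/")
    for i in range(1, len(steps)):
        if "[" in steps[i]:
            continue  # step already carries a predicate
        # Elements that each example path passes through at step i.
        nodes = [root.find("/".join(s[: i + 1])) for s in example_steps]
        shared = set(nodes[0].attrib.items())
        for node in nodes[1:]:
            shared &= set(node.attrib.items())
        for name, value in sorted(shared):
            steps[i] += f"[@{name}='{value}']"
    return "/".join(steps)


def precision_recall(found, wanted):
    """Return (precision, recall) of the extracted texts."""
    hits = len(set(found) & wanted)
    return hits / len(found), hits / len(wanted)


BASELINE = generalise(EXAMPLES)            # ./body/div/ul/li
ENRICHED = enrich(ROOT, EXAMPLES, BASELINE)  # ./body/div[@class='results']/ul/li

for label, query in (("baseline", BASELINE), ("enriched", ENRICHED)):
    texts = [li.text for li in ROOT.findall(query)]
    print(label, query, texts, precision_recall(texts, WANTED))
```

On this toy page the baseline query also returns the two navigation items, so its precision is lower than that of the enriched query, while both retrieve all four wanted items. This mirrors, in miniature, the precision/recall trade-off the abstract describes.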

Bibliographic details

Source
Expert Systems with Applications, 2016, issue 6, pp. 259-275 (17 pages)
Authors
Nielandt Joachim; Bronselaer Antoon; de Tre Guy
Author affiliations

Univ Ghent, Dept Telecommun & Informat Proc, St Pietersnieuwstr 41, B-9000 Ghent, Belgium;

Univ Ghent, Dept Telecommun & Informat Proc, St Pietersnieuwstr 41, B-9000 Ghent, Belgium;

Univ Ghent, Dept Telecommun & Informat Proc, St Pietersnieuwstr 41, B-9000 Ghent, Belgium;

Original format: PDF
Language: English
Keywords
List alignment; XPath alignment; Predicate enrichment; Wrapper induction

