NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model

Abstract

As the most popular information publishing platform, the Web contains a great deal of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied over the last decade, few easy-to-use web data mining tools are available for extracting useful data from the Web. Web information extraction is an end-to-end process involving web page navigation, data extraction, and data integration. Unfortunately, most existing studies and systems give insufficient consideration to this three-stage process. Most also lack rules powerful enough to express the flexible extraction logic needed for data records with complicated structure. In this paper, we propose a novel web data extraction language, NEXIR, designed around a three-stage web data extraction model. First, the language can define rules that let the system automate navigation across web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complicated rules to extract data records from web pages and integrate the extracted data into a predefined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.
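The three stages named in the abstract (navigation, extraction, integration) can be illustrated with a minimal sketch. This is not NEXIR syntax and not the authors' engine; the abstract does not show either. It is a plain Python illustration using only the standard library, and everything in it is an assumption made for the example: the in-memory `PAGES` dictionary standing in for a website, the `rel="next"` pagination link, the `<li class="rec">` record markup, and the helper names `RecordParser`, `navigate`, `integrate`, and `run`.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML); stands in for real HTTP fetches.
PAGES = {
    "/list?page=1": (
        '<ul><li class="rec"><b>Widget</b><i>9.50</i></li></ul>'
        '<a rel="next" href="/list?page=2">next</a>'
    ),
    "/list?page=2": '<ul><li class="rec"><b>Gadget</b><i>12.00</i></li></ul>',
}


class RecordParser(HTMLParser):
    """Stage 2 (extraction): collect <b>/<i> text inside each <li class="rec">."""

    def __init__(self):
        super().__init__()
        self.records = []      # one dict of raw fields per data record
        self.next_url = None   # target of the rel="next" link, if any
        self._field = None     # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "rec":
            self.records.append({})
        elif tag in ("b", "i") and self.records:
            self._field = tag
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        if self._field:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == self._field:
            self._field = None


def navigate(start_url):
    """Stage 1 (navigation): follow rel="next" links, yielding raw records."""
    url, seen = start_url, set()
    while url and url not in seen:
        seen.add(url)
        parser = RecordParser()
        parser.feed(PAGES[url])
        yield from parser.records
        url = parser.next_url


def integrate(raw):
    """Stage 3 (integration): map raw fields onto a predefined target schema."""
    return {"name": raw["b"], "price": float(raw["i"])}


def run(start_url):
    """Run all three stages from a start URL and return integrated records."""
    return [integrate(r) for r in navigate(start_url)]
```

Calling `run("/list?page=1")` walks both pages and returns one integrated dictionary per record. In a rule language such as the one the abstract proposes, the hard-coded tag names and link conventions here would presumably be expressed as declarative rules rather than Python code.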

Bibliographic details

Source
International Conference on Web Information Systems Engineering | 2013 | pp. 29-42 | 14 pages
Authors
Shengsheng Shi; Wu Wei; Yulong Liu; Haitao Wang; Lei Luo; Chunfeng Yuan; Yihua Huang
Original format: PDF
Keywords
Web data extraction; Extraction rule language; Data record; Web page navigation; Web data integration
