Conference: International Conference on Web Information Systems Engineering

NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model

Abstract

As the most popular information publishing platform, the Web contains a large amount of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied in the last decade, there are still few easy-to-use web data mining tools that let users extract useful data from the Web. Web information extraction is an end-to-end process involving web page navigation, data extraction, and data integration. Unfortunately, most existing studies and systems do not give sufficient consideration to this three-stage process. Most of them also lack rules powerful enough to express the flexible extraction logic needed to extract data records with complicated structure. In this paper, we propose a novel web data extraction language, NEXIR, designed for a three-stage web data extraction model. First, the language can define rules that let the system automate the navigation of web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complex rules to extract data records from web pages and integrate the extracted data into a predefined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.