International Journal of Data Warehousing and Mining

Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System

Abstract

The extraction of comparative, heterogeneous web content data, both derived and historical, from related web pages is still in its infancy. Discovering potentially useful and previously unknown information or knowledge from web contents, for example answering a query such as "list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication," requires finding the schema of web documents from different web pages, performing web content data integration, and building a virtual or physical data warehouse before web content can be extracted and mined from the database. This paper proposes a technique for automatic web content data extraction, the WebOMiner system, which models web sites of a specific domain, such as Business to Customer (B2C) web sites, as object-oriented database schemas. Non-deterministic finite state automata (NFA) based wrappers for recognizing content types from this domain are then built and used to extract related contents from data blocks into an integrated database for future second-level mining for deep knowledge discovery.
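The abstract does not give the automaton itself, so the following is only a minimal sketch of how an NFA-based wrapper might recognize one content type in a data block. The token names (`image`, `title`, `text`, `price`), the state names, the `PRODUCT_NFA` table, and the `accepts` helper are all hypothetical and are not taken from the WebOMiner system.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Transition table: (state, token) -> set of possible next states.
# Everything below is an illustrative assumption, not the paper's automaton.
Transitions = Dict[Tuple[str, str], Set[str]]

PRODUCT_NFA: Transitions = {
    ("start", "image"): {"has_image"},
    ("start", "title"): {"has_title"},             # the image is optional
    ("has_image", "title"): {"has_title"},
    ("has_title", "text"): {"has_title"},          # any number of description lines
    # Non-deterministic step: a price may end the block, or it may be a
    # list price that is followed by a second (sale) price.
    ("has_title", "price"): {"accept", "has_price"},
    ("has_price", "price"): {"accept"},
}


def accepts(
    transitions: Transitions,
    tokens: Iterable[str],
    start: str = "start",
    accepting: FrozenSet[str] = frozenset({"accept"}),
) -> bool:
    """Simulate the NFA by tracking the set of states reachable so far."""
    current: Set[str] = {start}
    for token in tokens:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, token), set())
        }
        if not current:  # no state can consume this token
            return False
    return bool(current & accepting)


if __name__ == "__main__":
    block = ["image", "title", "text", "price"]
    print(accepts(PRODUCT_NFA, block))
```

Tracking a set of current states avoids backtracking: each token is consumed once, and a block is recognized as this hypothetical "product" content type only if an accepting state is reachable after the last token.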
