
Web data extraction based on structural similarity



Abstract

Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
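The idea of representing a document as a vector of structural patterns can be illustrated with a small sketch. The abstract does not define how document schemata are derived, so the code below substitutes a deliberately simple stand-in: each root-to-node tag path is treated as one "pattern", and a page becomes a count vector over those paths. The names `TagPathCollector`, `structure_vector` and `cosine_similarity` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Counts root-to-node tag paths (an assumed stand-in for document schemata)."""

    # Elements that never have a closing tag, so they must not stay on the stack.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def structure_vector(html):
    """Return a sparse vector (Counter) of tag-path counts for one page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


list_page = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
other_list = "<html><body><ul><li>x</li></ul></body></html>"
table_page = "<html><body><table><tr><td>x</td></tr></table></body></html>"

# Two list pages share more structure than a list page and a table page.
print(cosine_similarity(structure_vector(list_page), structure_vector(other_list)))
print(cosine_similarity(structure_vector(list_page), structure_vector(table_page)))
```

Under this assumption, vectors of this kind could feed a standard clustering or classification step, which is one plausible reading of how a shared structural representation lets the phases named in the abstract build on a common foundation.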

Bibliographic details

  • Source
    Knowledge and Information Systems | 2005, Issue 4 | pp. 438-461 | 24 pages
  • Authors

    Zhao Li; Wee Keong Ng; Aixin Sun;

  • Author affiliations

    Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University;

    Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University;

    Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University;

  • Original format: PDF
  • Text language: English
  • Keywords

    Classification; Clustering; Framework; Web data extraction;

