Data extraction and integration of semistructured documents.

Abstract

Before the vision of the Semantic Web, in which data is shared in a meaningful and effective way, is realized, we have to deal with large volumes of legacy HTML documents. Information in these documents is buried in the text because HTML is designed for visual rendering, not for describing data. State-of-the-art information retrieval techniques rely on keyword-based search engines, which do not support structured queries on the documents. A user may [...] to facilitate visual browsing and data management. Existing approaches do not support automated integration of heterogeneous documents.

This dissertation aims to address these issues and make the information buried in HTML documents accessible to users and applications. Transforming the whole Web into a structured collection of documents is intractable. Thus, we focus on topic-specific HTML documents: documents pertaining to a specific topic, authored by different people from diverse data sources.

We present Quixote, a tool that integrates topic-specific HTML documents into XML documents conforming to a global schema. It consists of three components:

(1) Document Converter. It extracts information from HTML documents and encodes that information in XML documents. It extracts the information automatically, using rules that are insensitive to changes in data format and are applicable to diverse sources of data. It does not assume that the documents follow a known format; it assumes only that the records within a document follow some regular format.

(2) Schema Miner. We propose a new type of approximate schema, called a majority schema, that describes only the prevalent structures in a collection of XML documents. The Schema Miner infers a majority schema from the documents, which [...]

(3) Document Transformer. It automatically integrates XML documents based on the discovered majority schema. It adapts techniques from schema integration approaches for relational data to XML data. It addresses the unique challenge of preserving the semantics of the documents during integration, since a majority schema does not cover all structures in the documents.
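
To make the idea of a majority schema more concrete, here is a minimal Python sketch. It is an illustration only and is not the Schema Miner algorithm from the dissertation, which the abstract does not specify. It assumes that a "structure" can be approximated by a root-to-element tag path and that "prevalent" means "occurs in more than half of the documents". The function names `element_paths` and `majority_paths`, the `threshold` parameter, and the sample documents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def majority_paths(xml_docs, threshold=0.5):
    """Keep the paths that occur in more than `threshold` of the documents.

    `threshold` is a hypothetical parameter chosen for illustration.
    """
    counts = Counter()
    for doc in xml_docs:
        # Each path is counted once per document, not once per occurrence.
        counts.update(element_paths(doc))
    cutoff = threshold * len(xml_docs)
    return sorted(path for path, n in counts.items() if n > cutoff)


# Hypothetical sample collection: three records with partly differing structure.
docs = [
    "<book><title>A</title><author>X</author></book>",
    "<book><title>B</title><author>Y</author><isbn>1</isbn></book>",
    "<book><title>C</title><price>9</price></book>",
]
print(majority_paths(docs))
```

In this toy collection the `isbn` and `price` elements each appear in only one of the three documents, so they fall outside the approximate schema. This mirrors the abstract's point that a majority schema does not cover all structures, which is why the integration step has to take care to preserve the semantics of the uncovered parts.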

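Bibliographic record

  • Author: Chung, Yip
  • Author affiliation: University of California, Davis
  • Degree-granting institution: University of California, Davis
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2001
  • Pages: 228 p.
  • Total pages: 228
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology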