Efficient Storage and Domain-Specific Information Discovery on Semistructured Documents


Abstract

The increasing amount of available semistructured data demands efficient mechanisms to store, process, and search an enormous corpus of data to encourage its global adoption. Current techniques to store semistructured documents either map them to relational databases or use a combination of flat files and indexes. Both approaches result in a mismatch between the tree structure of semistructured data and the access characteristics of the underlying storage devices. Furthermore, the inefficiency of XML parsing methods has slowed the large-scale adoption of XML in actual system implementations. The recent development of lazy parsing techniques is a major step towards improving this situation, but lazy parsers still have significant drawbacks that undermine the widespread adoption of XML. Once the processing (storage and parsing) issues for semistructured data have been addressed, another key challenge in leveraging semistructured data is to perform effective information discovery on such data. Previous works have addressed this problem in a generic (i.e., domain-independent) way, but this process can be improved if knowledge about the specific domain is taken into consideration.

This dissertation had two general goals. The first goal was to devise novel techniques to efficiently store and process semistructured documents. This goal had two specific aims: We proposed a method for storing semistructured documents that maps the physical characteristics of the documents to the geometrical layout of hard drives. We developed a Double-Lazy Parser for semistructured documents, which introduces lazy behavior in both the pre-parsing and progressive parsing phases of the standard Document Object Model's parsing mechanism.

The second goal was to construct a user-friendly and efficient engine for performing Information Discovery over domain-specific semistructured documents. This goal also had two aims: We presented a framework that exploits domain-specific knowledge to improve the quality of the information discovery process by incorporating domain ontologies. We also proposed meaningful evaluation metrics to compare the results of search systems over semistructured documents.
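The two phases of lazy parsing named in the abstract can be illustrated with a minimal Python sketch. This is a generic illustration, not the dissertation's Double-Lazy Parser: here the pre-parsing pass is still eager (it scans the whole document once to record the offsets of the root's children), and only the progressive parsing phase is lazy (a subtree is parsed when first accessed). The names `pre_parse` and `LazyChild` are hypothetical, and the scanner assumes a simplified XML subset with no comments, CDATA sections, namespaces, or `<`/`>` characters inside attribute values.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening, closing, or self-closing tag (simplified XML subset).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*?)(/?)>")


def pre_parse(text):
    """Pre-parsing phase: record (tag, start, end) for each child of the root.

    Only tag boundaries are scanned; no subtree is built here.
    """
    skeleton = []
    depth = 0
    child_tag, child_start = None, None
    for m in TAG.finditer(text):
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            depth -= 1
            if depth == 1:  # a direct child of the root just closed
                skeleton.append((child_tag, child_start, m.end()))
        elif self_closing:
            if depth == 1:  # an empty direct child of the root
                skeleton.append((name, m.start(), m.end()))
        else:
            if depth == 1:  # a direct child of the root just opened
                child_tag, child_start = name, m.start()
            depth += 1
    return skeleton


class LazyChild:
    """Progressive parsing phase: parse a subtree only on first access."""

    def __init__(self, text, tag, start, end):
        self.tag = tag
        self._text, self._start, self._end = text, start, end
        self._element = None

    @property
    def parsed(self):
        return self._element is not None

    def element(self):
        if self._element is None:
            fragment = self._text[self._start:self._end]
            self._element = ET.fromstring(fragment)
        return self._element


# Illustrative document (invented for this sketch).
DOC = """<library>
  <book id="1"><title>Lazy XML</title><year>2002</year></book>
  <journal id="2"/>
  <book id="3"><title>Semistructured Data</title></book>
</library>"""

children = [LazyChild(DOC, *entry) for entry in pre_parse(DOC)]
tags = [c.tag for c in children]  # available without parsing any subtree
third_title = children[2].element().findtext("title")  # parses one subtree
parsed_flags = [c.parsed for c in children]
```

After the last line, only the third child has been materialized; the first two remain unparsed offsets into the original text. A parser that is lazy in both phases, as the abstract describes, would additionally avoid scanning the whole document up front.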

Bibliographic record

  • Author

    Farfan Fernando R

  • Year 2009
  • Original format PDF
