Conference: ACM SIGMOD International Conference on Management of Data

NoDoSE---a tool for semi-automatically extracting structured and semistructured data from text documents

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
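
The workflow in the abstract (the user outlines a few regions, a mining component guesses the rest of the structure, and the data is then extracted) can be illustrated with a small sketch. This is not NoDoSE's algorithm or API: the `Region` class, `infer_separator`, `propose_records`, and the sample text are all hypothetical, and the "mining" step here is only a toy heuristic that assumes records are separated by one fixed string.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A labeled span [start, end) of the document; children nest inside it."""
    label: str
    start: int
    end: int
    children: list["Region"] = field(default_factory=list)


def infer_separator(text: str, marked: list[Region]) -> str:
    """Guess the record separator from the gaps between user-marked records.

    Toy stand-in for a mining step: it requires at least two marked
    examples and assumes every gap between them is the same non-empty string.
    """
    gaps = {text[a.end:b.start] for a, b in zip(marked, marked[1:])}
    if len(gaps) != 1 or "" in gaps:
        raise ValueError("marked examples do not share one non-empty separator")
    return gaps.pop()


def propose_records(text: str, root: Region, marked: list[Region]) -> list[Region]:
    """Extend the user's examples to the whole root region using the guess."""
    sep = infer_separator(text, marked)
    records, pos = [], root.start
    for chunk in text[root.start:root.end].split(sep):
        if chunk:  # skip the empty piece after a trailing separator
            records.append(Region(marked[0].label, pos, pos + len(chunk)))
        pos += len(chunk) + len(sep)
    return records


# Hypothetical sample document, not taken from the paper.
text = "ann,31\nbob,27\ncat,45\n"
root = Region("file", 0, len(text))

# Hypothetical user input: the first two records outlined by hand.
marked = [Region("person", 0, 6), Region("person", 7, 13)]

# The toy miner proposes the remaining records; extraction follows.
root.children = propose_records(text, root, marked)
rows = [text[r.start:r.end].split(",") for r in root.children]
```

Here the user marks two records and the sketch proposes the third. In an interactive tool the proposed regions would presumably be shown to the user for confirmation or correction before extraction.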
