NoDoSE---a tool for semi-automatically extracting structured and semistructured data from text documents

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
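The workflow in the abstract (the user outlines a few regions, a mining step guesses the file's structure from them, and the data is then extracted) can be illustrated with a toy sketch. This is not the algorithm from the paper, which the abstract does not specify. It assumes a delimiter-separated text file and infers only a single separator string per level. The names `Region`, `infer_separator`, and `split_region` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A labeled span of the document; children form the hierarchical decomposition."""
    label: str
    start: int
    end: int  # exclusive
    children: list["Region"] = field(default_factory=list)


def infer_separator(text, examples):
    """Guess a separator from the gaps between adjacent user-marked (start, end) spans.

    Assumption: every gap between consecutive examples is the same non-empty string.
    """
    ordered = sorted(examples)
    gaps = {text[a_end:b_start] for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])}
    if len(gaps) != 1:
        raise ValueError("examples do not agree on a single separator")
    sep = gaps.pop()
    if not sep:
        raise ValueError("examples are contiguous; no separator to infer")
    return sep


def split_region(text, parent, label, sep):
    """Decompose `parent` into child regions at every occurrence of `sep`."""
    pos = parent.start
    while pos <= parent.end:
        nxt = text.find(sep, pos, parent.end)
        end = parent.end if nxt == -1 else nxt
        parent.children.append(Region(label, pos, end))
        if nxt == -1:
            break
        pos = nxt + len(sep)
    return parent


text = "alice,30\nbob,25\ncarol,41"
doc = Region("file", 0, len(text))

# The "user" marks the first two records; the separator is inferred from the gap.
record_sep = infer_separator(text, [(0, 8), (9, 15)])
split_region(text, doc, "record", record_sep)

# The "user" marks two fields in the first record; every record is then split the same way.
field_sep = infer_separator(text, [(0, 5), (6, 8)])
for record in doc.children:
    split_region(text, record, "field", field_sep)

# Extract the leaf regions into rows, e.g. for loading into a DBMS table.
rows = [[text[f.start:f.end] for f in record.children] for record in doc.children]
print(rows)
```

The tree of `Region` objects mirrors the hierarchical decomposition the abstract describes: the file contains records and each record contains fields. A real structure miner would need to handle variable-length fields, optional elements, and markup such as HTML, none of which this sketch attempts.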


Bibliographic details

Source: ACM SIGMOD International Conference on Management of Data, 1998, pp. 283-294 (12 pages)
Author: Brad Adelberg
Original format: PDF
Chinese Library Classification: various special-purpose databases
