Extracting Characteristic Structures among Words in Semistructured Documents


Abstract

Electronic documents such as SGML/HTML/XML files and LaTeX files have been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k > 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k). A tree-association pattern on (W_1, W_2, ..., W_k) is a sequence


Bibliographic details

Source
Advances in Knowledge Discovery and Data Mining | 2002 | pp. 356-367 | 12 pages
Authors
Kazuyoshi Furukawa; Tomoyuki Uchida; Kazuya Yamada; Tetsuhiro Miyahara; Takayoshi Shoudai; Yasuaki Nakamura
Original format: PDF
Chinese Library Classification: Automation technology, computer technology
