Published in: Advances in Knowledge Discovery and Data Mining

Extracting Characteristic Structures among Words in Semistructured Documents

Abstract

The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large plain texts, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k > 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k). A tree-association pattern on (W_1, W_2, ..., W_k) is a sequence