The number of electronic documents, such as SGML/HTML/XML files and LaTeX files, has been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k > 2 be an integer and let (W_1, W_2, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W_1, W_2, ..., W_k). A tree-association pattern on (W_1, W_2, ..., W_k) is a sequence ...