Published in: Advances in Knowledge Discovery and Data Mining (conference proceedings)

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Abstract

Many Web documents, such as HTML and XML files, have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree-structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge-labeled tree with ordered children that has structured variables. An edge label is a tag or a keyword in such Web documents, and a variable can be replaced by an arbitrary tree, so a tag tree pattern is well suited to representing tree-structured patterns in such documents. First, we show that computing the optimum frequent tag tree pattern is hard. We therefore present an algorithm for generating all maximally frequent tag tree patterns and show its correctness. Finally, we report experimental results for the algorithm. Although the algorithm is not efficient, the experiments show that it can extract characteristic tree-structured patterns from such data.
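The notion of a tag tree pattern with variables can be illustrated with a small sketch. This is not the algorithm from the paper. It is a deliberately simplified model in which a variable edge matches exactly one edge with any label and any subtree below it, and a pattern must match a document from the root. The names `VAR`, `matches`, and `frequency`, as well as the list-of-pairs tree encoding, are assumptions made for illustration only.

```python
# Simplified illustration only: an edge-labeled ordered tree is encoded as a
# list of (edge_label, subtree) pairs, in child order. This encoding and all
# names below are hypothetical and are not taken from the paper.

VAR = None  # hypothetical marker for a variable edge in a pattern


def matches(pattern, tree):
    """Return True if the pattern matches the tree from the root.

    Assumption: a variable edge stands for one edge with any label and an
    arbitrary subtree below it. The paper's substitution rule is more general.
    """
    if len(pattern) != len(tree):
        return False
    for (p_label, p_sub), (t_label, t_sub) in zip(pattern, tree):
        if p_label is VAR:
            continue  # the variable absorbs this edge and its whole subtree
        if p_label != t_label or not matches(p_sub, t_sub):
            return False
    return True


def frequency(pattern, docs):
    """Fraction of documents matched by the pattern."""
    return sum(matches(pattern, d) for d in docs) / len(docs)


# Three toy documents, encoded as edge-labeled ordered trees.
doc1 = [("html", [("head", []), ("body", [("p", [])])])]
doc2 = [("html", [("head", [("title", [])]), ("body", [("p", [])])])]
doc3 = [("html", [("body", [])])]

# Pattern: an "html" edge whose first child edge is a variable and whose
# second child edge is "body" with a single "p" edge below it.
pattern = [("html", [(VAR, []), ("body", [("p", [])])])]

print(frequency(pattern, [doc1, doc2, doc3]))
```

In this toy setting the pattern matches the first two documents but not the third, which has only one edge below `html`. A pattern would count as frequent if this fraction reached a user-chosen threshold; enumerating the maximally frequent patterns, which is the subject of the paper, is not attempted here.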
