Conference: Knowledge discovery and data mining

A Wavelet Transform Based Structural Similarity Model for Semi-structured Texts

Abstract

Semi-structured texts, including XML and HTML documents, are a basic information format on the Internet and the World Wide Web. A semi-structured text has two aspects: its text content values and its tree-organized structure. The same text content under different structures usually denotes different objects, so the structural similarity of semi-structured texts is essential for searching, indexing, retrieving, querying, and comparing information in web pages. We present a Wavelet Transform Based Structural Similarity Model (WTBSSM) that measures the structural similarity of semi-structured texts quickly and compresses the structural information into a short vector, with the aim of building an efficient index system for semi-structured texts. The paper introduces a Binary Encoding Method that converts a semi-structured text into a {-1, 1} sequence. The resulting text structure signals are then decomposed by the Discrete Wavelet Transform (DWT) to obtain the approximation coefficients, which are only half the length of the original signals. Finally, structural similarity is measured by the Euclidean distance between approximation coefficients. The experimental results show that WTBSSM, using only a half or a quarter of the information, keeps almost the same distance distribution as the direct distance between the original signals. A comparison with a method based on shortened DWT coefficients suggests that WTBSSM performs better.
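The pipeline summarized in the abstract (binary encoding, DWT approximation coefficients, Euclidean distance) can be illustrated with a minimal Python sketch. The abstract does not give the exact encoding rule, the wavelet family, or how signals of unequal length are aligned, so the following choices are assumptions made only for illustration: an opening tag maps to +1 and a closing tag to -1, the Haar wavelet is used, and shorter signals are zero-padded. The function names `encode_structure`, `haar_approx`, and `structural_distance` are hypothetical and do not come from the paper.

```python
import math
import re

# Matches opening, closing, and self-closing tags; comments and doctypes are skipped.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^>]*?(/?)>")


def encode_structure(text):
    """Turn markup into a {-1, 1} sequence.

    Assumed rule (not stated in the abstract): opening tag -> +1,
    closing tag -> -1, self-closing tag -> +1 followed by -1.
    """
    signal = []
    for match in TAG_RE.finditer(text):
        closing, _name, self_closing = match.groups()
        if closing:
            signal.append(-1)
        elif self_closing:
            signal.extend([1, -1])
        else:
            signal.append(1)
    return signal


def haar_approx(signal, levels=1):
    """Return Haar approximation coefficients after `levels` decompositions.

    Each level roughly halves the length. Haar is an assumed wavelet choice.
    """
    coeffs = [float(v) for v in signal]
    for _ in range(levels):
        if len(coeffs) % 2:
            coeffs.append(0.0)  # zero-pad odd-length input (assumption)
        coeffs = [
            (coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
            for i in range(0, len(coeffs), 2)
        ]
    return coeffs


def structural_distance(doc_a, doc_b, levels=1):
    """Euclidean distance between the approximation coefficients of two documents."""
    sig_a, sig_b = encode_structure(doc_a), encode_structure(doc_b)
    n = max(len(sig_a), len(sig_b))
    sig_a += [0] * (n - len(sig_a))  # align lengths by zero-padding (assumption)
    sig_b += [0] * (n - len(sig_b))
    ca, cb = haar_approx(sig_a, levels), haar_approx(sig_b, levels)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


nested = "<a><b></b></a>"
flat = "<a></a><b></b>"
print(structural_distance(nested, nested))  # identical structure
print(structural_distance(nested, flat))    # different structure
```

With `levels=1` the compared vectors are half the length of the encoded signal, and with `levels=2` a quarter, which corresponds to the "half or a quarter of information" mentioned in the abstract.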
