Journal: 计算机应用与软件 (Computer Applications and Software)

Web Information Extraction Based on a Mutual Information Metric

     

Abstract

Extracting valuable information from complex web pages is an important problem in information retrieval and Web data mining. Exploiting the distribution of information across a set of web pages, we propose a Web information extraction method based on a mutual information metric that automatically identifies noisy information and retains the key information. In this method, each web page is parsed into a DOM tree and the mutual information value of every leaf node is calculated. The leaf nodes are then aggregated into blocks according to the structure of the DOM tree, and the mutual information value is computed recursively upward until the value of the <body> tag is obtained; this value serves as the threshold that separates noise from non-noise. Experiments and comparisons on several well-known Chinese websites demonstrate the effectiveness of the proposed method.
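The abstract does not give the scoring formula or the aggregation rule, so the following is only a rough sketch of the pipeline it describes, under stated assumptions. It assumes the DOM tree is already available as a small hypothetical `Node` class, scores each leaf with a pointwise-mutual-information-style value `log(N / df)` (where `N` is the number of pages and `df` is the number of pages containing the leaf's text), and aggregates a block as the mean of its children's scores. None of these choices is confirmed by the source; they merely illustrate how text repeated across a page set can end up below a threshold computed at `<body>`.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag, optional text, and child nodes."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every leaf node beneath (or equal to) the given node."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def document_frequency(pages):
    """Count, for each leaf text, how many pages in the set contain it."""
    df = {}
    for page in pages:
        for text in {leaf.text for leaf in leaves(page)}:
            df[text] = df.get(text, 0) + 1
    return df


def leaf_score(leaf, df, n_pages):
    """Assumed leaf score: log(N / df), zero for text found on every page."""
    return math.log(n_pages / df[leaf.text])


def block_score(node, df, n_pages):
    """Assumed aggregation: a block's score is the mean of its children's scores."""
    if not node.children:
        return leaf_score(node, df, n_pages)
    scores = [block_score(child, df, n_pages) for child in node.children]
    return sum(scores) / len(scores)


def extract(page, pages):
    """Keep the leaf texts whose score exceeds the score computed at the root (<body>)."""
    df = document_frequency(pages)
    n_pages = len(pages)
    threshold = block_score(page, df, n_pages)
    return [leaf.text for leaf in leaves(page)
            if leaf_score(leaf, df, n_pages) > threshold]


def make_page(article_text):
    """Build a toy page: shared navigation and footer around one unique article."""
    return Node("body", children=[
        Node("div", children=[Node("a", "Home"), Node("a", "About")]),
        Node("div", children=[Node("p", article_text)]),
        Node("div", children=[Node("span", "Copyright")]),
    ])


pages = [make_page(t) for t in
         ("Article one text", "Article two text", "Article three text")]
print(extract(pages[0], pages))
```

In this toy page set the navigation and footer text appears on every page, so it scores zero and falls below the `<body>` threshold, while the unique article text is kept. The strict comparison is a design choice of the sketch: when every leaf has the same score, nothing is classified as key information.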
