Journal: IEEE Transactions on Knowledge and Data Engineering

Mining Web informative structures and contents based on entropy analysis



Abstract

We study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also called TOC, i.e., table-of-contents, pages) and a set of article pages linked from these TOC pages. Based on the Hyperlink-Induced Topic Search (HITS) algorithm, we propose an entropy-based analysis mechanism (LAMIS) that analyzes the entropy of anchor texts and links to eliminate redundancy in the hyperlinked structure, so that the complex structure of a Web site can be distilled. However, to increase the value and accessibility of their pages, most content sites publish pages that carry intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. To further eliminate such redundancy, we propose another mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets of article pages. InfoDiscoverer also uses entropy to analyze the information measures of article sets and to extract informative content blocks from these sets. Our results are useful for search engines, information agents, and crawlers that index, extract, and navigate significant information from a Web site. Experiments on several real news Web sites show that the precision and recall of our approaches are far superior to those obtained by conventional methods in mining the informative structures of news Web sites. On average, the augmented LAMIS leads to a prominent performance improvement and increases the precision by a factor ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. Measured against manual heuristics, both the precision and the recall of InfoDiscoverer exceed 0.956.
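The abstract does not spell out how entropy separates redundant blocks from informative ones, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that a term spread evenly across many pages (as in a navigation panel) has high normalized entropy, while a term concentrated in one page has low entropy, and that a block's score is the mean entropy of its terms. The function names, the pre-tokenized input format, the normalization by the number of pages, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def term_entropy(term, pages):
    """Normalized entropy (0..1) of a term's occurrence distribution over pages.

    `pages` is a list of pages; each page is a list of blocks; each block is
    a list of tokens. This input format is a hypothetical simplification.
    """
    n = len(pages)
    counts = [sum(block.count(term) for block in page) for page in pages]
    total = sum(counts)
    if n < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            # Log base n keeps the value in [0, 1] regardless of site size.
            entropy -= p * math.log(p, n)
    return entropy


def block_entropy(block, pages):
    """Mean entropy of the distinct terms in a block (assumed scoring rule)."""
    terms = set(block)
    if not terms:
        return 1.0  # Assumption: an empty block is treated as uninformative.
    return sum(term_entropy(t, pages) for t in terms) / len(terms)


def informative_blocks(pages, threshold=0.5):
    """Keep, per page, the blocks whose entropy is below the threshold."""
    return [
        [block for block in page if block_entropy(block, pages) < threshold]
        for page in pages
    ]


# Toy site: every page repeats the same navigation block plus one article block.
pages = [
    [["home", "sports", "world"], ["election", "results", "tonight"]],
    [["home", "sports", "world"], ["storm", "hits", "coast"]],
    [["home", "sports", "world"], ["market", "rally", "continues"]],
]

for page_blocks in informative_blocks(pages):
    print(page_blocks)
```

On this toy input the repeated navigation block scores close to 1 and is dropped, while each article block scores 0 and is kept. A real site would need HTML block segmentation and tokenization, which this sketch deliberately leaves out.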
