Mining Web informative structures and contents based on entropy analysis

Hung-Yu Kao; Shian-Hua Lin; Jan-Ming Ho; Ming-Syan Chen

Abstract

We study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by these TOC pages. Based on the Hyperlink Induced Topics Search (HITS) algorithm, we propose an entropy-based analysis (LAMIS) mechanism for analyzing the entropy of anchor texts and links to eliminate the redundancy of the hyperlinked structure so that the complex structure of a Web site can be distilled. However, to increase the value and the accessibility of pages, most of the content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, copy announcements, etc. To further eliminate such redundancy, we propose another mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets of article pages. InfoDiscoverer also employs the entropy information to analyze the information measures of article sets and to extract informative content blocks from these sets. Our result is useful for search engines, information agents, and crawlers to index, extract, and navigate significant information from a Web site. Experiments on several real news Web sites show that the precision and the recall of our approaches are much superior to those obtained by conventional methods in mining the informative structures of news Web sites. On the average, the augmented LAMIS leads to prominent performance improvement and increases the precision by a factor ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. In comparison with manual heuristics, the precision and the recall of InfoDiscoverer are greater than 0.956.
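The entropy idea in the abstract can be illustrated with a small sketch. This is not the authors' LAMIS or InfoDiscoverer implementation; it only assumes that a term spread evenly over all pages of a site (such as a navigation label) has high normalized entropy, while a term concentrated in a few pages has low entropy, and that a content block can be scored by the mean entropy of its terms. The function names `term_entropy` and `block_entropy` and the toy pages are hypothetical.

```python
import math
from collections import Counter


def term_entropy(counts):
    """Normalized Shannon entropy of one term's counts across pages.

    Uses log base n (n = number of pages), so the result lies in [0, 1].
    This normalization is an assumption made for the sketch.
    """
    n = len(counts)
    total = sum(counts)
    if n < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p, n)
    return entropy


def block_entropy(block_terms, pages):
    """Mean term entropy over the distinct terms of a content block."""
    page_counts = [Counter(page) for page in pages]
    terms = set(block_terms)
    if not terms:
        return 0.0
    scores = [term_entropy([pc[t] for pc in page_counts]) for t in terms]
    return sum(scores) / len(scores)


# Toy site: every page repeats the same navigation terms,
# and each page has its own article terms.
pages = [
    ["home", "sports", "news", "election", "results"],
    ["home", "sports", "news", "storm", "warning"],
    ["home", "sports", "news", "market", "rally"],
]
nav_block = ["home", "sports", "news"]
article_block = ["election", "results"]

nav_score = block_entropy(nav_block, pages)
article_score = block_entropy(article_block, pages)
print(nav_score, article_score)
```

In this toy data the navigation block scores close to 1 (redundant across the site) and the article block scores close to 0 (informative). A cutoff between the two would have to be chosen empirically; the abstract does not state one.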

Bibliographic details

Source
IEEE Transactions on Knowledge and Data Engineering | 2004, Issue 1 | pp. 41-55 | 15 pages
Authors
Hung-Yu Kao; Shian-Hua Lin; Jan-Ming Ho; Ming-Syan Chen
Author affiliation
Dept. of Electr. Eng., Nat. Taiwan Univ., Taipei, Taiwan
Original format
PDF
Language
English
Chinese Library Classification
Computing technology, computer technology
Keywords
publishing; Web sites; data mining; information retrieval; text analysis; search engines; Web informative structure mining; entropy analysis; news Web site; hyperlinked documents; index pages; table of contents; TOC pages; hyperlink induced topics search; HITS; entropy-based analysis mechanism; anchor texts; LAMIS; intrasite redundant information; navigation panels; advertisements; copy announcements; InfoDiscoverer; article pages; entropy information; information measures; informative content blocks; information agents; Web crawlers; information extraction

Similar documents

1. WISDOM: Web intrapage informative structure mining based on document object model [J]. Hung-Yu Kao, Jan-Ming Ho, Ming-Syan Chen. IEEE Transactions on Knowledge and Data Engineering, 2005, Issue 5.
2. Dynamic user profiles using fusion of Web Structure, Web content and Web Usage Mining [J]. Prof. Gajendra S. Chandel, Prof. Ravindra Gupta, Mr. Hemant K. Dhamecha. International Journal of Engineering Research and Applications, 2012, Issue 3.
3. Entropy based informative content density approach for efficient web content extraction [C]. Manjusha Annam, G P Sajeev. International conference on advances in computing, communications and informatics, 2016.
4. Characterizing different brain structures based on information content: Diffusion entropy [D]. Fozouni, Niloufar. 2010.
5. AHCODA-DB: a data repository with web-based mining tools for the analysis of automated high-content mouse phenomics data [O]. Bastijn Koopmans, August B. Smit, Matthijs Verhage. 2017.
6. Mining web informative structures and contents based on entropy analysis [O]. Hung-yu Kao, Shian-hua Lin, Jan-ming Ho. 2004.
