Journal: Expert Systems with Applications

Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis

Abstract

Sitemaps designed by webmasters not only present the main usage flows for users but also organize the hierarchical concepts of a website. However, websites seldom provide sitemap pages that help users browse easily. Even when sitemaps are provided, they are usually not machine-readable, although a few websites do provide sitemaps in XML format. In this paper, we develop a system, Site-Map Generator (SMG), that automatically generates a hierarchical sitemap for a website. SMG consists of five components. Sequence Translator translates a page's HTML source into a long sequence, and Page Partitioner then splits the page into blocks by analyzing the sequence complexity. Block Identifier categorizes each block into one of three types: content, structure, or redundant. Using the popular sequence-searching tool BLAST, Block Cluster calculates similarities between blocks so that blocks with similar functionality are grouped and treated as candidate blocks for the sitemap. Finally, Hyperlink Analyzer transforms page-to-page links into block-to-block links and applies Kleinberg's HITS algorithm to estimate the authority and hub values of each block. A block entropy value derived from feature entropies is also used to improve HITS. Experiments on three websites (Mozilla, CNN, and Yahoo! News) show that SMG is useful for partitioning a page into blocks (F1 = 86%), identifying block types (F1 = 85%), and generating the sitemap for a website (F1 = 63%).
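
To make the last step of the pipeline concrete, the following is a minimal sketch of plain HITS run on a block-to-block link graph. It is not the authors' implementation: the block identifiers, the `block_links` graph, and the `hits` and `normalize` helpers are all hypothetical, the graph is assumed to have already been derived from page-to-page links, and the entropy-based refinement mentioned in the abstract is left out.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all scores are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(links, iterations=50):
    """Plain HITS on a directed graph given as {source_block: [target_blocks]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of all blocks linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of all blocks linked to.
        new_hub = {n: sum(new_auth[d] for d in links.get(n, [])) for n in nodes}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical block graph: navigation blocks pointing at listing blocks.
block_links = {
    "home/nav": ["news/list", "sports/list"],
    "news/nav": ["news/list"],
    "sports/nav": ["sports/list", "news/list"],
}

auth, hub = hits(block_links)
top_authority = max(auth, key=auth.get)
```

In this toy graph the navigation blocks come out as hubs and the listing blocks as authorities. How SMG combines these scores with block entropy to select sitemap entries is not specified in the abstract, so it is not modeled here.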