Conference: International Symposium on Methodologies for Intelligent Systems
Exploiting Web Sites Structural and Content Features for Web Pages Clustering

Abstract

Web page clustering is a central task in Web mining: it helps organize the content of websites, understand their structure, and discover interactions among web pages. The task is difficult because web pages have multiple dimensions, based on textual, hyperlink, and HTML formatting (i.e., HTML tags and visual) properties. Existing algorithms use these sources of information almost independently, mainly because they are difficult to combine. This paper contributes to the clustering of web pages within a website by using a distributional representation that combines all these features into a single vector space. The approach first crawls the website, using the web pages' HTML formatting and web lists, in order to identify the hyperlink structure and represent it by means of an adapted skip-gram model. This hyperlink structure and the textual information are then fused into a single vector space representation. The resulting representation is used to cluster websites using their hyperlink structure and textual information simultaneously. Experiments on real websites show that the proposed method improves clustering results.
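
As a rough illustration of the fusion step described in the abstract, the following Python sketch builds one text vector and one link-structure vector per page, concatenates them into a single vector space, and clusters the result. It is not the authors' implementation. The adapted skip-gram model is replaced by a much simpler stand-in (an indicator of the pages reachable within a few hops), k-means is an assumed choice of clustering algorithm, and every function name, the `window` parameter, and the toy site are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a vector to unit length so both feature types weigh equally."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def text_vectors(docs):
    """Bag-of-words term counts per page, L2-normalized."""
    vocab = sorted({w for text in docs.values() for w in text.lower().split()})
    vectors = {}
    for page, text in docs.items():
        counts = Counter(text.lower().split())
        vectors[page] = l2_normalize([counts[w] for w in vocab])
    return vectors

def structure_vectors(links, window=2):
    """Stand-in for a skip-gram embedding (assumption): mark the page itself
    and every page reachable within `window` hops of the link graph."""
    pages = sorted(links)
    vectors = {}
    for page in pages:
        reached, frontier = {page}, {page}
        for _ in range(window):
            frontier = {t for p in frontier for t in links.get(p, [])} - reached
            reached |= frontier
        vectors[page] = l2_normalize([1.0 if p in reached else 0.0 for p in pages])
    return vectors

def fuse(text_vecs, struct_vecs):
    """Concatenate both views so each page lives in one vector space."""
    return {p: text_vecs[p] + struct_vecs[p] for p in text_vecs}

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    pages = sorted(vectors)
    centers = [vectors[pages[0]]]
    while len(centers) < k:
        far = max(pages, key=lambda p: min(dist2(vectors[p], c) for c in centers))
        centers.append(vectors[far])
    labels = {}
    for _ in range(iters):
        labels = {p: min(range(k), key=lambda i: dist2(vectors[p], centers[i]))
                  for p in pages}
        for i in range(k):
            members = [vectors[p] for p in pages if labels[p] == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical four-page site: two staff pages and two course pages.
docs = {
    "staff/alice": "professor research publications",
    "staff/bob": "professor research teaching",
    "courses/ml": "course syllabus lectures exam",
    "courses/db": "course syllabus lectures project",
}
links = {
    "staff/alice": ["staff/bob"],
    "staff/bob": ["staff/alice"],
    "courses/ml": ["courses/db"],
    "courses/db": ["courses/ml"],
}

fused = fuse(text_vectors(docs), structure_vectors(links))
labels = kmeans(fused, k=2)
print(sorted(labels.items()))
```

Normalizing each view before concatenation keeps either feature type from dominating the distance. A real pipeline would likely replace `structure_vectors` with embeddings learned from link sequences and might weight the two views differently.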