Exploiting Web Sites Structural and Content Features for Web Pages Clustering


Abstract

Web page clustering is a central task in Web mining: it helps organize the content of websites, understand their structure, and discover interactions among web pages. The task is difficult because web pages have multiple dimensions, based on their textual, hyperlink, and HTML formatting (i.e., HTML tags and visual) properties. Existing algorithms use these sources of information almost independently, mainly because they are difficult to combine. This paper contributes to the clustering of web pages within a website by using a distributional representation that combines all of these features in a single vector space. The approach first crawls the website, using the web pages' HTML formatting and web lists to identify the hyperlink structure, and represents that structure by means of an adapted skip-gram model. The hyperlink structure and the textual information are then fused into a single vector space representation. The resulting representation is used to cluster the pages of a website using their hyperlink structure and textual information simultaneously. Experiments on real websites show that the proposed method improves clustering results.
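The fusion-then-cluster step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the adapted skip-gram model is trained or how the two representations are fused, so everything below is an assumption: the per-page link embeddings and text vectors are taken as already computed, fusion is done by concatenating L2-normalized vectors, and clustering uses plain k-means. The URLs, vectors, and function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(link_vec, text_vec):
    """Concatenate the two normalized views so neither dominates by scale.
    (Assumed fusion scheme, not necessarily the one used in the paper.)"""
    return l2_normalize(link_vec) + l2_normalize(text_vec)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-first seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each page goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy website: URL -> (link embedding, text vector).
pages = {
    "/people/alice": ([0.9, 0.1], [1.0, 0.0, 0.2]),
    "/people/bob":   ([0.8, 0.2], [0.9, 0.1, 0.1]),
    "/courses/ml":   ([0.1, 0.9], [0.0, 1.0, 0.3]),
    "/courses/db":   ([0.2, 0.8], [0.1, 0.9, 0.2]),
}

urls = list(pages)
fused = [fuse(*pages[u]) for u in urls]
labels = kmeans(fused, k=2)

for url, label in zip(urls, labels):
    print(label, url)
```

Normalizing each view before concatenation is one simple way to stop the higher-magnitude representation from dominating the distance computation; a weighted combination would be an equally plausible choice.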


Bibliographic details

Source: International Symposium on Methodologies for Intelligent Systems, 2017, 747p, 11 pages in total
Authors: Pasqua Fabiana Lanotte; Fabio Fumarola; Donato Malerba; Michelangelo Ceci
Original format: PDF
Chinese Library Classification: TP18-53

