Journal: Knowledge-Based Systems

CALA: An unsupervised URL-based web page classification system


Abstract

Unsupervised web page classification refers to the problem of clustering the pages of a web site so that each cluster contains web pages that can be assigned to a single class. Existing web page classification proposals do not fulfill a number of requirements that would make them suitable for enterprise web information integration, namely: to rely on lightweight crawling, so as to avoid interfering with the normal operation of the web site; to be unsupervised, which avoids the need for a training set of pre-classified pages; and to use features from outside the page to be classified, which avoids having to download it. In this article, we propose CALA, a new automated technique for generating URL-based web page classifiers. CALA builds a number of URL patterns that represent the different classes of pages in a web site, so that further pages can be classified by matching their URLs against the patterns. Its salient features are that it fulfills all of the above requirements and that it has been validated in a number of experiments on real-world, top-visited web sites. Our validation shows that CALA is very effective and efficient in practice.
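
The final classification step described in the abstract, matching a page's URL against a set of per-class patterns without downloading the page, can be illustrated with a minimal sketch. This is not CALA's algorithm: the abstract does not describe how the patterns are built or what form they take, so the regular expressions, the class labels, the `PATTERNS` table and the `classify` function below are all hypothetical and written by hand purely for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical, hand-written patterns. CALA generates its patterns
# automatically; their real form is not given in the abstract.
PATTERNS = {
    "product": re.compile(r"^/products/\d+$"),
    "category": re.compile(r"^/category/[a-z-]+$"),
    "author": re.compile(r"^/authors/[^/]+$"),
}


def classify(url: str) -> Optional[str]:
    """Return the first class whose pattern matches the URL path, else None."""
    # Only the URL is inspected; the page itself is never fetched.
    path = urlsplit(url).path
    for label, pattern in PATTERNS.items():
        if pattern.match(path):
            return label
    return None


if __name__ == "__main__":
    for url in (
        "https://example.com/products/42?ref=home",
        "https://example.com/category/sci-fi",
        "https://example.com/about",
    ):
        print(url, classify(url))
```

Because only the URL string is examined, a classifier of this kind needs no training labels at matching time and adds no load on the target web site, which is the property the abstract emphasizes.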
