Conference: International Conference on Research in Intelligent and Computing in Engineering

A Novel Technique for Web Pages Clustering Using LSA and K-Medoids Algorithm

Abstract

The expanding range of documents available on the web poses a critical challenge for tasks such as information retrieval (IR), content monitoring, and indexing. A web document can be any type of data that a user requests and a web server delivers through a web browser. Most web documents contain textual content and are typically called web pages. To discover knowledge from these pages, novel techniques that have not been applied in other domains are required. This paper proposes a new approach that applies latent semantic analysis (LSA) to the output of the vector space model (VSM), which captures the correlation between web pages and their extracted features. LSA yields matrices that reflect the correlation between web pages and their related concepts, and these matrices are frequently used in retrieval. The PAM (K-Medoids) algorithm is then applied in the semantic space to partition the web pages into coherent groups. One of the main challenges in any clustering algorithm is identifying the correct number of clusters for the given data. Two approaches are therefore used: elbow graph analysis, which estimates a range for the number of clusters from sum of squared errors (SSE) values, and clustering evaluation metrics. The Calinski-Harabasz criterion (CH) and the Silhouette Coefficient (SC) are among the best-known evaluation metrics for partitioning-based algorithms. UOT was used to evaluate the proposed system, and the results show that it achieves high accuracy in separating similar pages into coherent groups.
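The pipeline described in the abstract (VSM, then LSA, then K-Medoids, then a check on the number of clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy pages, the TF-IDF weighting, the choice of two concepts, the Euclidean distance, and all function names are hypothetical, and the clustering step is a simplified alternating k-medoids heuristic with farthest-first initialisation rather than the full PAM swap search. The Calinski-Harabasz criterion is omitted for brevity.

```python
import numpy as np

def tfidf(docs):
    """VSM step (assumed weighting): TF-IDF matrix from whitespace tokens."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    col = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, t in enumerate(tokens):
        for w in t:
            tf[i, col[w]] += 1.0
    df = (tf > 0).sum(axis=0)  # number of pages containing each term
    return tf * (np.log(len(docs) / df) + 1.0)

def lsa(X, n_concepts):
    """LSA step: truncated SVD; each row is a page in concept space."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_concepts] * s[:n_concepts]

def pairwise_euclidean(Z):
    """Distance matrix between all pages in the semantic space."""
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)

def k_medoids(D, k, max_iter=100):
    """Simplified alternating k-medoids on a precomputed distance matrix."""
    # Deterministic farthest-first initialisation (an assumption, not PAM BUILD).
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        updated = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                updated[c] = members[np.argmin(within)]
        if np.array_equal(updated, medoids):
            break
        medoids = updated
    return medoids, np.argmin(D[:, medoids], axis=1)

def silhouette(D, labels):
    """Mean Silhouette Coefficient; singleton clusters score 0."""
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False
        others = [D[i, labels == c].mean() for c in clusters if c != labels[i]]
        if own.any() and others:
            a, b = D[i, own].mean(), min(others)
            if max(a, b) > 0:
                scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Hypothetical toy "web pages": three about football, three about software.
pages = [
    "football match goal team",
    "team wins football league",
    "goal scored football player",
    "python code software bug",
    "software developer writes python code",
    "bug fix code release",
]

Z = lsa(tfidf(pages), n_concepts=2)
D = pairwise_euclidean(Z)

results = {}
for k in (2, 3):
    medoids, labels = k_medoids(D, k)
    # SSE to the assigned medoid: the quantity plotted in an elbow graph.
    sse = float((D[np.arange(len(pages)), medoids[labels]] ** 2).sum())
    results[k] = (labels, sse, silhouette(D, labels))
    print(k, labels.tolist(), round(sse, 3), round(results[k][2], 3))
```

Scanning several values of k and comparing the SSE curve with the silhouette score mirrors the two-step selection the abstract describes: the elbow narrows the candidate range, and the evaluation metric picks within it. On a real corpus, library implementations of TF-IDF, truncated SVD, and K-Medoids would normally replace these hand-written helpers.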
