首页>
外国专利>
Techniques for clustering structurally similar webpages based on page features
Techniques for clustering structurally similar webpages based on page features
展开▼
机译:基于页面特征对结构相似的网页进行聚类的技术
展开▼
页面导航
摘要
著录项
相似文献
摘要
Web page clustering techniques described herein are URL Clustering and Page Clustering, whereby clustering algorithms cluster together pages that are structurally similar. Regarding URL clustering, because similarly structured pages have similar patterns in their URLs, grouping similar URL patterns will group structurally similar pages. Embodiments of URL clustering may involve: (a) URL normalization and (b) URL variation computation. Regarding page clustering, page feature-based techniques further cluster any given set of homogenous clusters, reducing the number of clusters based on the underlying page code. Embodiments of page clustering may reduce the number of clusters based on the tag probabilities and the tag sequence, utilizing an Approximate Nearest Neighborhood (ANN) graph along with evaluation of intra-cluster and inter-cluster compactness.
展开▼