
Geodesic distances for web document clustering

Abstract

While traditional distance measures can often describe similarity between objects properly, in some application areas these measures can still be fine-tuned with additional information provided in the data sets. In this work we combine traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure thus combines the cosine distance derived from the term-document matrix with curvature values in the geodesic distance formula. Because clustering coefficient values are only rough estimates of the curvatures, we calculate a clustering coefficient for every document from the link graph of the data set and then increase the distinctiveness of these values by means of a heuristic. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set, using both the traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the category information provided for each article.
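
The three standard building blocks named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the link graph is assumed to be undirected and stored as a dictionary of neighbour sets, and micro-precision is assumed to mean the fraction of documents that belong to the majority category of their cluster. The geodesic distance formula and the distinctiveness heuristic are not given in the abstract, so they are deliberately left out.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an empty document is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `graph` maps each node to the set of its neighbours (assumed layout).
    """
    neighbours = graph[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0
    # Count each linked pair of neighbours once (a < b).
    links = sum(
        1 for a in neighbours for b in neighbours
        if a < b and b in graph[a]
    )
    return 2.0 * links / (k * (k - 1))


def micro_precision(cluster_labels, true_categories):
    """Fraction of items that fall in the majority category of their cluster.

    Assumed definition; the abstract does not spell it out.
    """
    per_cluster = {}
    for cluster, category in zip(cluster_labels, true_categories):
        per_cluster.setdefault(cluster, Counter())[category] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(cluster_labels)


if __name__ == "__main__":
    # Toy link graph: a, b, c form a triangle; d hangs off c.
    toy_graph = {
        "a": {"b", "c"},
        "b": {"a", "c"},
        "c": {"a", "b", "d"},
        "d": {"c"},
    }
    for doc in sorted(toy_graph):
        print(doc, clustering_coefficient(toy_graph, doc))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print(micro_precision([0, 0, 0, 1, 1], ["x", "x", "y", "y", "y"]))
```

In a pipeline of the kind the abstract describes, the per-document clustering coefficients would feed the curvature estimate, and micro-precision would be computed once per k-means run for each of the two distance measures.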
