Geodesic distances for web document clustering

Abstract

While traditional distance measures are often capable of properly describing similarity between objects, in some application areas there is still potential to fine-tune these measures with additional information provided in the data sets. In this work we combine such traditional distance measures for document analysis with link information between documents to improve clustering results. In particular, we test the effectiveness of geodesic distances as similarity measures under the space assumption of spherical geometry in a 0-sphere. Our proposed distance measure is thus a combination of the cosine distance of the term-document matrix and some curvature values in the geodesic distance formula. To estimate these curvature values, we calculate clustering coefficient values for every document from the link graph of the data set and increase their distinctiveness by means of a heuristic as these clustering coefficient values are rough estimates of the curvatures. To evaluate our work, we perform clustering tests with the k-means algorithm on the English Wikipedia hyperlinked data set with both traditional cosine distance and our proposed geodesic distance. The effectiveness of our approach is measured by computing micro-precision values of the clusters based on the provided categorical information of each article.
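The abstract does not give the geodesic formula or the heuristic that sharpens the curvature estimates, so the following is only a minimal sketch of the two ingredients it names. It computes the standard local clustering coefficient of a node in an undirected link graph, and the great-circle distance between two term vectors on a sphere of a given curvature. The function names, the adjacency-set graph representation, and the idea of passing a single curvature value per document pair are assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations


def clustering_coefficient(graph, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `graph` maps each node to the set of its neighbours (an assumed layout).
    """
    neighbours = graph[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no neighbour pair can be linked
    linked_pairs = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * linked_pairs / (k * (k - 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def geodesic_distance(a, b, curvature=1.0):
    """Great-circle distance on a sphere with Gaussian curvature `curvature`.

    The sphere has radius 1 / sqrt(curvature). With curvature 1 this is the
    plain angular distance arccos(cosine similarity). How a curvature value
    is derived for a pair of documents is NOT specified in the abstract.
    """
    if curvature <= 0:
        raise ValueError("curvature must be positive for spherical geometry")
    cos_sim = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.acos(cos_sim) / math.sqrt(curvature)


# Hypothetical toy link graph: a, b, c form a triangle; d hangs off c.
links = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
coefficients = {doc: clustering_coefficient(links, doc) for doc in links}
```

In this sketch a smaller curvature value stretches the distance between the same pair of vectors, which is one way link structure could modulate a purely term-based measure.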

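The evaluation measure named in the abstract, micro-precision, can be illustrated with a short sketch. It assumes the common definition: each cluster is credited with the documents belonging to its majority category, and the credited total is divided by the number of documents. The function name and the label lists below are hypothetical, and the abstract does not confirm that this exact variant was used.

```python
from collections import Counter


def micro_precision(cluster_labels, true_categories):
    """Fraction of documents that belong to the majority category of their cluster."""
    if len(cluster_labels) != len(true_categories):
        raise ValueError("both label lists must have the same length")
    if not cluster_labels:
        raise ValueError("at least one document is required")

    # Count the categories that occur inside each cluster.
    by_cluster = {}
    for cluster, category in zip(cluster_labels, true_categories):
        by_cluster.setdefault(cluster, Counter())[category] += 1

    # Each cluster contributes the size of its largest category group.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(cluster_labels)


# Hypothetical example: five articles assigned to two clusters.
clusters = [0, 0, 0, 1, 1]
categories = ["science", "science", "art", "art", "art"]
score = micro_precision(clusters, categories)
```

Running the same function on the cluster assignments produced with cosine distance and with the geodesic distance would give two directly comparable scores.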

Bibliographic details

Source
2011 IEEE Symposium on Computational Intelligence and Data Mining | 2011 | pp. 15-21 | 7 pages
Authors
Tekir Selma; Mansmann Florian; Keim Daniel
Original format
PDF
Chinese Library Classification
Artificial intelligence theory

Similar literature

1. WEBCAP: Web Scheduler for Distance Learning Multimedia Documents with Web Workload Considerations [J]. Sami Habib, Maytham Safar. International journal of distance education technologies. 2008, No. 4

2. WEB DOCUMENT CLUSTERING THROUGH METAFILE GENERATION FOR DIGRAPH STRUCTURE USING DOCUMENT INDEX GRAPH [J]. BUDI, SRI NURDIATI, BIB PARUHUM SILALAHI. Journal of Theoretical and Applied Information Technology. 2014, No. 1

3. Geodesic distances for web document clustering [C]. Tekir Selma, Mansmann Florian, Keim Daniel. IEEE Symposium on Computational Intelligence and Data Mining. 2011

4. Clustering Web documents: A phrase-based method for grouping search engine results [D]. Zamir, Oren Eli. 1999

5. On the Geodesic Distance in Shapes K-means Clustering [O]. Stefano Antonio Gattone, Angela De Sanctis, Stéphane Puechmorel. 2018

6. Geodesic distances for web document clustering [O]. Tekir Selma, Mansmann Florian, Keim Daniel. 2011
