首页> 外文期刊>IEEE Transactions on Knowledge and Data Engineering >Practical Algorithms and Lower Bounds for Similarity Search in Massive Graphs
【24h】

Practical Algorithms and Lower Bounds for Similarity Search in Massive Graphs

机译:大规模图相似搜索的实用算法和下界

获取原文
获取原文并翻译 | 示例

摘要

To exploit the similarity information hidden in the hyperlink structure of the Web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multistep neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [1], a recursive refinement of cocitation, and PSimRank, a novel variant with better theoretical characteristics. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. We justify our approximation method by asymptotic worst-case lower bounds: We show that there is a significant gap between exact and approximate approaches, and suggest that the exact computation, in general, is infeasible for large-scale inputs. We were the first to evaluate SimRank on real Web data. On the Stanford WebBase [2] graph of 80M pages the quality of the methods increased significantly in each refinement step until step four.
机译:为了利用隐藏在Web的超链接结构中的相似性信息,本文介绍了可扩展到分布式体系结构上具有数十亿个顶点的图的算法。顶点的多步邻域的相似性通过相似性函数进行数值评估,这些函数包括SimRank [1](对递归的精细化)和PSimRank(具有更好的理论特征的新颖变体)。我们的方法是在蒙特卡洛相似性搜索算法的通用框架中提出的,该算法预先计算了随机指纹的索引数据库,并且在查询时,会从指纹中估算出相似性。我们用渐近最坏情况下界证明我们的近似方法是正确的:我们证明了精确方法与近似方法之间存在显着的差距,并建议一般而言,精确计算对于大规模输入而言是不可行的。我们是第一个在真实Web数据上评估SimRank的人。在拥有8000万页的Stanford WebBase [2]图上,方法的质量在每个细化步骤中都显着提高,直到第四步为止。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号