Venue: IEEE International Conference on Data Mining Workshops

Maximally Consistent Sampling and the Jaccard Index of Probability Distributions



Abstract

We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with equivalent running times to the state of the art for both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and is the natural generalization of the Jaccard index.
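
The abstract does not spell out the algorithms. As a minimal sketch only, one well-known way to build a sampling-based hash of a positive vector is an exponential race: each key draws a pseudo-random arrival time with rate equal to its weight, using randomness shared across all inputs, and the earliest key is the hash. The function names (`p_minhash`, `jaccard_p`), the BLAKE2 seeding scheme, and the closed-form collision probability below are assumptions made for illustration, not details quoted from the text above.

```python
import hashlib
import math


def _uniform(seed: int, key: str) -> float:
    """Deterministic pseudo-uniform value strictly inside (0, 1).

    The value depends only on (seed, key), so every input vector sees the
    same randomness for a given key. This sharing is what makes the
    samples consistent across inputs.
    """
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    k = int.from_bytes(digest, "big") >> 12  # keep 52 bits
    return (k + 0.5) / 2**52


def p_minhash(weights: dict[str, float], seed: int) -> str:
    """Return the key that wins an exponential race with rates `weights`.

    Hypothetical helper: non-positive weights are skipped, and the result
    does not change if all weights are multiplied by a positive constant.
    """
    best_key, best_time = None, math.inf
    for key, w in weights.items():
        if w <= 0:
            continue
        arrival = -math.log(_uniform(seed, key)) / w  # Exp(w) variate
        if arrival < best_time:
            best_key, best_time = key, arrival
    return best_key


def jaccard_p(x: dict[str, float], y: dict[str, float]) -> float:
    """Collision probability of the race above (assumed closed form):
    sum over shared keys i of 1 / sum_j max(x_j / x_i, y_j / y_i)."""
    keys = x.keys() | y.keys()
    total = 0.0
    for i in x.keys() & y.keys():
        if x[i] <= 0 or y[i] <= 0:
            continue
        denom = sum(max(x.get(j, 0.0) / x[i], y.get(j, 0.0) / y[i]) for j in keys)
        total += 1.0 / denom
    return total
```

For two distributions that share half their mass on one key, such as `{"a": 0.5, "b": 0.5}` and `{"a": 0.5, "c": 0.5}`, the hashes collide exactly when `"a"` wins a three-way race between equal rates, so the fraction of seeds on which `p_minhash` agrees should be close to `jaccard_p`, which is 1/3 here.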
