首页> 美国政府科技报告 >Density Biased Sampling: An Improved Method for Data Mining and Clustering
【24h】

Density Biased Sampling: An Improved Method for Data Mining and Clustering

机译:密度偏差抽样:一种改进的数据挖掘和聚类方法

获取原文

摘要

Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and also frequently criticized because it will miss small clusters. Many natural phenomena are known to follow Zipf's distribution and the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions. Our proposed method scales linearly and out performs uniform samples when clustering realistic data sets.

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号