
Parallelizing clustering of geoscientific data sets using data streams


Abstract

Computing data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators that adapt to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can fit into RAM and performs a weighted k-means on that data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that, as data density per grid cell increases, the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality.
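
The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `weighted_kmeans` and `partial_merge_kmeans`, the random initialization, the fixed iteration cap, and the use of in-memory NumPy arrays in place of real data stream operators are all assumptions made for illustration. Each chunk stands for the portion of the stream that fits into RAM; the partial step reduces it to k weighted centroids, and the merge step clusters those centroids using their weights.

```python
import numpy as np


def weighted_kmeans(points, weights, k, n_iter=50, seed=0):
    """Lloyd-style k-means in which every point carries a weight.

    Returns (centroids, cluster_weights). Assumes len(points) >= k.
    """
    rng = np.random.default_rng(seed)
    # Assumption: initialize with k distinct input points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)

    def assign(c):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = centroids.copy()
        for j in range(k):
            mask = labels == j
            if weights[mask].sum() > 0:
                # Weighted mean of the points assigned to cluster j.
                updated[j] = np.average(points[mask], axis=0, weights=weights[mask])
        if np.allclose(updated, centroids):
            break
        centroids = updated

    labels = assign(centroids)
    cluster_weights = np.bincount(labels, weights=weights, minlength=k)
    return centroids, cluster_weights


def partial_merge_kmeans(chunks, k, seed=0):
    """Partial step per chunk, then one merge step over the weighted results."""
    partial_centroids, partial_weights = [], []
    for i, chunk in enumerate(chunks):
        chunk = np.asarray(chunk, dtype=float)
        # Partial step: raw points all have weight 1.
        c, w = weighted_kmeans(chunk, np.ones(len(chunk)), k, seed=seed + i)
        keep = w > 0  # drop clusters that ended up empty
        partial_centroids.append(c[keep])
        partial_weights.append(w[keep])

    # Merge step: cluster the partial centroids, weighted by how many
    # original points each one represents.
    merged_points = np.vstack(partial_centroids)
    merged_weights = np.concatenate(partial_weights)
    return weighted_kmeans(merged_points, merged_weights, k, seed=seed)
```

Because the partial steps are independent of each other, they could be run in separate processes or on separate machines, with only the small set of weighted centroids passed to the merge step. This sketch runs them sequentially for simplicity.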
