Conference: IEEE International Conference on Data Mining Workshops
Scalable Online-Offline Stream Clustering in Apache Spark

Abstract

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches to bring a competitive stream clustering algorithm, CluStream, into a modern framework for distributed computing, Apache Spark. CluStream is one of the most popular stream clustering approaches and the one that introduced the online-offline mining process: the online phase condenses the stream into statistical summaries, and the offline phase generates the final clusters from these summaries. The result is a scalable, open-source stream clustering method that can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
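The online-offline split described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: it is plain Python, it assumes a fixed absorption radius, and it omits the timestamps and snapshot handling of the original CluStream. All names (`MicroCluster`, `online_update`, `offline_cluster`, `max_summaries`, `radius`) are hypothetical and chosen only for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class MicroCluster:
    """Additive statistical summary: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        # Summaries are additive, so merging is element-wise addition.
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def deviation(self):
        # RMS deviation from the centroid, derived from the sums alone.
        var = sum(q / self.n - (s / self.n) ** 2
                  for s, q in zip(self.ls, self.ss))
        return math.sqrt(max(0.0, var))


def online_update(summaries, point, max_summaries=10, radius=1.0):
    """Online phase: absorb the point or open a new summary.

    Assumption: a fixed radius decides absorption, for simplicity.
    """
    if summaries:
        nearest = min(summaries, key=lambda m: dist(m.centroid(), point))
        if dist(nearest.centroid(), point) <= radius:
            nearest.add(point)
            return
    summaries.append(MicroCluster(point))
    if len(summaries) > max_summaries:
        # Over budget: merge the two closest summaries.
        pairs = [(i, j) for i in range(len(summaries))
                 for j in range(i + 1, len(summaries))]
        i, j = min(pairs, key=lambda p: dist(summaries[p[0]].centroid(),
                                             summaries[p[1]].centroid()))
        summaries[i].merge(summaries[j])
        del summaries[j]


def offline_cluster(summaries, k, iterations=20, seed=0):
    """Offline phase: weighted k-means over the summary centroids."""
    rng = random.Random(seed)
    points = [(m.centroid(), m.n) for m in summaries]
    centers = [c for c, _ in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for c, w in points:
            idx = min(range(k), key=lambda i: dist(centers[i], c))
            groups[idx].append((c, w))
        for i, g in enumerate(groups):
            if g:  # keep the old center if no summary was assigned
                total = sum(w for _, w in g)
                centers[i] = [sum(c[d] * w for c, w in g) / total
                              for d in range(len(g[0][0]))]
    return centers


# Usage: a synthetic stream drawn from two well-separated blobs.
rng = random.Random(42)
summaries = []
for _ in range(500):
    cx = rng.choice([0.0, 10.0])
    online_update(summaries, [rng.gauss(cx, 0.3), rng.gauss(cx, 0.3)])
centers = offline_cluster(summaries, k=2)
```

Because each summary is just a set of sums, the offline step only touches a handful of summaries rather than the full stream. Their additivity also means summaries built on separate partitions could in principle be merged, although how the paper distributes this work in Spark is not stated in the abstract.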