Journal: IEEE Transactions on Systems, Man, and Cybernetics
Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark

Abstract

Mining massive, high-speed data streams is among the main contemporary challenges in machine learning. This calls for methods that are highly computationally efficient, can continuously update their structure, and can handle an ever-arriving, large number of instances. In this paper, we present a new incremental and distributed classifier based on the popular nearest neighbor algorithm, adapted to such a demanding scenario. This method, implemented in Apache Spark, includes a distributed metric-space ordering to perform faster searches. Additionally, we propose an efficient incremental instance selection method for massive data streams that continuously updates the case-base and removes outdated examples from it. This alleviates the high computational requirements of the original classifier, thus making it suitable for the considered problem. An experimental study conducted on a set of real-life massive data streams demonstrates the usefulness of the proposed solution and shows that we are able to provide the first efficient nearest neighbor solution for high-speed big and streaming data.
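
To make the idea of an incrementally updated case-base concrete, the following is a minimal single-machine sketch in Python. It is not the paper's method: the class name `SlidingWindowKNN`, its parameters `k` and `max_cases`, and the oldest-first eviction rule are all assumptions chosen for illustration. It uses a brute-force search in place of the distributed metric-space ordering, and a fixed-size window in place of the proposed instance selection.

```python
from collections import Counter, deque
import math


class SlidingWindowKNN:
    """Toy streaming k-NN classifier (hypothetical, for illustration only).

    Assumption: "outdated" simply means "oldest", so the case-base is a
    fixed-size window. The paper's instance selection is more selective.
    """

    def __init__(self, k=3, max_cases=1000):
        self.k = k
        # A deque with maxlen silently drops its oldest entry when full.
        self.cases = deque(maxlen=max_cases)

    def update(self, features, label):
        """Add one labeled instance from the stream to the case-base."""
        self.cases.append((tuple(features), label))

    def predict(self, features):
        """Majority vote among the k nearest stored cases (Euclidean)."""
        if not self.cases:
            return None
        # Brute-force scan: O(n log n) per query, with no index structure.
        nearest = sorted(
            self.cases, key=lambda case: math.dist(case[0], features)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


model = SlidingWindowKNN(k=1, max_cases=2)
model.update((0.0, 0.0), "a")
model.update((10.0, 10.0), "b")
print(model.predict((1.0, 1.0)))  # nearest stored case is (0, 0)

model.update((10.0, 11.0), "b")   # window is full, so the "a" case is evicted
print(model.predict((1.0, 1.0)))  # only "b" cases remain in the case-base
```

Each query in this sketch scans the whole window, so its cost grows with the case-base size. That cost is the motivation the abstract gives for pairing instance selection with a faster, ordered search.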