Journal: Future Generation Computer Systems

Distributed nearest neighbor classification for large-scale multi-label data on Spark


Abstract

Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification and one of its most renowned methods is the multi-label k nearest neighbor (ML-KNN). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed ML-KNN implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables ML-KNN to scale efficiently with respect to the size of the problem. (C) 2018 Elsevier B.V. All rights reserved.
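
To make the method named in the abstract concrete, the following is a minimal single-machine sketch of the standard ML-KNN decision rule (label priors plus neighbor-count likelihoods, combined by a maximum a posteriori test). It is not the authors' distributed Spark implementation: the class name `MLkNN`, the helper `k_nearest`, the smoothing parameter `smooth`, and the toy data are all illustrative assumptions, and the brute-force neighbor search merely stands in for the step that the three strategies in the abstract (broadcasting, tree-based index, hash tables) are meant to accelerate.

```python
from math import dist


def k_nearest(query, points, k, skip=None):
    """Indices of the k points closest to `query` (brute force, Euclidean)."""
    order = sorted(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: dist(query, points[i]),
    )
    return order[:k]


class MLkNN:
    """Hypothetical minimal ML-KNN; X holds feature tuples, Y holds 0/1 label rows."""

    def __init__(self, k=3, smooth=1.0):
        self.k = k
        self.smooth = smooth

    def fit(self, X, Y):
        n, q, k, s = len(X), len(Y[0]), self.k, self.smooth
        self.X, self.Y, self.q = X, Y, q
        # Smoothed prior probability that each label is present.
        self.prior = [(s + sum(y[l] for y in Y)) / (2 * s + n) for l in range(q)]
        # c1[l][j]: training instances WITH label l that have exactly j
        # neighbors carrying l; c0[l][j]: the same for instances WITHOUT l.
        c1 = [[0] * (k + 1) for _ in range(q)]
        c0 = [[0] * (k + 1) for _ in range(q)]
        for i in range(n):
            nbrs = k_nearest(X[i], X, k, skip=i)  # leave the instance itself out
            for l in range(q):
                j = sum(Y[m][l] for m in nbrs)
                (c1 if Y[i][l] else c0)[l][j] += 1
        # Smoothed likelihoods of observing j label-carrying neighbors.
        self.like1 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c1]
        self.like0 = [[(s + c) / (s * (k + 1) + sum(row)) for c in row] for row in c0]
        return self

    def predict(self, x):
        nbrs = k_nearest(x, self.X, self.k)
        labels = []
        for l in range(self.q):
            j = sum(self.Y[m][l] for m in nbrs)
            p1 = self.prior[l] * self.like1[l][j]        # label present
            p0 = (1 - self.prior[l]) * self.like0[l][j]  # label absent
            labels.append(1 if p1 > p0 else 0)
        return labels


# Toy data (made up for illustration): two well-separated clusters,
# each associated with one of two labels.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
Y = [[1, 0]] * 4 + [[0, 1]] * 4
model = MLkNN(k=3).fit(X, Y)
```

Calling `model.predict((0.5, 0.5))` on this toy data assigns only the first label, and `model.predict((10.5, 10.5))` only the second. In this sketch, training compares every instance with every other instance, so the neighbor search grows quadratically with the number of instances; that search is the step a distributed index or hashing scheme would replace.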
