Distributed nearest neighbor classification for large-scale multi-label data on spark

Gonzalez-Lopez Jorge; Ventura Sebastian; Cano Alberto


Abstract

Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification and one of its most renowned methods is the multi-label k nearest neighbor (ML-KNN). The traditional implementations of this method are not feasible for large-scale multi-label data due to its complexity and memory restrictions. We propose a distributed ML-KNN implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between the quality of the predictions and runtimes on 22 benchmark datasets, and compares the scalability using different sizes of data. The results indicate that the tree-based index strategy outperforms the other approaches, having a speedup of up to 266x for the largest dataset, while achieving an accuracy equivalent to the exact methods. This strategy enables ML-KNN to scale efficiently with respect to the size of the problem. (C) 2018 Elsevier B.V. All rights reserved.
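For orientation, the following is a minimal single-machine sketch of ML-KNN in its commonly described form: smoothed label priors, plus likelihoods estimated from how many of the k nearest neighbors carry each label. It is not the authors' distributed Spark implementation. The brute-force search in `knn_indices` merely stands in for the three search strategies compared in the abstract. The function names, the smoothing parameter `s`, and the toy data are all assumptions made for illustration.

```python
from math import dist

def knn_indices(X, query, k, skip=None):
    """Brute-force search: indices of the k training points closest to query."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: dist(X[i], query))
    return order[:k]

def fit_mlknn(X, Y, k, s=1.0):
    """Estimate the smoothed priors and neighbor-count tables used by ML-KNN."""
    n, q = len(X), len(Y[0])
    # Smoothed prior probability that each label is present.
    prior = [(s + sum(y[l] for y in Y)) / (2 * s + n) for l in range(q)]
    # c1[l][c]: training instances WITH label l whose k neighbors contain
    # exactly c instances carrying label l; c0 is the same for instances
    # WITHOUT label l.
    c1 = [[0] * (k + 1) for _ in range(q)]
    c0 = [[0] * (k + 1) for _ in range(q)]
    for i in range(n):
        nbrs = knn_indices(X, X[i], k, skip=i)  # leave the instance itself out
        for l in range(q):
            c = sum(Y[j][l] for j in nbrs)
            (c1 if Y[i][l] else c0)[l][c] += 1
    return {"X": X, "Y": Y, "k": k, "s": s, "prior": prior, "c1": c1, "c0": c0}

def predict_mlknn(model, x):
    """Return a 0/1 vector: label l is predicted when its posterior favors presence."""
    k, s = model["k"], model["s"]
    nbrs = knn_indices(model["X"], x, k)
    labels = []
    for l, p1 in enumerate(model["prior"]):
        c = sum(model["Y"][j][l] for j in nbrs)
        like1 = (s + model["c1"][l][c]) / (s * (k + 1) + sum(model["c1"][l]))
        like0 = (s + model["c0"][l][c]) / (s * (k + 1) + sum(model["c0"][l]))
        labels.append(1 if p1 * like1 > (1 - p1) * like0 else 0)
    return labels

# Toy data (hypothetical): two well-separated clusters, one label each.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
Y = [[1, 0]] * 4 + [[0, 1]] * 4
model = fit_mlknn(X, Y, k=3)
print(predict_mlknn(model, (0.5, 0.5)))    # [1, 0]
print(predict_mlknn(model, (10.5, 10.5)))  # [0, 1]
```

Every call to `knn_indices` scans the whole training set, so training cost grows quadratically with the number of instances. That cost is what makes the neighbor search the natural step to distribute or to speed up with an index.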


Bibliographic details

Source
Future Generation Computer Systems | 2018, Issue 10 | pp. 66-82 | 17 pages
Authors
Gonzalez-Lopez Jorge; Ventura Sebastian; Cano Alberto
Author affiliations

Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA;

Univ Cordoba, Dept Comp Sci & Numer Anal, Cordoba, Spain|King Abdulaziz Univ, Comp & Informat Technol, Jeddah, Saudi Arabia|Maimonides Biomed Res Inst Cordoba, Cordoba, Spain;

Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA;

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Apache Spark; MapReduce; Distributed computing; Big data; Multi-label classification; Nearest neighbors


Similar literature

1. Editing training data for multi-label classification with the k-nearest neighbor rule [J]. Kanj Sawsan, Abdallah Fahed, Denoeux Thierry. Pattern Analysis and Applications, 2016, Issue 1.
2. Effectiveness of k-means clustering to distribute training data and testing data on k-nearest neighbor classification [J]. Mustakim. Journal of Theoretical and Applied Information Technology, 2017, Issue 21.
3. Nearest Neighbor Classification for High-Speed Big Data Streams Using Spark [J]. Sergio Ramírez-Gallego, Bartosz Krawczyk, Salvador García. IEEE Transactions on Systems, Man, and Cybernetics, 2017, Issue 10.
4. Tissue Classification of Large-scale Multi-site MR Data Using Fuzzy k-Nearest Neighbor Method [C]. Ali Ghayoor, Jane S. Paulsen, Regina E. Y. Kim. Conference on imaging processing, 2016.
5. Zero-day Attack Identification in Streaming Data: Nearest Neighbor Heuristics and Dynamic Semantic Network Generation in the Spark Eco-system [D]. Pallaprolu, Sai Chaithanya. 2017.
6. A Sensor Data Fusion System Based on k-Nearest Neighbor Pattern Classification for Structural Health Monitoring Applications [O]. Jaime Vitola, Francesc Pozo, Diego A. Tibaduiza. 2017.
7. Editing training data for multi-label classification with the k-nearest neighbor rule [O]. Kanj, Sawsan, Abdallah, Fahed, Denoeux, Thierry. 2016.
