
Large-scale multi-label ensemble learning on Spark

Abstract

Multi-label learning is a challenging problem that has received growing attention from the research community in recent years. Hence, there is a growing demand for effective and scalable multi-label learning methods for larger datasets, both in terms of the number of instances and the number of output labels. The use of ensemble classifiers is a popular approach for improving multi-label model accuracy, especially for datasets with high-dimensional label spaces. However, the increasing computational complexity of the algorithms in such ever-growing high-dimensional label spaces requires new approaches to manage data effectively and efficiently in distributed computing environments. Spark is a framework based on MapReduce, a distributed programming model that offers a robust paradigm for handling large-scale datasets on a cluster of nodes. This paper focuses on multi-label ensembles and proposes a number of implementations using parallel and distributed computing with Spark. Five different implementations are proposed, and their impact on the performance of the ensemble is analyzed. The experimental study shows the benefits of the distributed implementations over traditional single-node, single-thread execution, in terms of performance across multiple evaluation metrics as well as significant speedups, measured on 29 benchmark datasets.
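The abstract's central idea, that ensemble members are trained independently and can therefore be mapped onto parallel tasks, can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' Spark implementation: it uses plain Python with a thread pool standing in for Spark's distributed map, and a deliberately trivial per-label majority-class base learner in place of a real multi-label classifier.

```python
# Hypothetical sketch of the parallel multi-label ensemble pattern:
# each member trains independently on a bootstrap sample (so members
# map naturally onto cluster tasks), and predictions are combined by
# majority vote per label. Not the paper's Spark code.
import random
from concurrent.futures import ThreadPoolExecutor

def train_member(data, seed):
    # One ensemble member: bootstrap-sample the data, then learn, for
    # each label, a majority-class prediction (a trivial base learner).
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in data]
    n_labels = len(sample[0][1])
    votes = [sum(y[j] for _, y in sample) for j in range(n_labels)]
    return [1 if v * 2 >= len(sample) else 0 for v in votes]

def predict(members, n_labels):
    # Majority vote across members, independently for each label.
    return [1 if sum(m[j] for m in members) * 2 >= len(members) else 0
            for j in range(n_labels)]

# Toy multi-label dataset: (features, label-vector) pairs.
data = [([0.1], [1, 0, 1]), ([0.2], [1, 0, 0]),
        ([0.9], [1, 1, 1]), ([0.8], [0, 1, 1])]

# Train members in parallel; on Spark this map would run across nodes.
with ThreadPoolExecutor() as pool:
    members = list(pool.map(lambda s: train_member(data, s), range(5)))

prediction = predict(members, 3)
```

Because member training involves no shared state, the same structure transfers directly to a Spark `map` over member indices, which is what makes the distributed implementations in the paper attractive for high-dimensional label spaces.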
