Journal: Computing

Screening hardware and volume factors in distributed machine learning algorithms on spark: A design of experiments (DoE) based approach

Abstract

This paper presents an approach to investigate distributed machine learning workloads on Spark. The work analyzes how hardware and data-volume factors affect time and cost performance when applying machine learning (ML) techniques in big data scenarios. The method is based on the Design of Experiments (DoE) approach and applies a randomized two-level fractional factorial design with replications to screen the most relevant factors. A Web Corpus was built from 16 million webpages from Portuguese-speaking countries. The application was a binary text classification to distinguish Brazilian Portuguese from other variations. Five different machine learning algorithms were examined: Logistic Regression, Random Forest, Support Vector Machines, Naive Bayes and Multilayer Perceptron. The data was processed using real clusters having up to 28 nodes, each composed of 12 or 32 cores, 1 or 7 SSD disks, and 3x or 6x RAM per core, for a maximum computational power of 896 cores and 5.25 TB of RAM. Linear models were applied to identify, analyze and rank the influence of factors. A total of 240 experiments were carefully organized to maximize the detection of non-confounded effects up to the second order while minimizing the experimental effort. Our results include linear models to estimate time and cost performance, statistical inferences about effects, and a visualization tool based on parallel coordinates to aid decision making about cluster configuration.
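To make the screening idea concrete, the following is a minimal sketch of a randomized, replicated two-level fractional factorial design and of ranking factors by their main effects. It is not the authors' design: the factor names, the choice of a 2^(5-1) half fraction with generator E = A·B·C·D, the two replications, and the response values are all assumptions for illustration. The abstract does not state the exact factor set, the generators, or how the 240 experiments were partitioned.

```python
import itertools
import random

# Hypothetical factor names; the abstract does not list the exact factor set.
BASE_FACTORS = ["nodes", "cores", "disks", "ram_per_core"]
# Assumed fifth factor, defined by the generator E = A*B*C*D (half fraction).
GENERATED_FACTOR = "data_volume"
ALL_FACTORS = BASE_FACTORS + [GENERATED_FACTOR]


def fractional_factorial(replications=2, seed=0):
    """Build a randomized, replicated 2^(5-1) design in coded units (-1/+1)."""
    runs = []
    for levels in itertools.product((-1, 1), repeat=len(BASE_FACTORS)):
        run = dict(zip(BASE_FACTORS, levels))
        product = 1
        for level in levels:
            product *= level
        run[GENERATED_FACTOR] = product  # level fixed by the generator
        runs.append(run)
    # Replicate every run, then randomize the execution order.
    design = [dict(run) for run in runs for _ in range(replications)]
    random.Random(seed).shuffle(design)
    return design


def main_effect(design, responses, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, responses) if run[factor] == 1]
    low = [y for run, y in zip(design, responses) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


design = fractional_factorial()

# Synthetic, noise-free response (NOT data from the paper):
# more cores shorten the run, more disks lengthen it slightly.
responses = [100 - 20 * run["cores"] + 5 * run["disks"] for run in design]

effects = {f: main_effect(design, responses, f) for f in ALL_FACTORS}
ranked = sorted(ALL_FACTORS, key=lambda f: abs(effects[f]), reverse=True)

for factor in ranked:
    print(f"{factor:>12}: {effects[factor]:+.1f}")
```

With the generator E = A·B·C·D, the half fraction has resolution V, so main effects and two-factor interactions are not confounded with one another, which matches the abstract's goal of detecting non-confounded effects up to the second order. With measured responses, the same contrasts correspond to twice the coefficients of a linear model fitted on the coded factors.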