
Studying Recommender Systems to Enhance Distributed Computing Schedulers.


Abstract

Distributed computing frameworks belong to a class of programming models that allow developers to launch workloads on large clusters of machines. Due to the dramatic increase in the volume of data gathered by ubiquitous computing devices, data-analytic workloads have become a common case among distributed computing applications, making Data Science an entire field of Computer Science. We argue that a data scientist's concerns lie in three main components: a dataset, a sequence of operations they wish to apply to this dataset, and constraints related to their work (performance, QoS, budget, etc.). However, without domain expertise it is extremely difficult to perform data science. One needs to select the right amount and type of resources, pick a framework, and configure it. Moreover, users often run their applications in shared environments, governed by schedulers that expect them to specify their resource needs precisely. Inherent to the distributed and concurrent nature of these frameworks, monitoring and profiling are hard, high-dimensional problems that prevent users from making the right configuration choices and from determining the amount of resources they need. Paradoxically, the system gathers a large amount of monitoring data at runtime, which remains unused.

In the ideal abstraction we envision for data scientists, the system is adaptive, able to exploit monitoring data to learn about workloads and to process user requests into a tailored execution context. In this work, we study techniques that have been used to take steps toward such system awareness, and we explore a new approach by applying machine learning techniques to recommend a specific subset of system configurations for Apache Spark applications. Furthermore, we present an in-depth study of Apache Spark executor configuration, which highlights the complexity of choosing the best configuration for a given workload.
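The executor configuration space the abstract refers to is exposed through a handful of `spark-submit` parameters. A minimal illustrative sketch of the main knobs (the values are placeholders, not recommendations from the thesis):

```shell
# Illustrative spark-submit invocation showing the main executor knobs:
#   --num-executors   : number of executor JVMs launched for the application
#   --executor-cores  : concurrent tasks each executor can run
#   --executor-memory : JVM heap size allocated per executor
# spark.memory.fraction controls the share of heap reserved for
# execution and storage (unified memory management).
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.memory.fraction=0.6 \
  my_app.py
```

The same total core count can be reached with many (num-executors, executor-cores) combinations, which is one source of the configuration complexity the study examines.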

Record details

  • Author

    Demoulin, Henri Maxime.

  • Affiliation

    Duke University.

  • Degree grantor Duke University.
  • Subject Computer science.
  • Degree M.S.
  • Year 2016
  • Pages 86 p.
  • Total pages 86
  • Format PDF
  • Language eng
  • CLC classification
  • Keywords

