Benchmarking Spark Machine Learning Using BigBench


Abstract

Databases such as dashDB are adding High Speed Connectors for Spark to extract large volumes of data efficiently. This allows the extracted data to be combined with other unstructured data sources, with Machine Learning (ML) performed on top of the result. Machine Learning is a key ingredient for such use cases. To assess the performance of the data connectors and machine learning frameworks, we sought benchmarks that can scale datasets to very large volumes and apply Machine Learning algorithms. After exploring several options, we found BigBench to be a good fit. In this paper, we describe our experience of using BigBench, with a special focus on its five Machine Learning queries and their default implementation in Spark. We discuss how the effectiveness of BigBench for benchmarking Machine Learning could be improved by avoiding bias and including real-time analytics. We also see scope for improving the coverage of Machine Learning by adding more use cases, such as Collaborative Filtering. Lastly, we share some visualizations of four ML queries produced with SPSS Modeler, along with our experiments on different Clustering and Classification algorithms.