International Conference on Computer Science and Engineering

Data Mining Library for Big Data Processing Platforms: A Case Study-Sparkling Water Platform


Abstract

Very large volumes of data are now collected from millions of websites, applications, social media resources, surveys, video surveillance platforms, and many other sources. Processing the large datasets generated every day can yield useful information, and handling data at this scale requires distributed data processing platforms. Big data processing and analytics platforms such as Hadoop and Spark have machine learning libraries that run in a distributed manner and exploit the advantages of distributed computing. For example, the Mahout library uses the Hadoop platform, while the Spark MLlib library uses the Spark platform. However, these platforms appear to offer either no implementation of the algorithms involved in the data mining steps, or implementations for only some of those steps. In this research, algorithms from different data mining steps are implemented on a big data platform and their performance is evaluated. The Sparkling Water platform was chosen as the big data processing platform for the case study, and a banking data set was used to test the implemented data mining algorithms. A software layer containing all data mining steps was developed on the Sparkling Water platform, and its performance was evaluated. The evaluation showed that distributed data processing delivered a performance improvement.
