Research and Application of a Spark-Based Hierarchical Clustering Algorithm

刘卫华; 史婷婷; 许学添

Abstract

In an era of rapid informatization, data are generated in enormous volumes. Unless they are properly organized and classified, they cannot be searched and used quickly, conveniently, and accurately. As information security science and technology develops, the demand for organizing and classifying these data keeps growing, yet traditional clustering algorithms can no longer meet current data-processing needs. Optimizing existing algorithms or designing new ones has therefore become urgent. At the same time, the hardware of a single computer cannot meet the demands of processing and classifying massive data. To address these problems, this paper optimizes and improves the clustering algorithm on the basis of the Spark distributed computing framework. Apache Spark, a big data processing framework, extends the range of usable computing models and provides a framework for parallel computation in memory. By caching intermediate results in memory, it reduces repeated disk I/O operations and thus better serves iterative computation, interactive queries, and other workloads. Optimizing the clustering algorithm improves the computational efficiency of data analysis, processing, and classification, which is the significance of this study.
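The abstract does not give the details of the optimized algorithm, so the following is only a minimal single-machine sketch of the general idea it describes: an iterative hierarchical (agglomerative) clustering loop that computes an intermediate result once, keeps it in memory, and reuses it in every iteration. The function name `agglomerative`, the choice of single linkage, and the sample points are all assumptions made for illustration and are not taken from the paper. In Spark, the analogous step would be persisting the intermediate dataset in memory (for example with `cache()`) rather than holding a Python dictionary.

```python
from itertools import combinations
from math import dist


def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k clusters remain.

    Illustrative sketch only; not the algorithm from the paper.
    """
    # Pairwise distances are computed once and kept in memory, so every
    # merge iteration reuses them instead of recomputing from the raw data.
    cache = {(i, j): dist(points[i], points[j])
             for i, j in combinations(range(len(points)), 2)}
    # Start with one cluster per point, stored as lists of point indices.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: smallest cached distance between any two members.
        return min(cache[(min(i, j), max(i, j))] for i in a for j in b)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
result = agglomerative(points, 3)
print(result)
```

The sketch keeps all distances on one machine, which is exactly the limitation the abstract points to for massive data; a distributed version would partition this work across a cluster.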
