Conference: International Conference on Big Data Artificial Intelligence Software Engineering

Research on Mass News Classification Algorithm Based on Spark

Abstract

The number of news articles published online has grown explosively in recent years, making fast and efficient classification of large news collections an increasingly important problem. This paper studies a Spark-based algorithm for classifying news at that scale. First, a large corpus of news text is segmented with the Jieba segmentation tool, and several stop-word lists are merged to remove stop words. Second, building on the traditional convolutional neural network (CNN), the paper proposes a news classification algorithm that combines pre-trained Word2vec with an improved CNN. Third, the proposed algorithm is parallelized on Spark to speed up the classification of large news collections. The algorithm is evaluated in comparative experiments on standard data sets. The results show that, compared with the traditional algorithm, it achieves clear improvements on several evaluation metrics, including accuracy, recall, and F1. The Spark-parallel version is also significantly faster than the serial one.
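The preprocessing step described in the abstract (segment the text, then filter it against a merged stop-word list) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names `merge_stop_words` and `preprocess` and the sample word lists are hypothetical, and whitespace splitting stands in for Jieba so that the sketch runs without extra dependencies.

```python
from typing import Callable, Iterable, List, Set


def merge_stop_words(*stop_lists: Iterable[str]) -> Set[str]:
    """Union several stop-word lists, skipping blank entries."""
    merged: Set[str] = set()
    for words in stop_lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged


def preprocess(text: str, stop_words: Set[str],
               tokenize: Callable[[str], List[str]] = str.split) -> List[str]:
    """Segment the text, then drop stop words and whitespace-only tokens."""
    return [tok for tok in tokenize(text)
            if tok.strip() and tok not in stop_words]


# Hypothetical lists standing in for the "several" stop-word lists in the abstract.
list_a = ["的", "了", "the"]
list_b = ["了", "在", "of", ""]
stop_words = merge_stop_words(list_a, list_b)

# Assumption: whitespace tokenization is used here only for illustration.
# With Jieba installed, pass tokenize=jieba.lcut to segment Chinese text.
tokens = preprocess("the price of oil rose sharply", stop_words)
print(tokens)
```

Keeping the tokenizer as a parameter means the same filtering logic applies whether the input is segmented by Jieba or by another tool.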