Conference: International Conference on Cloud Computing; World Congress on Services

A Parallel Algorithm for Bayesian Text Classification Based on Noise Elimination and Dimension Reduction in Spark Computing Environment

Abstract

The Naive Bayesian algorithm is one of the ten classical algorithms in data mining and is widely used as the basis for text classification. With the rapid development of the Internet and information systems, huge amounts of data are being produced all the time. Problems are certain to arise when the traditional Bayesian classification algorithm handles massive amounts of data, especially without a parallel computing framework. This paper proposes INBCS, an improved Bayesian algorithm for text classification in the Spark computing environment, which improves the Naive Bayesian algorithm based on a polynomial model. For data preprocessing, the paper first proposes a parallel noise elimination algorithm and then a parallel dimension reduction algorithm based on Information Gain and TextRank computation in the Spark environment. Based on the preprocessed data, an improved parallel method is proposed for calculating the conditional probability, which comprehensively considers the effects of the feature items in each document, class, and training set. Finally, experiments on several widely used corpora on the Spark computation platform show that INBCS achieves higher accuracy and efficiency than some current improvements and implementations of the Naive Bayesian algorithm in the Spark ML library.
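
The abstract names two standard building blocks: Information Gain for dimension reduction and a Naive Bayes conditional probability. As a rough illustration only, the following single-machine Python sketch ranks terms by textbook Information Gain and then trains a multinomial Naive Bayes model with add-one (Laplace) smoothing. It is not the INBCS algorithm: it omits the noise elimination step, TextRank, the paper's improved conditional probability, and all Spark parallelism. It also assumes that the "polynomial model" corresponds to the usual multinomial model. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), using document presence."""
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    gain = entropy(list(Counter(labels).values()))
    for part in (with_t, without_t):
        size = sum(part.values())
        if size:
            gain -= (size / len(docs)) * entropy(list(part.values()))
    return gain

def select_features(docs, labels, k):
    """Keep the k terms with the highest Information Gain (ties broken alphabetically)."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
    return set(ranked[:k])

def train_multinomial_nb(docs, labels, features):
    """Return log priors and Laplace-smoothed log P(term | class) over the kept features."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, l in zip(docs, labels):
        term_counts[l].update(t for t in d if t in features)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(features)))
            for t in features
        }
    return log_prior, log_cond

def predict(doc, log_prior, log_cond):
    """Pick the class with the highest log posterior; terms outside the features are ignored."""
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in doc if t in log_cond[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

features = select_features(docs, labels, k=4)
log_prior, log_cond = train_multinomial_nb(docs, labels, features)
print(sorted(features))
print(predict(["team", "goal", "price"], log_prior, log_cond))
```

In a Spark setting, the per-term and per-class counts would typically be computed with distributed aggregations rather than in-memory loops; how INBCS does this is not described in the abstract.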
