International Conference on Electrical and Electronics Engineering (conference paper)

Performance Comparison of Naïve Bayes and Complement Naïve Bayes Algorithms

Abstract

Big data is commonly characterized by three Vs: volume, velocity and variety. Its size and complexity make it hard to store, process and analyze, and traditional tools take too long to execute on it. Tools and libraries built for big data reduce this execution time. For example, Hadoop is an open-source software library that allows distributed computing over large datasets, Mahout is a library for machine learning, Hive supports querying, and Kafka supports messaging. In this paper, Hadoop and Mahout are used to compare the performance of the Naïve Bayes and Complement Naïve Bayes algorithms in terms of the average percentage of correctly classified instances, average training time and average testing time across datasets of different sizes. The "20 Newsgroups" dataset is used, and its size is increased by scaling it by factors of 2, 4 and 8, which yields datasets of size 37692, 75384 and 150768. Every experiment is run 10 times on each dataset with different smoothing, weight and normalization parameters, and the results are averaged. The experiments show that Naïve Bayes outperforms Complement Naïve Bayes on average training time, whereas Complement Naïve Bayes outperforms Naïve Bayes on the average percentage of correctly classified instances.
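
To make the difference between the two classifiers concrete, the following is a minimal, single-machine sketch in plain Python. It is not the Mahout/Hadoop implementation used in the paper. The function names (`train_counts`, `mnb_weights`, `cnb_weights`, `predict`), the `alpha` smoothing parameter and the toy documents are all hypothetical and chosen only for illustration. Class priors and weight normalization are left out for brevity. The only difference shown is where the token statistics come from: Naïve Bayes estimates them from the documents of a class, while Complement Naïve Bayes estimates them from the documents of all other classes and negates the log weight.

```python
import math
from collections import Counter, defaultdict


def train_counts(docs, labels):
    """Count tokens per class and collect the vocabulary."""
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocab = {t for c in counts.values() for t in c}
    return counts, vocab


def mnb_weights(counts, vocab, alpha=1.0):
    """Multinomial Naive Bayes: log P(token | class), additive smoothing."""
    weights = {}
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        weights[label] = {t: math.log((c[t] + alpha) / total) for t in vocab}
    return weights


def cnb_weights(counts, vocab, alpha=1.0):
    """Complement Naive Bayes: -log P(token | every class except this one)."""
    weights = {}
    for label in counts:
        complement = Counter()
        for other, c in counts.items():
            if other != label:
                complement.update(c)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[label] = {
            t: -math.log((complement[t] + alpha) / total) for t in vocab
        }
    return weights


def predict(weights, tokens):
    """Pick the class with the highest summed weight; unseen tokens are ignored."""
    def score(label):
        return sum(weights[label].get(t, 0.0) for t in tokens)
    return max(weights, key=score)


# Toy data, invented for illustration only (not from 20 Newsgroups).
docs = [
    ["ball", "goal", "team"],
    ["team", "win", "goal"],
    ["cpu", "disk", "ram"],
    ["ram", "cpu", "code"],
]
labels = ["sport", "sport", "tech", "tech"]

counts, vocab = train_counts(docs, labels)
mnb = mnb_weights(counts, vocab)
cnb = cnb_weights(counts, vocab)

print(predict(mnb, ["goal", "team"]), predict(cnb, ["goal", "team"]))
print(predict(mnb, ["cpu", "ram"]), predict(cnb, ["cpu", "ram"]))
```

On this toy data both classifiers agree. The sketch also hints at why training time may differ: the complement variant aggregates counts over all other classes for every class, which is extra work on top of the per-class counting that both share.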
