
Discretization as the enabling technique for the Naive Bayes and semi-Naive Bayes-based classification



Abstract

Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear-complexity classifiers that include one of the top 10 voted data mining methods, Naive Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since, by design, they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM), class-attribute interdependency maximization (CAIM), and information entropy maximization (IEM), provide discretization schemes with the highest overall quality. We show that discretization improves the classification accuracy when compared against the two classical methods, NB and Flexible Naive Bayes (FNB), executed on the raw data. The choice of the discretization algorithm impacts the significance of the improvements. The MODL, FCAIM, and CAIM methods provide statistically significant improvements, while the IEM, class-attribute contingency coefficient (CACC), and Khiops discretizers provide moderate improvements. The most accurate classification models are generated by the averaged one-dependence estimators with subsumption resolution (AODEsr) classifier, followed by AODE and hidden Naive Bayes (HNB). AODEsr run on data discretized with MODL, FCAIM, and CAIM provides statistically significantly better accuracies than both classical NB methods. The worst results are obtained with the NB, FNB, and lazy Bayesian rules (LBR) classifiers. We show that although the time to build the discretization scheme can be longer than the time to train the classifier, completing the entire process (discretizing the data, computing the classifier, and predicting the test instances) is often faster than NB-based classification of the continuous instances. This is because the time to classify test instances is an important factor that is positively influenced by discretization. The biggest positive influence, on both the accuracy and the classification time, is associated with the MODL, FCAIM, and CAIM algorithms.
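To make the front-end-discretization pipeline concrete, here is a minimal sketch (not the paper's experimental setup) of the process the abstract describes: continuous features are discretized first, a Naive Bayes classifier is then trained on the resulting nominal features, and its accuracy is compared against classical NB run on the raw continuous data. The supervised discretizers studied in the paper (CAIM, FCAIM, MODL, IEM, etc.) are not available in scikit-learn, so an unsupervised equal-frequency discretizer stands in for them, and the iris data set is used only as a placeholder.

```python
# Minimal sketch of the discretize-then-classify pipeline described above.
# Assumption: scikit-learn stand-ins replace the paper's supervised discretizers.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Front-end discretization of the continuous features, followed by NB on the
# resulting nominal features (equal-frequency binning replaces CAIM/FCAIM/MODL).
discretized_nb = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    CategoricalNB(min_categories=5),  # min_categories guards against unseen bins
)
discretized_nb.fit(X_train, y_train)

# Classical NB on the raw continuous features (Gaussian assumption) as the baseline.
continuous_nb = GaussianNB().fit(X_train, y_train)

print("NB on discretized features:", discretized_nb.score(X_test, y_test))
print("NB on continuous features: ", continuous_nb.score(X_test, y_test))
```

The timing comparison reported in the paper (discretization plus training plus prediction versus NB on continuous instances) would be measured around the fit and score calls of the two models above.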

Record details

  • Source
    The Knowledge Engineering Review | 2010, Issue 4 | pp. 421-449 | 29 pages
  • Author affiliations

    Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada;

    Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada;

    Bio-cybernetics Laboratory, Institute of Automatics, AGH University of Science and Technology, Krakow, Poland;

  • Indexing information
  • Format: PDF
  • Language: eng
  • CLC classification
  • Keywords

  • Added to database: 2022-08-18 00:38:55

