Journal: Entropy

Merging of Numerical Intervals in Entropy-Based Discretization


Abstract

As previous research indicates, a multiple-scanning methodology for discretization of numerical datasets, based on entropy, is very competitive. Discretization is a process of converting numerical values of the data records into discrete values associated with numerical intervals defined over the domains of the data records. In multiple-scanning discretization, the last step is the merging of neighboring intervals in discretized datasets as a kind of postprocessing. Our objective is to check how the error rate, measured by ten-fold cross-validation within the C4.5 system, is affected by such merging. We conducted experiments on 17 numerical datasets, using the same setup of multiple scanning, with three different options for merging: no merging at all, merging based on the smallest entropy, and merging based on the biggest entropy. As a result of the Friedman rank sum test (5% significance level), we concluded that the differences between all three approaches are statistically insignificant; there is no universally best approach. Then, we repeated all experiments 30 times, recording averages and standard deviations. The test of the difference between averages shows that, for a comparison of no merging with merging based on the smallest entropy, there are statistically highly significant differences (with a 1% significance level). In some cases, the smaller error rate is associated with no merging; in other cases, it is associated with merging based on the smallest entropy. A comparison of no merging with merging based on the biggest entropy showed similar results. So, our final conclusion was that there are highly significant differences between no merging and merging, depending on the dataset. The best approach should be chosen by trying all three approaches.
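The merging postprocessing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interval representation, the `merge_neighbors` name, and the entropy-threshold stopping rule are assumptions of this sketch. It greedily merges the adjacent pair of intervals whose combined class distribution has the smallest (or biggest) entropy.

```python
from collections import Counter
from math import log2

def entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def merge_neighbors(intervals, threshold=0.5, smallest=True):
    """Greedy postprocessing: repeatedly merge the adjacent pair of
    intervals whose combined class distribution has the smallest
    (or, with smallest=False, the biggest) entropy, stopping once that
    entropy exceeds `threshold`.

    `intervals` is an ordered list of ((low, high), Counter) pairs.
    The threshold-based stopping rule is an assumption of this sketch,
    not taken from the paper.
    """
    intervals = list(intervals)
    while len(intervals) > 1:
        # Entropy of each candidate merged pair of neighbors.
        merged = [entropy(intervals[i][1] + intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        pick = min if smallest else max
        i = pick(range(len(merged)), key=merged.__getitem__)
        if merged[i] > threshold:
            break
        # Replace the pair with one interval spanning both ranges.
        (lo, _), counts_a = intervals[i]
        (_, hi), counts_b = intervals[i + 1]
        intervals[i:i + 2] = [((lo, hi), counts_a + counts_b)]
    return intervals
```

For example, two adjacent intervals whose records carry the same class label combine into a zero-entropy interval and are merged first; merging stops when every remaining candidate pair would mix classes too much.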
