Journal: Computers & Operations Research

Evaluating the performance of cost-based discretization versus entropy- and error-based discretization


Abstract

Discretization is the process of dividing continuous numeric values into intervals that serve as discrete categorical values. This article introduces cost-based discretization as a pre-processing step to the induction of a classifier, with the goal of obtaining an optimal multi-interval split for each numeric attribute. It gives a transparent description of the method and the steps involved. The aim of the paper is to present this method and to assess its potential benefits. Its performance is also compared against two well-known alternatives, entropy-based and pure error-based discretization. To this end, experiments were carried out on 14 data sets taken from the UCI Machine Learning Repository. The methods were compared using the area under the Receiver Operating Characteristic (ROC) curve, and the differences were tested for statistical significance. For most data sets, the results show that cost-based discretization achieves satisfactory results when compared to entropy- and error-based discretization. Given the importance of discretization, many researchers have contributed to the topic in the past. However, to the best of our knowledge, no effort has yet been made to use misclassification costs to find an optimal multi-split for discretization purposes prior to induction of the decision tree. For this reason, this new concept is introduced and explored in this article by means of operations research techniques.
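To make the idea of a cost-driven multi-interval split concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not spell out the procedure, so the dynamic program, the function names `interval_cost` and `cost_based_cuts`, and the cost matrix in the example are all assumptions chosen for illustration. Each interval is labelled with the class that minimises its total misclassification cost, and the cut points are chosen to minimise the sum of these costs over at most `max_intervals` intervals.

```python
def interval_cost(counts, cost):
    """Cheapest cost of labelling a whole interval with a single class.

    counts[c] is the number of examples of true class c in the interval;
    cost[c][p] is the (assumed) cost of predicting p when the truth is c.
    """
    n_classes = len(cost)
    return min(
        sum(counts[c] * cost[c][p] for c in range(n_classes))
        for p in range(n_classes)
    )


def cost_based_cuts(values, labels, cost, max_intervals):
    """Return (total_cost, cut_points) for a minimum-cost split of one
    numeric attribute into at most max_intervals intervals."""
    n_classes = len(cost)
    # Group examples by distinct value so equal values are never separated.
    distinct = sorted(set(values))
    index = {v: i for i, v in enumerate(distinct)}
    per_value = [[0] * n_classes for _ in distinct]
    for v, y in zip(values, labels):
        per_value[index[v]][y] += 1
    m = len(distinct)

    # prefix[i][c] = count of class c among the first i distinct values.
    prefix = [[0] * n_classes]
    for row in per_value:
        prefix.append([a + b for a, b in zip(prefix[-1], row)])

    def seg(i, j):
        # Cost of a single interval covering distinct[i:j].
        counts = [prefix[j][c] - prefix[i][c] for c in range(n_classes)]
        return interval_cost(counts, cost)

    inf = float("inf")
    # best[k][j] = min cost of covering distinct[:j] with exactly k intervals.
    best = [[inf] * (m + 1) for _ in range(max_intervals + 1)]
    back = [[0] * (m + 1) for _ in range(max_intervals + 1)]
    best[0][0] = 0
    for k in range(1, max_intervals + 1):
        for j in range(1, m + 1):
            for i in range(k - 1, j):
                cand = best[k - 1][i] + seg(i, j)
                if cand < best[k][j]:
                    best[k][j], back[k][j] = cand, i

    # Prefer the smallest number of intervals that reaches the minimum cost.
    k = min(range(1, max_intervals + 1), key=lambda q: (best[q][m], q))
    total, cuts, j = best[k][m], [], m
    while k > 0:
        i = back[k][j]
        if i > 0:
            # Place the cut halfway between two adjacent distinct values.
            cuts.append((distinct[i - 1] + distinct[i]) / 2)
        j, k = i, k - 1
    return total, sorted(cuts)


# Illustrative data: missing class 1 is assumed to cost five times more
# than a false alarm on class 0.
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 1, 1, 0, 0]
cost = [[0, 1], [5, 0]]
total, cuts = cost_based_cuts(values, labels, cost, max_intervals=3)
```

With a 0/1 cost matrix the same routine reduces to a pure error-based criterion, which is one way to see how the cost-based variant relates to the error-based method named in the abstract.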
