A two-stage discretization algorithm based on information entropy

Abstract

Discretization is an important and difficult preprocessing task for data mining and knowledge discovery. Although there are numerous discretization approaches, many suffer from certain drawbacks. Local approaches are efficient, but their generalization ability is weak. Global approaches consider all attributes simultaneously, but they have high time and space complexities. In this paper, we propose a two-stage discretization (TSD) algorithm based on information entropy. In the local discretization stage, we independently select k strong cuts for each attribute to minimize conditional entropy. The goal is to rapidly reduce the cardinality of the attributes with only minor information loss. In the global discretization stage, cuts for all attributes are considered simultaneously to form a scaled decision system. Finally, the minimal cut set that preserves the positive region is selected. We tested the new algorithm and seven popular algorithms on 28 datasets. Compared with the other approaches, our algorithm has the best generalization ability, with good information-preserving ability, the highest classification accuracy, and reasonable time consumption.
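To make the local stage concrete, the following is a minimal sketch of choosing k cuts for a single attribute so that the conditional entropy of the class given the resulting intervals is low. The abstract does not say how the k strong cuts are searched for, so the greedy, one-cut-at-a-time strategy here is an assumption, and the names `entropy`, `conditional_entropy`, and `select_k_cuts` are hypothetical rather than taken from the paper. The global stage (building the scaled decision system and reducing it to a minimal cut set) is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(values, labels, cuts):
    """H(class | interval), where `cuts` partition `values` into intervals."""
    bins = {}
    for v, y in zip(values, labels):
        idx = sum(v > c for c in cuts)  # index of the interval containing v
        bins.setdefault(idx, []).append(y)
    n = len(values)
    return sum(len(ys) / n * entropy(ys) for ys in bins.values())


def select_k_cuts(values, labels, k):
    """Greedy search (an assumption, not the paper's stated procedure):
    repeatedly add the candidate cut that lowers conditional entropy most."""
    distinct = sorted(set(values))
    # Candidate cuts are midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    chosen = []
    for _ in range(min(k, len(candidates))):
        best = min(
            (c for c in candidates if c not in chosen),
            key=lambda c: conditional_entropy(values, labels, chosen + [c]),
        )
        chosen.append(best)
    return sorted(chosen)


# Toy attribute: class "b" occupies the middle of the value range.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "a", "b", "b", "a", "a"]
cuts = select_k_cuts(values, labels, k=2)
print(cuts)  # [2.5, 4.5]
```

On this toy attribute the two selected cuts separate the classes completely, so the conditional entropy drops to zero while six distinct values are reduced to three intervals. Running the same routine independently on every attribute would mirror the per-attribute nature of the local stage described above.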