Knowledge Discovery from Databases: Cost-sensitive and imbalance learning.

Abstract

In today's business world, collecting data for business analysis is no longer difficult. The major concern for business managers is whether they can use the data to build predictive models that provide accurate information for decision-making. Knowledge Discovery from Databases (KDD) provides a guideline for collecting data by identifying the knowledge inside it. As one of the KDD steps, data mining provides a systematic and intelligent approach to learning from large amounts of data and is critical to the success of KDD. Over the past several decades, many data mining algorithms have been developed; they can be categorized as classification, association rule mining, and clustering. These algorithms have proven very effective in answering different business questions. Among these types, classification is the most popular and is widely used across business areas. However, existing classification algorithms are designed to maximize prediction accuracy under the assumption of equal class distribution and equal error costs. This assumption seldom holds in the real world. It is therefore necessary to extend current classification methods so that they can handle data with imbalanced class distributions and unequal costs. In this dissertation, I propose an Iterative Cost-sensitive Naive Bayes (ICSNB) method aimed at reducing overall misclassification cost regardless of class distribution. During each iteration, k nearest neighbors are identified and form a new training set, which is used to learn unsolved instances. In the second study, I use the characteristics of the nearest neighbor method to develop a new under-sampling method for the imbalance problem: a general method that identifies noisy instances in the data set and creates a balanced data set for learning. Both methods are validated on multiple real-world data sets. The empirical results show that my methods outperform some existing and popular methods.
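
The abstract's point that accuracy-maximizing classifiers assume equal error costs can be illustrated with the standard minimum-expected-cost decision rule. The sketch below is not the ICSNB method, whose details are not given in this abstract; it only shows how a cost matrix can change the predicted class for the same posterior probabilities. The function name min_cost_label, the posterior values, and the cost matrix are all hypothetical and chosen for illustration.

```python
from typing import Sequence


def min_cost_label(posteriors: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Return the class index with the lowest expected misclassification cost.

    posteriors[i] is the estimated probability that the true class is i
    (for example, the output of a naive Bayes model).
    cost[i][j] is the assumed cost of predicting class j when the true
    class is i; correct predictions normally cost 0.
    """
    n = len(posteriors)
    expected = [
        sum(posteriors[i] * cost[i][j] for i in range(n))
        for j in range(n)
    ]
    return min(range(n), key=expected.__getitem__)


# Hypothetical two-class problem: class 0 is the majority, class 1 is rare.
posteriors = [0.8, 0.2]
cost = [
    [0.0, 1.0],   # true class 0: a false alarm costs 1
    [10.0, 0.0],  # true class 1: a miss costs 10
]

# Accuracy-oriented rule: pick the most probable class.
accuracy_choice = max(range(len(posteriors)), key=posteriors.__getitem__)

# Cost-sensitive rule: pick the class with the lowest expected cost.
cost_choice = min_cost_label(posteriors, cost)

print(accuracy_choice, cost_choice)
```

With these illustrative numbers, predicting class 0 has an expected cost of 0.2 × 10 = 2.0 and predicting class 1 has an expected cost of 0.8 × 1 = 0.8, so the cost-sensitive rule selects the rare class even though the majority class is more probable.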

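Bibliographic record

  • Author

    Yang, Zhuo

  • Author affiliation

    The University of Utah

  • Degree-granting institution: The University of Utah
  • Subject: Business Administration Management; Information Technology
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 107 p.
  • Total pages: 107
  • Original format: PDF
  • Language of text: English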