Metodos para mejorar la calidad de un conjunto de datos para descubrir conocimiento.

Machine translation: Methods for improving the quality of a dataset for knowledge discovery.

Abstract

Today, data generation is growing exponentially in both directions: instances (rows) and features (columns). As a result, many datasets cannot be analyzed without preprocessing. The large size of a dataset may cause serious scalability and performance problems for some data mining algorithms. In addition, the quality of the data may be inadequate for the knowledge discovery process. It is therefore necessary to preprocess the dataset so that the data mining algorithm runs efficiently and yields accurate results. In this thesis, we introduce new measures to evaluate the quality of a dataset in the context of supervised classification. From these quality measures, we obtain two ways of quantifying data complexity for a classification problem; specifically, we try to anticipate the behavior of a classification algorithm on a given dataset. Our data complexity measures are compared with others available in the literature and give similar performance at a lower computational cost. For data cleaning, we propose a new method that is independent of the classification algorithm. The proposed method detects and eliminates the noise in each class, and it is more efficient and accurate than other methods available in the literature. In the context of dimensionality reduction, we propose two new methods for feature selection. These methods are compared with two well-known feature selection methods, RELIEF and Sequential Forward Selection (SFS), and obtain similar results at a much lower computational cost. Furthermore, we propose a new algorithm that improves the scalability of the instance selection algorithms currently in use. Finally, we integrate the three processes (data cleaning, dimensionality reduction, and instance selection) to generate a training set that allows data mining algorithms to run efficiently and yield accurate results.
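
As background on the SFS baseline named in the abstract, the following is a minimal sketch of textbook Sequential Forward Selection: start from an empty feature set and greedily add the feature that most improves a scoring criterion. It is not the thesis's proposed method. The leave-one-out 1-nearest-neighbour criterion, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Row = Sequence[float]


def loo_1nn_accuracy(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a 1-NN classifier using only `features`.

    Assumed scoring criterion; any wrapper or filter score could be used.
    """
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue  # leave the query point out of its own neighbour search
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def sequential_forward_selection(
    X: List[Row],
    y: List[int],
    k: int,
    score: Callable[[List[Row], List[int], List[int]], float] = loo_1nn_accuracy,
) -> List[int]:
    """Greedily pick up to k feature indices, one per round."""
    selected: List[int] = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        # Try each unused feature alongside those already chosen; keep the best.
        best = max(remaining, key=lambda f: score(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: column 0 separates the classes, column 1 does not.
X_toy = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y_toy = [0, 0, 0, 1, 1, 1]
first_pick = sequential_forward_selection(X_toy, y_toy, k=1)
```

Each round rescans every remaining feature, so the number of criterion evaluations grows quadratically with the number of features, which is the kind of cost the abstract says the proposed methods reduce.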

Bibliographic details

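  • Author affiliation: University of Puerto Rico, Mayaguez (Puerto Rico)
  • Degree-granting institution: University of Puerto Rico, Mayaguez (Puerto Rico)
  • Subjects: Statistics; Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 172 p.
  • Total pages: 172
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Statistics; Automation and Computer Technology
  • Keywords

  • Date indexed: 2022-08-17 11:39:12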
