
Variable and feature selection in large datasets.


Abstract

Variable and feature selection are important components in the manipulation and analysis of massive data sets. The idea is to preprocess the data, which may contain a large number of features, and filter out irrelevant or redundant features. The reduced data can then be further analyzed with standard data mining or machine learning techniques. The work presented here is motivated by machine learning applications, where feature selection can be applied to unlabeled data, or alternatively to the labeled data that is available for training. In the machine learning literature, the selection of features from unlabeled data is called unsupervised feature selection, while the selection of features from labeled data is called supervised feature selection. We describe new algorithms for both the supervised and the unsupervised case that can efficiently perform feature selection from large amounts of data. The first algorithm addresses the unsupervised case. It improves on the current state-of-the-art unsupervised feature selection algorithms in terms of run time and the number of passes over the data. The algorithm is a modification of the classical pivoted QR algorithm of Businger and Golub. It selects exactly the same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. The classical algorithm stood unchallenged (in terms of run time and the number of passes) for almost 40 years. We describe experiments on real-world datasets which sometimes show an improvement of several orders of magnitude over the commonly used classical algorithm.

Algorithms for unsupervised feature selection cannot be used for supervised feature selection. Our contribution to supervised feature selection addresses a generalization of the standard supervised feature selection case, where one attempts to approximate an entire matrix in terms of another matrix. Specifically, we consider simultaneously approximating all the columns of a data matrix in terms of a few selected columns of another matrix that is sometimes called "the dictionary". We describe fast algorithms for this task. Our algorithms improve on the speed and the memory requirements of the current state of the art, while producing exactly the same output. This makes it possible to apply feature selection to large and sparse datasets that could not be handled by previously known techniques. For example, we describe results on a very large and sparse commonly available dataset, on which our algorithm takes less than 4 minutes and 150 megabytes of memory. Using a naive approach for the same problem may take hundreds of thousands of years. Using the current state of the art would take about 7 hours, and would require 240 gigabytes of memory.
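To make the two selection problems in the abstract concrete, here is a minimal pure-Python sketch. It is not the thesis's algorithms, which the abstract does not spell out. `unsupervised_select` uses the classical pivot rule of column-pivoted QR (repeatedly take the column with the largest residual norm), implemented with Gram-Schmidt updates rather than the Householder reflections of Businger and Golub. `supervised_select` is an assumed greedy forward-selection baseline for approximating all target columns with a few dictionary columns; the abstract does not say which selection criterion the thesis uses. All function and parameter names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _project_out(q, vectors):
    """Remove from each vector (in place) its component along unit vector q."""
    for v in vectors:
        d = _dot(q, v)
        for i in range(len(v)):
            v[i] -= d * q[i]


def unsupervised_select(columns, k, tol=1e-12):
    """Pick up to k columns with the column-pivoted QR pivot rule."""
    resid = [list(c) for c in columns]  # residuals w.r.t. chosen columns
    selected = []
    for _ in range(min(k, len(resid))):
        norms = [_dot(r, r) for r in resid]
        pivot = max(range(len(resid)), key=norms.__getitem__)
        if norms[pivot] <= tol:
            break  # every remaining column already lies in the chosen span
        selected.append(pivot)
        scale = math.sqrt(norms[pivot])
        q = [x / scale for x in resid[pivot]]
        _project_out(q, resid)
    return selected


def supervised_select(dictionary, targets, k, tol=1e-12):
    """Greedy forward selection (an assumed baseline): at each step add the
    dictionary column that most reduces the squared Frobenius error of
    approximating all target columns at once."""
    dict_resid = [list(c) for c in dictionary]
    target_resid = [list(c) for c in targets]
    selected = []
    for _ in range(min(k, len(dict_resid))):
        best, best_gain = None, tol
        for j, x in enumerate(dict_resid):
            nx = _dot(x, x)
            if nx <= tol:
                continue  # already chosen, or redundant given chosen columns
            gain = sum(_dot(x, r) ** 2 for r in target_resid) / nx
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break  # no remaining column reduces the error
        selected.append(best)
        x = dict_resid[best]
        scale = math.sqrt(_dot(x, x))
        q = [v / scale for v in x]
        _project_out(q, dict_resid)
        _project_out(q, target_resid)
    return selected


# Column 1 is a scaled copy of column 0, so only one of the two is picked.
cols = [[3.0, 0.0, 0.0], [6.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
print(unsupervised_select(cols, 2))  # → [1, 2]

dictionary = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
targets = [[0.0, 5.0, 0.0], [0.0, 4.0, 1.0]]
print(supervised_select(dictionary, targets, 2))  # → [1, 2]
```

Both functions store every residual column densely and sweep over all of them at each step, which is the kind of cost in time, memory, and passes over the data that the abstract says the thesis's algorithms reduce. The sketch is only meant to show what "selecting columns" means in each setting.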

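Bibliographic record

  • Author: Maung, Crystal
  • Author affiliation: The University of Texas at Dallas
  • Degree-granting institution: The University of Texas at Dallas
  • Subject: Computer Science; Information Science; Artificial Intelligence
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 98 p.
  • Total pages: 98
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Rehabilitation medicine
  • Keywords:
  • Date added to database: 2022-08-17 11:53:49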
