Variable and feature selection in large datasets.

Abstract

Variable and feature selection is an important component of the manipulation and analysis of massive data sets. The idea is to preprocess the data, which may contain a large number of features, and filter out irrelevant or redundant features. The reduced data can then be further analyzed with standard data mining or machine learning techniques. The work presented here is motivated by machine learning applications, where feature selection can be applied to unlabeled data, or alternatively to the labeled data that is available for training. In the machine learning literature, the selection of features from unlabeled data is called unsupervised feature selection, while the selection of features from labeled data is called supervised feature selection. We describe new algorithms for both the supervised and the unsupervised case that can efficiently perform feature selection from large amounts of data. The first algorithm addresses the unsupervised case. It improves on the current state-of-the-art unsupervised feature selection algorithms in terms of run time and the number of passes over the data. The algorithm is a modification of the classical pivoted QR algorithm of Businger and Golub. It selects exactly the same features as the classical pivoted QR algorithm and has the same favorable numerical stability. The classical algorithm stood unchallenged (in terms of run time and the number of passes) for almost 40 years. We describe experiments on real-world datasets, which sometimes show an improvement of several orders of magnitude over the commonly used classical algorithm.

Algorithms for unsupervised feature selection cannot be used for supervised feature selection. Our contribution to supervised feature selection addresses a generalization of the standard supervised case, in which one attempts to approximate an entire matrix in terms of another matrix. Specifically, we consider simultaneously approximating all the columns of a data matrix in terms of a few selected columns of another matrix that is sometimes called "the dictionary". We describe fast algorithms for this task. Our algorithms improve on the speed and the memory requirements of the current state of the art while producing exactly the same output. This makes it possible to apply feature selection to large and sparse datasets that could not be handled by previously known techniques. For example, we describe results on a very large and sparse commonly available dataset, on which our algorithm takes less than 4 minutes and 150 megabytes of memory. Using a naive approach for the same problem may take hundreds of thousands of years. Using the current state of the art would take about 7 hours and would require 240 gigabytes of memory.
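
For orientation, the selection rule behind classical column-pivoted QR, which the abstract names as the baseline for the unsupervised case, can be sketched as below. This is not the dissertation's faster algorithm. It is a minimal pure-Python illustration of the greedy pivoting rule, and it uses Gram-Schmidt-style projection rather than the Householder reflections of Businger and Golub's formulation. The function name `pivoted_qr_select`, the tolerance, and the toy data are assumptions made for the example.

```python
import math


def pivoted_qr_select(columns, k, tol=1e-12):
    """Return indices of up to k columns chosen by greedy column pivoting.

    `columns` is a list of equal-length lists, one list per feature.
    At each step the column with the largest residual norm is selected,
    then its direction is projected out of every remaining column.
    (Illustrative sketch only; `tol` is an assumed cutoff.)
    """
    residual = [[float(x) for x in col] for col in columns]  # working copy
    selected = []
    for _ in range(min(k, len(residual))):
        # Squared norm of what is left of each unselected column.
        norms = [
            -1.0 if j in selected else sum(x * x for x in col)
            for j, col in enumerate(residual)
        ]
        pivot = max(range(len(residual)), key=norms.__getitem__)
        if norms[pivot] <= tol:
            break  # remaining columns lie in the span of the selected ones
        selected.append(pivot)
        scale = math.sqrt(norms[pivot])
        q = [x / scale for x in residual[pivot]]  # unit vector of the pivot
        for j, col in enumerate(residual):
            if j in selected:
                continue
            dot = sum(a * b for a, b in zip(q, col))
            residual[j] = [c - dot * a for c, a in zip(col, q)]
    return selected


# Toy data: four features observed on three samples (hypothetical values).
features = [[1, 0, 0], [2, 0, 0], [0, 3, 0], [1, 1, 0]]
print(pivoted_qr_select(features, 2))  # prints [2, 1]
```

In the toy data, feature 2 has the largest norm and is picked first. Feature 1 then has the largest remaining component and is picked second, after which features 0 and 3 are fully explained by the two selected columns. Each selection step here touches every remaining column, which is the kind of repeated pass over the data that the abstract says the new algorithm reduces.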


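Bibliographic record

Author: Maung, Crystal.
Author affiliation: The University of Texas at Dallas.
Degree-granting institution: The University of Texas at Dallas.
Subjects: Computer Science; Information Science; Artificial Intelligence.
Degree: Ph.D.
Year: 2014
Pages: 98 p.
Total pages: 98
Original format: PDF
Language of text: English
CLC classification: Rehabilitation medicine
Date added to database: 2022-08-17 11:53:49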

Similar literature

1. Feature selection with limited datasets. [J]. Kupinski MA, Giger ML. Medical Physics. 1999, No. 10
2. A Feature Selection Model to Filter Periodic Variable Stars with Data-sensitive Light-variable Characteristics [J]. Chen Jiwei, Tang Guojian. Journal of Signal Processing Systems for Signal, Image, and Video Technology. 2021, No. 7
3. Normalized neighborhood component feature selection and feasible-improved weight allocation for input variable selection [J]. Kim Hansu, Lee Tae Hee, Kwon Taejoon. Knowledge-Based Systems. 2021, No. Apra22
4. Measurement for Methane Concentration Based on Feature Variable Extraction and Feature Variable Selection [C]. Tang Xiaojun, Zhang Jinyong, Liu Junhua. International Symposium on Test and Measurement; ISTM/2005. 2005
5. Robust and efficient feature selection for high-dimensional datasets. [D]. Mo, Dengyao. 2011
6. Feature expressions: creating and manipulating sequence datasets. [O]. B Fristensky. 1993
7. Regression with empirical variable selection: description of a new method and application to ecological datasets. [O]. Goodenough A.E., Hart A.G., Stafford Rick. 2012
8. Using Visualization, Variable Selection and Feature Extraction to Learn from Industrial Data; Doctoral thesis [R]. Laine, S. 2003
