Knowledge-Based Systems

High-dimensional feature selection for genomic datasets

Abstract

A central problem in machine learning and pattern recognition is identifying the most important features. In this paper, we provide a new feature selection method (DRPT) that first removes the irrelevant features and then detects correlations between the remaining features. Let $D = [A \mid b]$ be a dataset, where $b$ is the class label and $A$ is a matrix whose columns are the features. We solve $Ax = b$ using the least squares method and the pseudo-inverse of $A$. Each component of $x$ can be viewed as a weight assigned to the corresponding column (feature). We define a threshold based on the local maxima of $x$ and remove those features whose weights are smaller than the threshold. To detect the correlations in the reduced matrix, which we still call $A$, we consider a perturbation $\tilde{A}$ of $A$. We prove that correlations are encoded in $\Delta x = |x - \tilde{x}|$, where $\tilde{x}$ is the least squares solution of $\tilde{A}\tilde{x} = b$. We cluster features first based on $\Delta x$ and then using the entropy of features. Finally, a feature is selected from each sub-cluster based on its weight and entropy. The effectiveness of DRPT has been verified by performing a series of comparisons with seven state-of-the-art feature selection methods over ten genetic datasets ranging from 9,117 to 267,604 features. The results show that, overall, the performance of DRPT is favorable in several aspects compared to each feature selection algorithm. © 2020 Elsevier B.V. All rights reserved.
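The first two stages described in the abstract (least squares weights with a threshold, then the perturbation shift $\Delta x$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the exact threshold rule or the form of the perturbation, so the mean of the local maxima of $|x|$ and the additive Gaussian noise used below are assumptions, and the function names and the `scale` and `seed` parameters are hypothetical. The clustering and entropy steps are omitted.

```python
import numpy as np


def feature_weights(A, b):
    """Least squares solution of Ax = b via the Moore-Penrose pseudo-inverse."""
    return np.linalg.pinv(A) @ b


def local_maxima(v):
    """Return interior entries of v that are strictly greater than both neighbours."""
    inner = v[1:-1]
    mask = (inner > v[:-2]) & (inner > v[2:])
    return inner[mask]


def filter_irrelevant(A, b):
    """Keep features whose absolute weight reaches a threshold.

    ASSUMPTION: the threshold is the mean of the local maxima of |x|.
    The abstract only says the threshold is "based on the local maxima of x".
    """
    x = feature_weights(A, b)
    w = np.abs(x)
    peaks = local_maxima(w)
    threshold = peaks.mean() if peaks.size else 0.0
    keep = np.flatnonzero(w >= threshold)
    return keep, x


def perturbation_shift(A, b, scale=1e-3, seed=0):
    """Compute Delta x = |x - x_tilde| for a perturbed matrix A_tilde.

    ASSUMPTION: A_tilde = A + small Gaussian noise; the abstract does not
    specify how the perturbation is constructed.
    """
    rng = np.random.default_rng(seed)
    A_tilde = A + scale * rng.standard_normal(A.shape)
    x = feature_weights(A, b)
    x_tilde = feature_weights(A_tilde, b)
    return np.abs(x - x_tilde)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    A = rng.standard_normal((50, 8))          # 50 samples, 8 features
    b = A @ np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0.5])
    keep, x = filter_irrelevant(A, b)
    delta_x = perturbation_shift(A[:, keep], b)
    print("kept feature indices:", keep)
    print("Delta x on reduced matrix:", delta_x)
```

When a dataset has far more features than samples, as the genomic datasets in the abstract do, `np.linalg.pinv` returns the minimum-norm least squares solution, so the weights remain well defined even though $Ax = b$ is underdetermined.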
