High-dimensional data classification is a challenging problem. A standard approach to tackle this problem is to perform variable selection, e.g. using step-wise or LASSO procedures. Another standard way is to perform dimension reduction, e.g. by Principal Component Analysis or Partial Least Squares procedures. The approach proposed in this paper combines both dimension reduction and variable selection. First, a procedure of clustering of variables is used to build groups of correlated variables in order to reduce the redundancy of information. This dimension reduction step relies on the R package ClustOfVar, which can deal with both numerical and categorical variables. Secondly, the most relevant synthetic variables (which are numerical variables summarizing the groups obtained in the first step) are selected with a procedure of variable selection using random forests, implemented in the R package VSURF. The numerical performance of the proposed methodology, called CoV/VSURF, is compared with direct applications of VSURF or random forests on the original $p$ variables. Improvements obtained with the CoV/VSURF procedure are illustrated on two simulated mixed datasets (cases $n > p$ and $n \ll p$) and on a real proteomic dataset.
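
The two steps described above can be sketched in R as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the variable names, and the fixed number of groups `k = 4` are hypothetical. The abstract does not say how the number of groups is chosen, so fixing it by hand here is a simplification. Only numerical variables are used; categorical variables could be passed to `hclustvar` through its `X.quali` argument.

```r
library(ClustOfVar)
library(VSURF)

set.seed(1)
n <- 100

# Hypothetical toy data: 4 latent factors, each observed through 3 noisy copies,
# giving 12 numerical variables arranged in 4 correlated groups.
latent <- matrix(rnorm(n * 4), n, 4)
X <- latent[, rep(1:4, each = 3)] + matrix(rnorm(n * 12, sd = 0.3), n, 12)
colnames(X) <- paste0("x", 1:12)

# Binary response driven by the first latent factor only (assumption for the demo).
y <- factor(ifelse(latent[, 1] + rnorm(n, sd = 0.5) > 0, "a", "b"))

# Step 1: hierarchical clustering of the variables, then a cut into k groups.
# k = 4 is fixed by hand here; it is an assumption of this sketch.
tree <- hclustvar(X.quanti = X)
part <- cutreevar(tree, k = 4)

# One numerical synthetic variable per group (n rows, k columns).
synth <- part$scores

# Step 2: variable selection with random forests on the synthetic variables.
sel <- VSURF(x = synth, y = y)

# Indices of the synthetic variables kept at the interpretation step.
kept <- sel$varselect.interp

# Original variables belonging to the selected groups.
selected_original <- colnames(X)[part$cluster %in% kept]
print(kept)
print(selected_original)
```

Because selection operates on the synthetic variables, a selected synthetic variable brings its whole group of original variables with it, which is why the last lines map the kept indices back to the columns of `X`.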