
Pre-processing methods and stepwise variable selection for binary classification of high-dimensional data.

Abstract

Classification of biological data has gained considerable attention in recent years. It is particularly important for cancer-related data, where accurate classification can make a substantial difference. This study is restricted to binary classification of genomic and metabolomic data. We model the response variable, which could indicate the presence of a tumor or a tumor type, as a function of the genes or frequencies represented in the biological sample units. Apart from classification, identifying the underlying variables that drive the classification is a critical issue. Unlike many other studies, which focus on dimension reduction, we emphasize variable selection. The high dimensionality of these data, coupled with extremely small sample sizes, renders conventional modeling methods inefficient.

First, several pre-processing methods are studied and applied to eliminate noise variables. Depending on the nature of the data, a single method may not work in all cases. The dataset is split into a training set and a cross-validation set. The core algorithm is based on partial least squares regression. We have developed a new stepwise method in which the sum of squared errors plays a major role in efficiently identifying explanatory variables significant to the classification. We do not seek the best model, since such a model may not exist; trade-offs between accuracy and efficiency are weighed carefully. Performance is assessed by analyzing the leave-one-sample-out misclassification error rates for both the training data and the cross-validation data. The developed methodologies are then applied to another dataset without a split.

Biological data exhibit high correlation among predictor variables. For this reason, a correlation analysis of the finally selected variables against all other variables is required. This indicates which other variables may have a similar impact. It may be of greater importance to biologists, who can use this information for appropriate interpretation.
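
As a rough illustration of the workflow the abstract describes, the sketch below pairs a forward stepwise search driven by the training sum of squared errors with a partial least squares fit, then reports the leave-one-sample-out misclassification rate. It is not the dissertation's algorithm. The one-component PLS1 fit, the 0.5 decision threshold, the stopping rule (`tol`, `max_vars`), every function name, and the toy data are assumptions made purely for illustration.

```python
def fit_pls1(X, y):
    """One-component PLS1 fit (assumed simplification); returns (x_means, y_mean, coef)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximal covariance with the response.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm < 1e-12:  # no covariance with y: fall back to predicting the mean
        return x_means, y_mean, [0.0] * p
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_means, y_mean, [q * wj for wj in w]


def predict(model, row):
    x_means, y_mean, coef = model
    return y_mean + sum(c * (v - m) for c, v, m in zip(coef, row, x_means))


def sse(X, y, cols):
    """Training sum of squared errors using only the columns in `cols`."""
    Xs = [[row[j] for j in cols] for row in X]
    model = fit_pls1(Xs, y)
    return sum((yi - predict(model, r)) ** 2 for r, yi in zip(Xs, y))


def forward_select(X, y, max_vars=5, tol=0.05):
    """Add the variable that lowers SSE most; stop when the relative gain is <= tol."""
    selected = []
    y_mean = sum(y) / len(y)
    best = sum((v - y_mean) ** 2 for v in y)  # SSE of the intercept-only model
    candidates = set(range(len(X[0])))
    while candidates and len(selected) < max_vars:
        scores = {j: sse(X, y, selected + [j]) for j in sorted(candidates)}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] <= tol * best:
            break
        selected.append(j_best)
        candidates.remove(j_best)
        best = scores[j_best]
    return selected


def loo_error(X, y, cols):
    """Leave-one-sample-out misclassification rate with a 0.5 threshold (assumed)."""
    errors = 0
    for i in range(len(X)):
        X_train = [[r[j] for j in cols] for k, r in enumerate(X) if k != i]
        y_train = [v for k, v in enumerate(y) if k != i]
        model = fit_pls1(X_train, y_train)
        label = 1 if predict(model, [X[i][j] for j in cols]) >= 0.5 else 0
        errors += label != y[i]
    return errors / len(X)


# Toy data (hypothetical): column 0 separates the classes, columns 1-3 are noise.
X = [
    [0.1,  1.0,  1.0,  0.5],
    [0.2, -1.0,  1.0, -0.5],
    [0.0,  1.0, -1.0, -0.5],
    [0.3, -1.0, -1.0,  0.5],
    [1.0,  1.0,  1.0,  0.5],
    [0.9, -1.0,  1.0, -0.5],
    [1.1,  1.0, -1.0, -0.5],
    [0.8, -1.0, -1.0,  0.5],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected = forward_select(X, y)
print("selected columns:", selected)
print("leave-one-out error:", loo_error(X, y, selected))
```

On real genomic or metabolomic data the number of variables far exceeds the number of samples, so a sketch like this would normally run only after a pre-processing filter has discarded noise variables, and the leave-one-out rate would be reported separately for the training and held-out sets.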

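Bibliographic record

  • Author

    Ramachandar, Shahla.

  • Author affiliation

    The University of Texas at Dallas.

  • Degree-granting institution: The University of Texas at Dallas.
  • Subject: Statistics.
  • Degree: Ph.D.
  • Year: 2010
  • Pages: 88 p.
  • Total pages: 88
  • Original format: PDF
  • Language: eng
  • Chinese Library Classification: Rehabilitation medicine
  • Keywords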
