
Kernel partial least squares (K-PLS) for scientific data mining.



Abstract

The aim of this dissertation is the use of kernel partial least squares (K-PLS) for scientific data mining. K-PLS is a machine learning technique that applies the kernel trick to partial least squares, a statistical technique commonly used for collinear data problems in chemometrics and drug design. It can be shown that K-PLS is closely related to modern machine learning techniques such as support vector machines and can also be interpreted as a neural network. Learning is a broad concept and can commonly be divided into four complex-systems tasks: (1) problem representation, (2) data preprocessing, (3) predictive modeling, and (4) variable and feature selection. Each of these components contributes to model transparency and prediction performance.

For the preprocessing part, a basic data transformation technique, Principal Component Analysis (PCA), has been extended to Independent Component Analysis (ICA). The ICA Transform (ICAT) and ICA-based data cleansing have been introduced. In addition, a novel kernel centering algorithm has been introduced.

In the machine learning part, the SUpport vector Parsimonious ANOVA (SUPANOVA) transparent (reversible) spline kernel has been implemented to improve the causality analysis of the model. The proposed new spline kernel has also been integrated into the K-PLS framework. The K-PLS algorithm has also been extended so that it can be implemented with any loss function for multiple responses. Additionally, Rényi's quadratic entropy loss function has been used to deal with unbalanced classification problems.

Two new variable selection algorithms have been introduced in this thesis: (1) feature selection based on sigma-tuning of the Gaussian kernel, and (2) Random Forests feature selection. These variable selection methods have been demonstrated on benchmark data sets and compared with other feature selection methods based on sensitivity analysis and Z-scores.

Finally, these methodologies have been applied to three different scientific data mining problems: (1) predicting ischemia from magnetocardiogram data; (2) Quantitative Structure-Activity Relationship (QSAR) drug design for the discovery of novel pharmaceuticals; and (3) identification of trace materials from terahertz spectra.
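Since the abstract describes K-PLS only at a high level (the kernel trick applied to partial least squares, together with kernel centering), a minimal NumPy sketch of a dual NIPALS-style kernel PLS is given below for orientation. It follows the standard published formulation of kernel PLS rather than the dissertation's own implementation; the kernel centering shown is the usual textbook version, not the novel centering algorithm mentioned in the abstract, and all function and variable names are illustrative.

```python
# Illustrative dual NIPALS kernel PLS sketch in NumPy -- not the dissertation's code.
import numpy as np

def center_train_kernel(K):
    """Textbook centering of an n x n training kernel matrix."""
    n = K.shape[0]
    J = np.ones((n, n)) / n
    return K - J @ K - K @ J + J @ K @ J

def center_test_kernel(K_t, K):
    """Center an n_test x n test-vs-train kernel block consistently with training."""
    n = K.shape[0]
    J = np.ones((n, n)) / n
    J_t = np.ones((K_t.shape[0], n)) / n
    return K_t - J_t @ K - K_t @ J + J_t @ K @ J

def kpls_fit(Kc, Y, n_components, n_iter=500, tol=1e-10):
    """Dual NIPALS K-PLS on a centered kernel Kc (n x n) and responses Y (n x m)."""
    n = Kc.shape[0]
    Kd, Yd = Kc.copy(), Y.copy()          # deflated working copies
    T = np.zeros((n, n_components))       # latent score vectors
    U = np.zeros((n, n_components))       # response-side weight vectors
    I = np.eye(n)
    for a in range(n_components):
        u = Yd[:, [0]].copy()
        for _ in range(n_iter):
            t = Kd @ u
            t /= np.linalg.norm(t)
            c = Yd.T @ t                  # response loadings
            u_new = Yd @ c
            u_new /= np.linalg.norm(u_new)
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        T[:, [a]], U[:, [a]] = t, u
        D = I - t @ t.T                   # deflate kernel and responses
        Kd = D @ Kd @ D
        Yd = D @ Yd
    return T, U

def kpls_predict(Kc_new, Kc_train, Y, T, U):
    """Predict responses; Kc_new is the centered kernel of new points vs. training points."""
    B = U @ np.linalg.pinv(T.T @ Kc_train @ U) @ (T.T @ Y)   # dual regression coefficients
    return Kc_new @ B
```

In this sketch, a Gaussian (RBF) kernel would be computed and centered before calling kpls_fit; its width sigma is then the hyperparameter that the sigma-tuning feature selection mentioned in the abstract would vary.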

Bibliographic record

  • Author

    Han, Long.

  • Affiliation

    Rensselaer Polytechnic Institute.

  • Degree grantor Rensselaer Polytechnic Institute.
  • Subject Engineering, Industrial; Operations Research.
  • Degree Ph.D.
  • Year 2007
  • Pages 151 p.
  • Total pages 151
  • Format PDF
  • Language eng
  • CLC classification General industrial technology; Operations research
  • Keywords

