Conference: International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics

A Unified Regularized Group PLS Algorithm Scalable to Big Data. Application on Genomics Data

Abstract

Partial Least Squares (PLS) methods have been widely used to analyse the association between two blocks of data. These powerful approaches can be applied to data sets where the number of variables is greater than the number of observations, and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modelling is a key factor in obtaining better estimators and identifying associations between multiple data sets. The cornerstone of the sparse versions of PLS is the link between the SVD of a matrix (constructed from deflated versions of the original data matrices) and least squares minimisation in linear regression. We present here an accurate description of the most popular PLS methods, alongside their mathematical proofs. A unified algorithm is proposed to perform all four types of PLS, including their regularised versions. Our methods enable us to identify important relationships between genomic expression and cytokine data from an HIV vaccination trial. We also propose a new methodology that accounts for both the grouping of genetic markers (e.g. gene sets) and temporal effects. Finally, various approaches to decrease the computation time are offered, and we show how the whole procedure scales to big data sets.
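
To make the link between the SVD and sparse PLS concrete, the following is a minimal sketch, not the authors' unified algorithm. It assumes column-centred blocks X and Y, takes the leading singular vectors of M = X^T Y as the first pair of PLS weight vectors, and obtains a sparse pair by alternating soft-thresholded rank-one updates, in the spirit of common sparse PLS formulations. The function names and the `keep_x` / `keep_y` parameters (number of variables retained in each block) are hypothetical choices made for this illustration; deflation and the group penalties mentioned in the abstract are not covered.

```python
import numpy as np


def keep_top(z, k):
    """Soft-threshold z so that only its k largest-magnitude entries stay non-zero.

    Assumes k >= 1. With k=None (or k >= len(z)) z is returned unchanged.
    """
    if k is None or k >= z.size:
        return z
    a = np.abs(z)
    lam = np.sort(a)[-(k + 1)]  # (k+1)-th largest magnitude used as the threshold
    return np.sign(z) * np.maximum(a - lam, 0.0)


def sparse_pls_weights(X, Y, keep_x=None, keep_y=None, n_iter=200, tol=1e-10):
    """First pair of (optionally sparse) PLS weight vectors (u, v).

    Without sparsity this returns the leading singular vectors of X^T Y,
    i.e. the best rank-one least squares approximation of that matrix.
    """
    M = X.T @ Y
    # Start from the leading singular vectors of the cross-product matrix.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    u, v = U[:, 0], Vt[0, :]
    for _ in range(n_iter):
        u_old = u
        # Each update is a penalised least squares step with the other vector fixed.
        u = keep_top(M @ v, keep_x)
        u = u / np.linalg.norm(u)
        v = keep_top(M.T @ u, keep_y)
        v = v / np.linalg.norm(v)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u, v
```

Calling `sparse_pls_weights(X, Y, keep_x=10, keep_y=3)` returns unit-norm weight vectors with at most 10 and 3 non-zero entries respectively; the non-zero positions indicate the selected variables in each block.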
