Conference: IEEE International Conference on Bioinformatics and Biomedicine

Extreme Phenotype Sampling Improves LASSO and Random Forest Marker Selection for Complex Traits

Abstract

Most attempts to fit a supervised machine learning (ML) model in bioinformatics try to predict the full range of trait or response values. While such prediction tasks capture the entire phenotypic range of the samples, they can be cost-prohibitive and statistically underpowered for detecting rare variants. In a study design known as extreme phenotype sampling (EPS), samples are selected from the two extremes of the phenotypic distribution. This approach cuts costs by reducing the genotyping/sequencing required, and it can also increase statistical power. Although combining EPS with ML algorithms has the potential to enhance association studies by improving their computational efficiency, EPS-ML approaches have seen limited use. In this paper we demonstrate an efficient and effective way to leverage the EPS study design using LASSO regression and random forests, two ML algorithms commonly used in the broader bioinformatics community. We analyze two distinct data sets: leaf expression values generated from black cottonwood, and malaria parasite transcriptome data collected from patients. We demonstrate that focusing only on the phenotypic extremes of these sample sets (by forming binary classes) can select more biologically meaningful features than using the full range. This approach will be useful to investigators examining complex or novel traits. It is particularly well suited to RNA-seq data, where investigators often want to narrow attention to a small number of candidate transcripts out of a large initial pool. Our approach intentionally leverages existing software with efficient implementations to enable future applications of EPS-ML.
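The sampling step described in the abstract (keep only the two phenotypic extremes and treat them as binary classes) can be illustrated with a minimal sketch. The function name `extreme_phenotype_split`, the `fraction` parameter, and the toy trait values are all hypothetical; the abstract does not state which cutoff the authors used, so this is an illustration under stated assumptions rather than the paper's procedure.

```python
from typing import List, Sequence, Tuple


def extreme_phenotype_split(
    phenotypes: Sequence[float], fraction: float = 0.25
) -> Tuple[List[int], List[int]]:
    """Return (indices, labels) for the samples in the two phenotypic tails.

    `fraction` is the share of samples taken from EACH tail. The default of
    0.25 is an assumption for illustration; the abstract gives no cutoff.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5]")
    n = len(phenotypes)
    k = int(n * fraction)  # number of samples kept per tail
    if k < 1:
        raise ValueError("too few samples for the requested fraction")
    # Sort sample indices by phenotype value, lowest first.
    order = sorted(range(n), key=lambda i: phenotypes[i])
    low, high = order[:k], order[-k:]
    indices = low + high
    labels = [0] * k + [1] * k  # 0 = low extreme, 1 = high extreme
    return indices, labels


if __name__ == "__main__":
    # Toy trait values for eight samples (made up for this example).
    trait = [5.1, 0.3, 9.8, 4.7, 0.9, 8.8, 5.5, 4.9]
    idx, y = extreme_phenotype_split(trait, fraction=0.25)
    print(idx, y)  # [1, 4, 5, 2] [0, 0, 1, 1]
```

The rows of an expression matrix at `idx`, paired with the labels `y`, could then be passed to an L1-penalized classifier or a random forest for feature ranking. scikit-learn's `LogisticRegression(penalty="l1", solver="liblinear")` and `RandomForestClassifier` are one possible choice; the abstract does not name the software the authors used.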
