Source: U.S. National Institutes of Health literature, PLoS Clinical Trials

Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data



Abstract

Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
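
The abstract compares gene rankings by the pointwise mutual information (PMI) between disease terms and gene names. As a rough illustration of that quantity only, the sketch below computes PMI from literature co-occurrence counts. The function name, the gene labels, and all counts are hypothetical, and the use of document counts with a base-2 logarithm is an assumption; the abstract does not specify how the authors computed their scores.

```python
import math


def pointwise_mutual_information(n_joint, n_gene, n_term, n_total):
    """PMI (base 2) of a gene name and a disease term from document counts.

    n_joint: documents mentioning both the gene and the term
    n_gene:  documents mentioning the gene
    n_term:  documents mentioning the term
    n_total: documents in the corpus
    """
    if min(n_gene, n_term, n_total) <= 0:
        raise ValueError("n_gene, n_term and n_total must be positive")
    if n_joint == 0:
        # The pair never co-occurs, so PMI is negative infinity.
        return float("-inf")
    # log2( p(gene, term) / (p(gene) * p(term)) ), rewritten with raw counts.
    return math.log2((n_joint * n_total) / (n_gene * n_term))


# Hypothetical counts: gene -> (documents with gene and term, documents with gene).
counts = {
    "GENE_A": (40, 200),
    "GENE_B": (5, 500),
}
N_TERM = 1_000      # hypothetical number of documents mentioning the disease term
N_TOTAL = 100_000   # hypothetical corpus size

scores = {
    gene: pointwise_mutual_information(joint, n_gene, N_TERM, N_TOTAL)
    for gene, (joint, n_gene) in counts.items()
}

# Genes more strongly associated with the disease term come first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A positive score means the gene and the term co-occur more often than they would if they were independent, and a score of zero means they co-occur exactly as often as independence predicts.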
