Journal: Journal of Biomedical Informatics

A simulation to analyze feature selection methods utilizing gene ontology for gene expression classification


Abstract

Gene expression profile classification is a pivotal research domain that supports the transition from traditional to personalized medicine. A major challenge in classifying gene expression data is the small number of samples relative to the large number of genes. To address this problem, researchers have devised various feature selection algorithms that reduce the number of genes. Recent studies have experimented with using the semantic similarity between genes in Gene Ontology (GO) to improve feature selection. While a few studies discuss how to use GO for feature selection, no simulation study addresses when to use GO-based feature selection. To investigate this, we developed a novel simulation that generates binary-class datasets in which the genes differentially expressed between the two classes have some underlying relationship in GO. This allows us to investigate the effects of various factors, such as the relative connectedness of the underlying genes in GO, the mean magnitude of separation between differentially expressed genes (denoted by δ), and the number of training samples. Our simulation results suggest that the connectedness in GO of the differentially expressed genes for a biological condition is the primary factor determining the efficacy of GO-based feature selection. In particular, as the connectedness of the differentially expressed genes increases, the improvement in classification accuracy increases. To quantify this notion of connectedness, we defined a measure called the Biological Condition Annotation Level, BCAL(G), where G is a graph of differentially expressed genes.
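The data-generation idea described above can be sketched as follows. This is only an illustrative sketch, not the authors' simulation: the function name `simulate_binary_dataset`, the standard-normal background, and the constant per-gene shift of δ are all assumptions, and the GO relationship among the differentially expressed genes is not modeled here.

```python
import numpy as np

def simulate_binary_dataset(n_samples, n_genes, de_genes, delta, seed=0):
    """Hypothetical simulator (assumption, not the paper's method):
    standard-normal expression values, with the differentially expressed
    (DE) genes shifted by `delta` in class 1."""
    rng = np.random.default_rng(seed)
    # Roughly balanced labels: first half class 0, second half class 1.
    half = n_samples // 2
    y = np.repeat([0, 1], [half, n_samples - half])
    # Background expression for every sample and gene.
    X = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_genes))
    # Shift only the DE genes, and only for class-1 samples.
    X[np.ix_(y == 1, de_genes)] += delta
    return X, y

# Example: 40 samples, 100 genes, the first three genes separated by delta = 2.
X, y = simulate_binary_dataset(40, 100, de_genes=[0, 1, 2], delta=2.0)
```

Varying `delta` and `n_samples` in such a sketch mirrors two of the factors the abstract names; the third factor, connectedness in GO, would require choosing `de_genes` from the ontology graph.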
Our main conclusions with respect to GO-based feature selection are the following: (1) it increases classification accuracy when BCAL(G) ≥ 0.696; (2) it decreases classification accuracy when BCAL(G) ≤ 0.389; (3) it provides marginal accuracy improvement when 0.389 < BCAL(G) < 0.696 and δ < 1; (4) when δ ≥ 0.7, the improvement from GO-based feature selection decreases as the number of genes in a biological condition increases beyond 50; and (5) we recommend not using GO-based feature selection when a biological condition has fewer than ten genes. Our results are derived from datasets preprocessed using RMA (Robust Multi-array Average), with δ between 0.3 and 2.5 and training sample sizes between 20 and 200; our conclusions are therefore limited to these settings. Overall, this simulation is innovative and addresses the question of when SoFoCles-style feature selection should be used for classification instead of statistical ranking measures.
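The five conclusions above can be read as a simple lookup. The sketch below only restates the reported thresholds; the function name `applicable_conclusions` and its return format are hypothetical, and it does not compute BCAL(G) itself, since the abstract does not give its formula.

```python
def applicable_conclusions(bcal, delta, n_genes):
    """List which of the abstract's conclusions (1)-(5) apply.

    bcal    -- BCAL(G) value for the biological condition
    delta   -- mean separation of the differentially expressed genes
    n_genes -- number of genes in the biological condition
    """
    notes = []
    if bcal >= 0.696:
        notes.append("(1) accuracy increases")
    elif bcal <= 0.389:
        notes.append("(2) accuracy decreases")
    elif delta < 1:
        notes.append("(3) marginal improvement")
    if n_genes > 50 and delta >= 0.7:
        notes.append("(4) improvement shrinks")
    if n_genes < 10:
        notes.append("(5) not recommended")
    # The study only covers delta between 0.3 and 2.5.
    if not 0.3 <= delta <= 2.5:
        notes.append("delta outside the simulated range")
    return notes

print(applicable_conclusions(0.8, 0.5, 30))  # ['(1) accuracy increases']
```

Returning every applicable statement, rather than a single verdict, avoids assuming a precedence among the conclusions that the abstract does not state.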