
Active Mining Discriminative Gene Sets (Invited)



Abstract

Searching for good discriminative gene sets (DGSs) in microarray data is important for many problems, such as precise cancer diagnosis, correct treatment selection, and drug discovery. Small, good DGSs help researchers eliminate "irrelevant" genes and focus on "critical" genes that may serve as biomarkers or that are related to the development of cancers. In addition, small DGSs do not impose demanding requirements on classifiers, e.g., high-speed CPUs, large memory, etc. Furthermore, if DGSs are used as diagnostic measures in the future, small DGSs will simplify the test and therefore reduce its cost. Here, we propose an algorithm for finding DGSs, which we call active mining discriminative gene sets (AM-DGS). The search scheme of AM-DGS is as follows. A gene with a large t-statistic is assigned as the seed, i.e., the first feature of the DGS. We then classify the samples in the data set using a support vector machine (SVM). Next, we add to the DGS the gene with the greatest power to correct the misclassified samples, that is, the gene with the largest t-statistic evaluated on the misclassified samples only. We keep adding genes to the DGS according to the SVM's misclassified samples until no error remains or overfitting occurs. We tested the proposed method on the well-known leukemia data set. On this data set, our method obtained two 2-gene DGSs that achieved 94.1% testing accuracy and a 4-gene DGS that achieved 97.1% testing accuracy. These results show that our method obtained better accuracy with much smaller DGSs than three widely used methods, i.e., T-statistics, F-statistics, and SVM-based recursive feature elimination (SVM-RFE).
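The search loop described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions, not the authors' implementation: the abstract does not say which t-statistic variant is used, so the sketch assumes Welch's form; a nearest-centroid classifier stands in for the SVM so that the example needs only the standard library; a hypothetical `max_genes` cap replaces the overfitting check; and the loop also stops when the misclassified samples do not allow a t-statistic to be computed. All function names and the toy data are invented for illustration and are unrelated to the leukemia data set.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Absolute Welch t-statistic between two lists of expression values."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # assumption: an undefined statistic scores zero
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = abs(mean(a) - mean(b))
    if se == 0.0:
        return math.inf if diff > 0 else 0.0
    return diff / se


def rank_gene(X, y, rows, exclude):
    """Return the gene with the largest t-statistic over the given sample rows."""
    best, best_score = None, 0.0
    for g in range(len(X[0])):
        if g in exclude:
            continue
        a = [X[i][g] for i in rows if y[i] == 0]
        b = [X[i][g] for i in rows if y[i] == 1]
        score = t_statistic(a, b)
        if score > best_score:
            best, best_score = g, score
    return best


def centroid_predict(X, y, genes):
    """Stand-in classifier (NOT an SVM): nearest class centroid on the chosen genes."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in range(len(X)) if y[i] == c]
        cents[c] = [mean(X[i][g] for i in rows) for g in genes]
    preds = []
    for row in X:
        d = {c: sum((row[g] - m) ** 2 for g, m in zip(genes, cents[c])) for c in (0, 1)}
        preds.append(0 if d[0] <= d[1] else 1)
    return preds


def am_dgs(X, y, predict=centroid_predict, max_genes=5):
    """Greedy, error-driven gene selection in the spirit of AM-DGS."""
    all_rows = range(len(X))
    seed = rank_gene(X, y, all_rows, exclude=set())
    if seed is None:
        raise ValueError("no gene separates the two classes")
    genes = [seed]
    while len(genes) < max_genes:  # cap replaces the overfitting check (assumption)
        preds = predict(X, y, genes)
        wrong = [i for i in all_rows if preds[i] != y[i]]
        if not wrong:
            break  # no training error remains
        nxt = rank_gene(X, y, wrong, exclude=set(genes))
        if nxt is None:
            break  # misclassified samples give no usable t-statistic
        genes.append(nxt)
    return genes


# Synthetic toy data: 12 samples x 3 genes, labels 0/1.
X_demo = [
    [0.0, 2.0, 1.0], [0.0, -2.0, 2.0], [0.0, 2.0, 1.0], [0.0, -2.0, 2.0],
    [3.0, -2.0, 1.0], [3.0, -3.0, 2.0],
    [4.0, 2.0, 2.0], [4.0, -2.0, 1.0], [4.0, 2.0, 2.0], [4.0, -2.0, 1.0],
    [1.0, 2.0, 2.0], [1.0, 3.0, 1.0],
]
y_demo = [0] * 6 + [1] * 6

selected = am_dgs(X_demo, y_demo)
print(selected)
```

On this toy data, gene 0 is the seed and leaves four samples misclassified; gene 1 is uninformative over the whole data set but separates exactly those four samples, so it is the gene added next. The `predict` parameter is a plug-in point: a function that trains an SVM on the selected columns and returns training-set predictions could be passed instead of `centroid_predict`.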
