Journal: Genes and genomics

Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data


Abstract

Background: Data mining techniques are used to extract previously unknown knowledge from large datasets. Microarray gene expression (MGE) data plays a major role in predicting cancer type. Because MGE data is large in volume, applying traditional data mining approaches is time-consuming, so parallel programming frameworks such as Hadoop, Spark, and Mahout are needed to ease the computational burden.

Objective: Not all gene expressions are needed for prediction, so selecting the important genes is essential for improving classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only the predictive genes in much less time, without affecting prediction accuracy.

Methods: A parallelized hybrid feature selection (HFS) method is proposed. It consists of parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The resulting accuracy values are compared with those of existing rank-weight feature selection and parallelized recursive feature selection methods, and with the values obtained by executing parallelized HFS on DistributedWekaSpark.

Results: The classification accuracy obtained with the proposed parallelized HFS method is 97% for gastric cancer and 79% for childhood leukemia. The proposed method improved classification accuracy by approximately 4% to 15% compared with previous methods.

Conclusion: The results indicate that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer subtypes in less time and with higher accuracy.
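The two-stage idea described under Methods (a correlation-based subset filter followed by a rank-based selector) can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the function names, the redundancy threshold of 0.9, the signal-to-noise ranking score, and the synthetic data are all assumptions chosen for illustration, and the greedy filter is a simplified stand-in for correlation feature subset selection.

```python
import numpy as np


def correlation_filter(X, y, redundancy_threshold=0.9):
    """Stage 1 (simplified stand-in for correlation-based subset selection).

    Visit genes in order of decreasing |Pearson r| with the class label and
    keep a gene only if it is not highly correlated with a gene already kept.
    Assumes no gene is constant across samples.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    relevance = np.abs(Xc.T @ yc) / denom  # gene-vs-label correlation

    # Gene-vs-gene correlation matrix; O(p^2) memory, fine for a toy example.
    gene_corr = np.abs(np.corrcoef(X, rowvar=False))

    selected = []
    for j in np.argsort(-relevance):
        if all(gene_corr[j, k] <= redundancy_threshold for k in selected):
            selected.append(int(j))
    return selected


def rank_top_k(X, y, candidates, k):
    """Stage 2: rank candidate genes by a two-class signal-to-noise score."""
    a = X[y == 0][:, candidates]
    b = X[y == 1][:, candidates]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (
        a.std(axis=0) + b.std(axis=0) + 1e-12
    )
    order = np.argsort(-score)[:k]
    return [candidates[i] for i in order]


def hybrid_feature_selection(X, y, k=3, redundancy_threshold=0.9):
    """Run the correlation filter, then keep the k best-ranked survivors."""
    subset = correlation_filter(X, y, redundancy_threshold)
    return rank_top_k(X, y, subset, k)


def make_toy_data(seed=0, n_per_class=30, n_genes=50):
    """Synthetic stand-in for expression data (hypothetical, not from the paper)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(size=(2 * n_per_class, n_genes))
    X[:, :3] += 3.0 * y[:, None]  # genes 0, 1, 2 carry class signal
    # Gene 3 is a near-duplicate of gene 0, so it is redundant.
    X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=2 * n_per_class)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("selected genes:", hybrid_feature_selection(X, y, k=3))
```

In this toy setup the filter stage removes one of the two near-duplicate genes before ranking, so the final top-k list is not padded with redundant copies of the same signal. In a distributed setting, the per-gene statistics would be computed in parallel across partitions rather than with in-memory NumPy arrays.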
