Conference: International Conference on IT Convergence and Security

An Improved SVM-T-RFE Based on Intensity-Dependent Normalization for Feature Selection in Gene Expression of Big-Data

Abstract

Thanks to the next-generation sequencing (NGS) revolution, high-throughput RNA sequencing (RNA-seq) has become a highly sensitive and accurate method of measuring gene expression. Because RNA-seq generates huge amounts of data, researchers have struggled with a lack of computational methods that can exploit RNA-seq Big-Data. In most cases, existing methods have not provided an adequate feature scaling scheme for RNA-seq Big-Data. RNA-seq thus encourages computational biologists to identify both novel and well-known features, and it has led to wider adoption of earlier methods and to the development of new, scalable data analysis methods. It has also brought recognition to some deep learning methods that are scalable and adaptable for making assumptions about, and selecting, highly correlated genes for classification and prediction. However, some assumptions of those methods have not always been correct, and the methods have been considered unstable for large-scale gene expression profiling. We therefore propose an improved feature selection technique that combines the well-known support vector machine recursive feature elimination (SVM-RFE) with T-statistics, based on intensity-dependent normalization, which uses the log differential expression ratio (M vs. A plot) to improve scalability. In each iteration of SVM-RFE, the less dominant feature set with respect to relevance and redundancy is excluded from the current set of features. In the proposed algorithm, the most relevant and least redundant features are included in the final feature set, achieving comparable accuracy with small subsets of Big-Data such as NCBI-GEO. The proposed algorithm is compared with the existing one on several known datasets. The results indicate that the proposed algorithm is more convenient and faster than the previous one because it relies entirely on functions from R packages, and that it reduces running time on Big-Data.
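
The abstract names two ingredients without giving formulas: the M vs. A transform behind intensity-dependent normalization, and a per-gene t-statistic. The sketch below illustrates both in plain Python under stated assumptions; it is not the authors' implementation, which the abstract says is built on R packages. The function names, the pseudocount, the running-median trend (a simple stand-in for the loess fit commonly used for this kind of normalization), and the choice of Welch's form of the t-statistic are all assumptions made for illustration. The SVM-RFE step itself and the way the paper combines it with the t-statistic are not shown.

```python
import math
from statistics import mean, median, variance


def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) lists for two expression profiles over the same genes.

    M is the log2 ratio and A is the mean log2 intensity.
    The pseudocount (an assumption) avoids log2(0) for zero counts.
    """
    m, a = [], []
    for xi, yi in zip(x, y):
        lx = math.log2(xi + pseudocount)
        ly = math.log2(yi + pseudocount)
        m.append(lx - ly)
        a.append(0.5 * (lx + ly))
    return m, a


def intensity_normalize(m, a, window=3):
    """Subtract an intensity-dependent trend from M.

    The trend at each gene is the median of M over its neighbours when
    genes are ordered by A. This running median is a hypothetical,
    simplified substitute for a loess curve fitted to the M vs. A plot.
    """
    order = sorted(range(len(a)), key=lambda i: a[i])
    half = window // 2
    normalized = [0.0] * len(m)
    for pos, i in enumerate(order):
        lo = max(0, pos - half)
        hi = min(len(order), pos + half + 1)
        trend = median(m[j] for j in order[lo:hi])
        normalized[i] = m[i] - trend
    return normalized


def welch_t(group1, group2):
    """Welch's t-statistic for one gene measured in two sample groups.

    Each group needs at least two values so the sample variance exists.
    """
    n1, n2 = len(group1), len(group2)
    se = math.sqrt(variance(group1) / n1 + variance(group2) / n2)
    if se == 0:
        return 0.0
    return (mean(group1) - mean(group2)) / se
```

In a pipeline of this kind, genes would typically be normalized first and then scored, with the score feeding into the elimination step; how the paper weights the t-statistic against the SVM ranking is not stated in the abstract.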
