Foreign-language journal (as listed): Military operations research

Shallow learning model for diagnosing neuro muscular disorder from splicing variants


Abstract

Purpose - Diagnosing a genetic neuromuscular disorder such as muscular dystrophy is complicated when the defect arises during splicing. This paper aims to predict the type of muscular dystrophy from gene sequences by extracting well-defined descriptors related to splicing mutations. An automatic model is built to classify the disease using pattern recognition techniques coded in Python with the scikit-learn framework.

Design/methodology/approach - The cloned gene sequences are synthesized from the mutation position and its location on the chromosome using the positional cloning approach. For instance, in the Human Gene Mutation Database (HGMD), a splicing mutation is specified in a form such as IVS1-5 T > G, where IVS (intervening sequence, or intron) followed by 1 denotes the first intron, -5 denotes five nucleotides before the consensus intron site AG, and T > G means that nucleotide T is altered to G. A positive IVS offset counts forward, in the 3′ direction, from the G of the invariant donor site, and a negative IVS offset counts backward, in the 5′ direction, from the G of the acceptor site. The key idea is to identify discriminative descriptors in diseased gene sequences based on splicing variants and to provide an effective machine learning solution for predicting the type of muscular dystrophy associated with splicing mutations. Multi-class classification is carried out through data modeling of gene sequences. Synthetic mutated gene sequences are created because diseased gene sequences are not readily obtainable for this intricate disease. The positional cloning approach supports generating diseased gene sequences from the mutational information acquired from HGMD. SNP-, gene- and exon-based discriminative features are identified and used to train the model. A muscular dystrophy prediction model is built using supervised learning techniques in the scikit-learn environment.
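The IVS notation described above can be illustrated with a small parser. This is only a sketch and not the authors' code: the `SpliceVariant` type, the `parse_ivs` function and the regular expression are hypothetical names introduced here, and the pattern assumes strings shaped exactly like the `IVS1-5 T > G` example, which may not cover every form that appears in HGMD.

```python
import re
from typing import NamedTuple


class SpliceVariant(NamedTuple):
    """Hypothetical container for one parsed splicing mutation."""
    intron: int   # intron number, e.g. 1 for IVS1
    offset: int   # signed distance from the G of the splice site
    ref: str      # original nucleotide
    alt: str      # altered nucleotide
    site: str     # "donor" for positive offsets, "acceptor" for negative


# Assumed shape: IVS<intron><signed offset> <ref> > <alt>, e.g. "IVS1-5 T > G".
_IVS_PATTERN = re.compile(
    r"^IVS(\d+)\s*([+-]\d+)\s*([ACGT])\s*>\s*([ACGT])$", re.IGNORECASE
)


def parse_ivs(notation: str) -> SpliceVariant:
    """Split an IVS-style string into intron, offset and base change."""
    match = _IVS_PATTERN.match(notation.strip())
    if match is None:
        raise ValueError(f"unrecognized IVS notation: {notation!r}")
    intron, offset, ref, alt = match.groups()
    offset_value = int(offset)
    if offset_value == 0:
        raise ValueError("offset must be non-zero")
    # Positive offsets are counted from the donor site, negative ones
    # from the acceptor site, following the convention in the abstract.
    site = "donor" if offset_value > 0 else "acceptor"
    return SpliceVariant(int(intron), offset_value, ref.upper(), alt.upper(), site)


variant = parse_ivs("IVS1-5 T > G")
print(variant)
```

Fields such as `intron`, `offset` and `site` could then serve as raw inputs to feature extraction, although the abstract does not state which descriptors the authors actually derive.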
The data frame is built from the extracted features as a NumPy array. The data are normalized by transforming the feature values into the range between 0 and 1, which aids in scaling the input attributes for a model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the scikit-learn Python library.

Findings - To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy with respect to splicing mutations. Certain essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. A model is built using statistical learning techniques through scikit-learn in the Anaconda environment. The paper also discusses the results of statistical learning carried out on the same set of gene sequences with synonymous and non-synonymous mutational descriptors.

Research limitations/implications - The data frame is built as a NumPy array. Normalizing the data by transforming the feature values into the range between 0 and 1 aids in scaling the input attributes for a model. Naïve Bayes, decision tree, K-nearest neighbor and SVM models are developed using the scikit-learn Python library. While training the SVM model, the cost, gamma and kernel parameters are tuned to attain good results. The classifiers are scored with tenfold cross-validation using the metric functions of the scikit-learn library. Results of the disease identification model based on non-synonymous, synonymous and splicing mutations are analyzed.

Practical implications - Certain essential SNP-, gene- and exon-based descriptors related to splicing mutations are proposed and extracted from the cloned gene sequences. A model is built using statistical learning techniques through scikit-learn in the Anaconda environment.
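The normalization, the four classifier families and the tenfold cross-validation named in the abstract might be combined roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the feature matrix is random placeholder data standing in for the SNP-, gene- and exon-based descriptors, `GaussianNB` is an assumed choice of Naïve Bayes variant, and all hyperparameters are arbitrary defaults.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): 120 samples, 6 numeric descriptors,
# 3 disease classes. Real descriptors would come from the gene sequences.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)
X = rng.normal(size=(120, 6)) + y[:, None] * 1.5

models = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf", C=1.0, gamma="scale"),
}

# Tenfold stratified cross-validation; MinMaxScaler maps each feature
# into [0, 1] and is refit inside every training fold by the pipeline.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    pipeline = make_pipeline(MinMaxScaler(), model)
    fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

Placing the scaler inside the pipeline keeps the held-out fold out of the scaling statistics. Whether the original study scaled before or inside cross-validation is not stated in the abstract.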
The performance of the classifiers is improved by using different estimators from the scikit-learn library. Several types of mutations, such as missense, nonsense and silent mutations, are also considered to build models through statistical learning techniques, and their results are analyzed.

Originality/value - To the best of the authors' knowledge, this is the first pattern recognition model to classify muscular dystrophy with respect to splicing mutations.
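The abstract mentions tuning the cost, gamma and kernel parameters of the SVM. One common way to do this in scikit-learn is a grid search, sketched below. The grid values, the placeholder data and the mapping of "cost" to the `C` parameter of `SVC` are assumptions made for illustration; the abstract does not report the search strategy or the ranges the authors used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder data (assumption): 90 samples, 5 descriptors, 3 classes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 5)) + y[:, None]

pipeline = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])

# Hypothetical search grid; gamma is ignored by the linear kernel.
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1, 1.0],
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean accuracy: {search.best_score_:.3f}")
```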
