...
Journal: Neural Computing & Applications

Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches


Abstract

Knowledge of DNA sequences is indispensable for basic biological research. Many researchers use DNA sequencing for various purposes, including molecular biology research and sequence comparison for individual identification. Automated DNA sequencing devices use four colored chromatograms, or base-calling signals, to indicate the strength of hybridization for each base channel. Typically, the relative strengths of the peaks at each base location are used to quantify the quality and/or reliability of individual readings. However, assessing the overall quality of whole DNA trace files remains an open problem. Classifying raw DNA trace files as high or low quality is therefore important for the efficient use of resources. In this study, we used several supervised machine learning approaches, including logistic regression and ensemble decision trees, to identify high- or acceptable-quality chromatogram files, and compared their prediction performance. To develop and test our ideas, we used a public DNA trace repository consisting of 1626 high-quality and 631 low-quality files labeled by our expert molecular biologist. Our results indicate that, although all of the methods tried offer comparable and acceptable performance, the random forest decision tree algorithm with adaptive boosting ensemble learning shows slightly higher prediction accuracy with as few as four features.
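Because the two classes in the abstract are unequal in size (1626 high-quality versus 631 low-quality files), any reported accuracy is easier to interpret next to a trivial baseline. The sketch below is not taken from the paper: it only does arithmetic on the two counts stated in the abstract, and the names `N_HIGH`, `N_LOW`, and `majority_baseline` are hypothetical, introduced for illustration.

```python
# Class counts as stated in the abstract.
N_HIGH = 1626  # trace files labeled high quality
N_LOW = 631    # trace files labeled low quality


def majority_baseline(n_high: int, n_low: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    total = n_high + n_low
    return max(n_high, n_low) / total


total_files = N_HIGH + N_LOW
baseline = majority_baseline(N_HIGH, N_LOW)

print(f"total files: {total_files}")
print(f"majority-class baseline accuracy: {baseline:.3f}")
```

A classifier that always answers "high quality" would already be right on about 72% of these files, so the accuracies of logistic regression and the tree ensembles are best judged by how far they exceed that figure.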