Journal: Bioinformatics

A machine-learning approach to combined evidence validation of genome assemblies

Abstract

Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data.

Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of many parameters the individual measures require, and significantly improves the overall precisions on the benchmarking data sets.
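The abstract names the four proposed measures (GMB, RGB, AZ, ASZ) but does not define them. As a rough illustration only, the Python sketch below assumes one plausible reading: each mate pair spanning a window of the assembly gets a Z-score from its observed insert size relative to the library mean, a pair counts as "good" when that Z-score falls within a cutoff, and the four measures are simple summaries over the window. The function name `window_measures`, the `z_cutoff` parameter, and the pseudocount in the ratio are all assumptions made for this example and are not taken from the paper.

```python
from statistics import mean


def window_measures(insert_sizes, lib_mean, lib_sd, z_cutoff=3.0):
    """Summarise the mate pairs spanning one assembly window.

    insert_sizes: observed insert sizes (bp) of mate pairs covering the window.
    lib_mean, lib_sd: expected insert size and standard deviation of the library.
    z_cutoff: hypothetical threshold; a pair with |z| <= z_cutoff is "good".
    """
    if lib_sd <= 0:
        raise ValueError("lib_sd must be positive")

    # Z-score of each mate pair's insert size against the library distribution.
    z = [(size - lib_mean) / lib_sd for size in insert_sizes]

    good = sum(1 for v in z if abs(v) <= z_cutoff)
    bad = len(z) - good

    return {
        # Good-minus-bad coverage: consistent pairs minus inconsistent pairs.
        "GMB": good - bad,
        # Good-to-bad ratio; the +1 pseudocount (an assumption) avoids
        # division by zero when no bad pairs are present.
        "RGB": (good + 1) / (bad + 1),
        # Average Z-score: signed, so stretches and compressions can cancel.
        "AZ": mean(z) if z else 0.0,
        # Average absolute Z-score: unsigned deviation from the library mean.
        "ASZ": mean(abs(v) for v in z) if z else 0.0,
    }


if __name__ == "__main__":
    # Three consistent pairs and one badly stretched pair in a 3 kb library.
    print(window_measures([3000, 3100, 2900, 6000], lib_mean=3000, lib_sd=100))
```

Under these assumptions, the four values for each window could serve as a feature vector for a standard classifier, which is one way to read the "combined evidence" idea: the classifier learns how to weigh the measures instead of requiring a hand-picked threshold for each one.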