首页> 外文会议>IEEE International Symposium on Information Theory >Sequence assembly from corrupted shotgun reads
【24h】

Sequence assembly from corrupted shotgun reads

机译:序列枪从corrupt弹枪损坏中读取

获取原文

摘要

The prevalent technique for DNA sequencing consists of two main steps: shotgun sequencing, where many randomly located fragments, called reads, are extracted from the overall sequence, followed by an assembly algorithm that aims to reconstruct the original sequence. There are many different technologies that generate the reads: widely-used second-generation methods create short reads with low error rates, while emerging third-generation methods create long reads with high error rates. Both error rates and error profiles differ among methods, so reconstruction algorithms are often tailored to specific shotgun sequencing technologies. As these methods change over time, a fundamental question is whether there exist reconstruction algorithms which are robust, i.e., which perform well under a wide range of error distributions. Here we study this question of sequence assembly from corrupted reads. We make no assumption on the types of errors in the reads, but only assume a bound on their magnitude. More precisely, for each read we assume that instead of receiving the true read with no errors, we receive a corrupted read which has edit distance at most ε times the length of the read from the true read. We show that if the reads are long enough and there are sufficiently many of them, then approximate reconstruction is possible: we construct a simple algorithm such that for almost all original sequences the output of the algorithm is a sequence whose edit distance from the original one is at most O(ε) times the length of the original sequence.
机译:DNA测序的普遍技术包括两个主要步骤:shot弹枪测序,在该序列中,从整个序列中提取了许多随机定位的片段(称为读取片段),然后采用了旨在重建原始序列的组装算法。产生读取的方法有多种:使用广泛的第二代方法创建的错误率较低的短读,而新兴的第三代方法创建的错误率较高的长读。错误率和错误配置文件在方法之间都不同,因此重建算法通常针对特定的shot弹枪测序技术进行定制。当这些方法随时间变化时,一个基本问题是是否存在鲁棒的重构算法,即在宽范围的误差分布下性能良好。在这里,我们从损坏的读取中研究这个序列组装问题。我们不对读取中的错误类型进行任何假设,而仅对它们的大小进行限制。更准确地说,对于每个读取,我们假设接收的是损坏的读取,而不是没有错误地读取真实读取的内容,而该读取距离的编辑距离最多是真实读取长度的ε倍。我们表明,如果读取足够长并且有足够多的读取,则可以进行近似重构:我们构建了一个简单的算法,使得对于几乎所有原始序列,该算法的输出均为一个序列,其与原始序列的编辑距离最多为原始序列长度的O(ε)倍。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号