Approximate string matching in DNA sequences

Abstract

Approximate string matching on large DNA sequence data is very important in bioinformatics. Some studies have shown that the suffix tree is an efficient data structure for approximate string matching, and that it performs better than the suffix array when the data structure can be stored entirely in memory. However, our study finds that the suffix array is much better than the suffix tree for indexing DNA sequences, since the data structure has to be created and stored on disk due to its size. We propose a novel auxiliary data structure that greatly improves the efficiency of the suffix array for approximate string matching in the external memory model. The second problem we tackle is parallel approximate matching in DNA sequences. We propose two novel parallel algorithms for this problem and implement them on a PC cluster. The results show that when the allowed error is small, directly partitioning the array over the machines in the cluster is the more efficient approach. On the other hand, when the allowed error is large, partitioning the data over the machines is the better approach.
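
To make the role of the suffix array concrete, here is a minimal in-memory sketch of approximate matching over a suffix array. It is not the auxiliary data structure or the external-memory method proposed in the abstract, which the abstract does not describe. The sketch assumes that "error" means Hamming distance (mismatches only) and uses the standard pigeonhole seed-and-verify idea: if a pattern matches with at most k mismatches, at least one of its k + 1 pieces must match exactly. The function names `build_suffix_array`, `exact_range`, and `approx_search` are hypothetical, chosen for this illustration.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes; adequate only for short texts."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_range(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffix-array ranks whose
    suffixes start with `pattern`, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first rank whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def approx_search(text, sa, pattern, k):
    """Return sorted start positions where `pattern` occurs in `text`
    with at most k mismatches (Hamming distance, an assumption here)."""
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than the allowed error k")
    pieces = k + 1
    hits = set()
    for j in range(pieces):
        # Piece j of the pattern; every piece is non-empty because m > k.
        start = j * m // pieces
        end = (j + 1) * m // pieces
        lo, hi = exact_range(text, sa, pattern[start:end])
        for rank in range(lo, hi):
            pos = sa[rank] - start  # candidate start of the full pattern
            if pos < 0 or pos + m > len(text):
                continue
            mismatches = sum(a != b for a, b in zip(text[pos:pos + m], pattern))
            if mismatches <= k:
                hits.add(pos)
    return sorted(hits)


text = "ACGTACGTTACGAACGT"
sa = build_suffix_array(text)
print(approx_search(text, sa, "ACGA", 1))  # prints [0, 4, 9, 13]
```

In this sketch each seed lookup costs two binary searches over the array, which is the kind of access pattern that becomes expensive once the array lives on disk.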