首页> 外文学位 >Efficient computation of regularities in strings and applications .
【24h】

Efficient computation of regularities in strings and applications .

机译:字符串和应用中规则的有效计算。

获取原文
获取原文并翻译 | 示例

摘要

The first part of the thesis is the development of space and time efficient nonextendible (NE) and supernonextendible (SNE) repeats algorithms RPT, shown to be more efficient than previous methods based on tests using different real data sets. In particular, we describe four variants of a new fast algorithm RPT1 that, based on suffix array construction, computes all the complete NE repeats in a given string x whose length (period) p ≥ pmin, where pmin ≥ 1 is a user-specified minimum. RPT1 uses 5n bytes of space directly, but requires the LCP array, whose construction needs 6n bytes. The variants RPT1-3 and RPT1-4 execute in O( n) time independent of alphabet size and are faster than the two other algorithms previously proposed for this problem. To provide a basis of comparison for RPT1, we also describe a straightforward algorithm RPT2 that computes complete NE repeats without any recourse to suffix arrays and whose total space requirement is only 5n bytes; however, this algorithm is slower than RPT1. Furthermore, we describe new fast algorithms RPT3 for computing all complete SNE repeats in x. Of these, RPT3-2 executes in theta(n) time independent of alphabet size, thus asymptotically faster than the methods previously proposed. We conclude with a brief discussion of applications to bioinformatics and data compression.;Finally, the third part of the thesis provides a convenient framework for comparing the LZ factorization algorithms which are used in the computation of regularities in strings rather than in the traditional application to text compression. LZ factorization is the computational bottleneck 'in numerous string processing algorithms, especially in regularity studies, such as computing, repetitions, runs, repeats with fixed gap, branching repeats, sequence alignment, local periods, and data compression. Since branching repeats, sequence alignment, local periods, and data compression. Since 1977, when Ziv and Lempel described a kind of string factorization useful for data compression, there has been a succession of algorithms proposed for computing "LZ factorization". In particular, there have been several recent algorithms proposed that extend the usefulness of LZ factorization, especially to the computation of runs in a string x. We choose these algorithms and analyze each algorithm separately, and remark on them by comparing some of their important aspects, for example, additional space required and handling mechanism. We also address their output format differences and some special features. We then provide a complete theoretical comparison of their time and space efficiency. We conduct intensive testing on both time and space performance and analyze the results carefully to draw conclusions in which situations these algorithms perform best. (Abstract shortened by UMI.);The second part of the thesis deals with the issue of finding the NE multirepeats in a set of N strings of average length n under various constraints. A multirepeat is a repeat that occurs at least m times (m ≥ 2) in each of at least q ≥ 1 strings in a given set of strings. We show that RPT1 can be extended to locate the multirepeats based on the investigation of the properties of the multirepeats and various strategies. We describe algorithms to find complete NE multirepeats, first with no restriction on "gap length" (that is, the gap between occurrences of the multirepeat), then with bounded gaps. For the first problem, we propose two algorithms with worst-case time complexities O(Nn+alpha log 2 N) and O(Nn+alpha) that use 9Nn and 10Nn bytes of space, respectively, where a is the alphabet size. For the second problem, we describe an algorithm with worst-case time complexity O(RNn) that requires approximately 10Nn bytes, where R is the number of multirepeats output. We remark that if we set the min and max constraints on gaps equal to zero in this algorithm, we can find all repetitions (tandem repeats) in arbitrary subsets of a given set. We demonstrate that our algorithms are faster, more flexible and much more space efficient than algorithms recently proposed for this problem.
机译:论文的第一部分是时空高效不可扩展(NE)和超不可扩展(SNE)重复算法RPT的开发,与以前的基于使用不同真实数据集进行测试的方法相比,该算法表现出更高的效率。特别是,我们描述了一种新的快速算法RPT1的四个变体,它基于后缀数组构造,计算给定字符串x中所有完整的NE重复,字符串x的长度(周期)p≥pmin,其中pmin≥1是用户指定的最低。 RPT1直接使用5n字节的空间,但需要LCP阵列,其构造需要6n字节。变体RPT1-3和RPT1-4在O(n)时间内执行,与字母大小无关,并且比先前针对此问题提出的其他两种算法更快。为了提供RPT1的比较基础,我们还描述了一种简单的算法RPT2,该算法可计算完整的NE重复,而无需求助于后缀数组,并且其总空间需求仅为5n字节。但是,此算法比RPT1慢。此外,我们介绍了用于计算x中所有完整SNE重复的新快速算法RPT3。其中,RPT3-2在theta(n)时间内执行,而与字母大小无关,因此比以前提出的方法渐近更快。最后,我们简要讨论了在生物信息学和数据压缩中的应用。最后,论文的第三部分提供了一个方便的框架,用于比较用于字符串规则性计算的LZ分解算法,而不是传统的应用。文字压缩。 LZ分解是众多字符串处理算法中的计算瓶颈,尤其是在规律性研究中,例如计算,重复,运行,具有固定间隔的重复,分支重复,序列比对,局部时间段和数据压缩。由于重复分支,因此需要进行序列比对,局部时间段和数据压缩。自1977年Ziv和Lempel描述了一种可用于数据压缩的字符串分解以来,就提出了一系列用于计算“ LZ分解”的算法。特别是,最近提出了几种算法,这些算法扩展了LZ因式分解的用途,尤其是扩展了对字符串x中游程的计算。我们选择这些算法并分别分析每种算法,然后通过比较它们的一些重要方面(例如,所需的额外空间和处理机制)对它们进行标记。我们还将解决它们的输出格式差异和一些特殊功能。然后,我们提供了它们的时间和空间效率的完整理论比较。我们对时间和空间性能进行了严格的测试,并仔细分析了结果以得出结论,这些算法在哪种情况下效果最佳。 (本文的摘要由UMI简化。);论文的第二部分讨论了在各种约束条件下,在一组N个平均长度为n的字符串中查找NE重复的问题。多重复是在给定的一组弦中的至少q≥1个弦中的每个弦中发生至少m次(m≥2)的重复。我们显示RPT1可以扩展到基于重复的属性和各种策略的调查定位多重重复。我们描述了找到完整的NE多重复的算法,首先对“间隙长度”(即,多次重复出现之间的间隙)没有限制,然后对有限的间隙进行限制。对于第一个问题,我们提出两种具有最坏情况时间复杂度O(Nn + alpha log 2 N)和O(Nn + alpha)的算法,分别使用9Nn和10Nn个字节的空间,其中a是字母大小。对于第二个问题,我们描述一种算法,该算法具有最坏情况下的时间复杂度O(RNn),大约需要10Nn个字节,其中R是输出的重复数。我们指出,如果在此算法中将间隙的最小和最大约束设置为零,则可以在给定集合的任意子集中找到所有重复(串联重复)。我们证明,与最近针对此问题提出的算法相比,我们的算法更快,更灵活且空间效率更高。

著录项

  • 作者

    Yusufu, Munina.;

  • 作者单位

    McMaster University (Canada).;

  • 授予单位 McMaster University (Canada).;
  • 学科 Mathematics.;Computer Science.
  • 学位 Ph.D.
  • 年度 2009
  • 页码 122 p.
  • 总页数 122
  • 原文格式 PDF
  • 正文语种 eng
  • 中图分类
  • 关键词

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号