首页> 外文OA文献 >Algorithms and Data Structures for Strings, Points and Integers:or, Points about Strings and Strings about Points
【2h】

Algorithms and Data Structures for Strings, Points and Integers:or, Points about Strings and Strings about Points

机译:字符串,点和整数的算法和数据结构:或者,关于点的字符串和字符串的点

摘要

This dissertation presents our research in the broad area of algorithms and data structures. More specifically, we show solutions for the following problems related to strings, points and integers. Results hold on the Word RAM and we measure space in w-bit words.Compressed Fingerprints. The Karp-Rabin fingerprint of a string is a useful type of hash value that has multiple applications due to its strong properties. Given a string S of length N compressed into a straight line program (SLP) of size n, we show a O(n) space data structure that supports fingerprint queries, retrieving the fingerprint of any substring of S. Queries are answered in O(lg N) time. If the compression is a Linear SLP (capturing LZ78 compression and variations), we get O(lg lg N) query time.Our structure matches the best known query time bound for random access in SLPs, and is the first for general (unbalanced) SLPs that answers fingerprint queries without decompressing any text. We also support longest common extension queries, returning the length ℓ that the substrings from two given positions in S are equal. Answers are correct w.h.p. and take time O(lg N lg ℓ) and O(lg lg N + lg ℓ lg lg ℓ) for SLPs and Linear SLPs, respectively.Dynamic Compression. In the dynamic relative compression scheme, we compress a string S of length N into n substrings of a given reference string of length r. We give data structures that maintain an asymptotically optimal compression in the scheme and support access, replace, insert and delete operations on S. Our solutions support each operation in O(lg n/ lg lg n+lg lg r) time and O(n+ r) space; or O(lg n/ lg lg n) time and O(n+ r lge r) space. They can be naturally generalized to compress multiple strings.Our solutions obtains almost-optimal bounds, and are the first to dynamically maintain a string under a compression scheme that can achieve better than entropy compression. We also give improved results for the substring concatenation problem, and an extension of our structure can be used as a black box to get an improved solution to the previously studied dynamic text static pattern problem.Compressed Pattern Matching. In the streaming model, input data flows past a client one item at a time, but is far too large for the client to store. The annotated streaming model extends the model by introducing a powerful but untrusted annotator (representing “the cloud”) that can annotate input elements with additional information, sent as one-way communication to the client. We generalize the annotated streaming model to be able to solve problems on strings and present a data structure that allows us to trade off client space and annotation size. This lets us exploit the power of the annotator.In compressed pattern matching we must report occurrences of a pattern of length m in a text compressed into n phrases (capturing LZ78 compression and variations). In the streaming model, any solution to the problem requires Ω(n) space. We show that the problem can be solved in O(lg n) client space in the annotated streaming model, using O(lg n) time and O(lg n) words of annotation per phrase. Our solution shows that the annotator let us solve previously impossible problems, and it is the first solution to a classic problem from combinatorial pattern matching in the annotated streaming model.Pattern Extraction. The problem of extracting important patterns from text has many diverse applications such as data mining, intrusion detection and genomic analysis. Consequently, there are many variations of the pattern extraction problem with different notions of patterns and importance measures. We study a natural variation where patterns must 1) contain at most k don’t cares that each match a single character, and 2) have at least q occurrences. Both k and q are input parameters.We show how to extract such patterns and their occurrences from a text of length n in O(nk+k3occ) time and space, where occ is the total number of pattern occurrences. Our bound is the first output-sensitive solution for any approximate variation of the pattern extraction problem, with all previous solutions requiring Ω(n2) time per reported pattern. Our algorithm is relatively simple, but requires a novel analysis technique that amortizes the cost of creating the index over the number of pattern occurrences.Compressed Point Sets. Orthogonal range searching on a set of points is a classic geometric data structure problem. Given a query range, solutions must either count or report the points inside the range. Variants of this problem has numerous classic solutions, typically storing the points in a tree.We show that almost any such classic data structure can be compressed without asymptotically increasing the time spent answering queries. This allows us to reduce the required space use if the point set contains geometric repetitions (copies of equal point set that are translated relative to each other). Our result captures most classic data structures, such as Range Trees, KD-trees, R-trees and Quad Trees. We also show a hierarchical clustering algorithm for ensuring that geometric repetitions are compressed.Points with Colors. Colored orthogonal range searching is a natural generalization of orthogonal range searching which allows us to perform statistic analysis of a point set. We must store n points that each have a color (sometimes called a category) and support queries that either count or report the distinct colors of the points inside a query range.We show data structures that support both types of queries in sublinear time, storing two-dimensional points in linear space and high-dimensional points in almost-linear space. These are the first (almost) linear space solutions with sublinear query time. We also give the first dynamic solution with sublinear query time for any dimensionality. Previous solutions answer queries faster, but require much more space.Points with Weights in Practice. If points are each assigned a weight, it is natural to consider the threshold range counting problem. A data structure must store the points and be able to count the number of points within a query range with a weight exceeding some threshold. This query appears naturally in a software system built by Milestone Systems, and allows detecting motion in video from surveillance cameras.We implement a prototype of an index for 3-dimensional points that use little space and answers threshold queries efficiently. In experiments on realistic data sets, our prototype shows a speedup of at least a factor 30 at the expense of 10% additional space use compared to the previous approach. An optimized version of our proposed index is implemented in the latest version of the Milestone Systems software system.Finger Predecessor. The predecessor problem is to store a set of n integers from a universe of size N to support predecessor queries, returning the largest integer in the set smaller than a given integer q. We study a variation where the query additionally receives a finger to an integer ℓ in the set from which to start the search.We show a linear space data structure that answers such finger predecessor queries in O(lg lg |ℓ - q|) time. This generalizes and improves the O(lg lgN) time solutions for the standard predecessor problem. Our data structure is the first with a query time that only depends on the numerical distance between the finger and the query integer.Dynamic Partial Sums. The well-studied partial sums problem is to store a sequence of n integers with support for sum and search queries. The sequence is static in the sense that its length cannot change, but the update operation can be used to change the value of an integer in the sequence by a given value. There are matching lower and upper bounds showing that the problem can be solved on the w-bit Word RAM in linear space and _(lg n= lg(w=_)) time per operation, where _ is the maximum number of bits allowed in updates.As a natural generalization we consider dynamic partial sums, allowing insertions and deletions in the sequence. Our solution requires linear space and supports all operations in optimal worst-case time Θ(lg n/ lg(w/δ)), matching lower bounds for all supported operations. Our data structure is the first dynamic partial sums solution that matches the lower bounds, and the first to support storing integers of more than lg w bits.
机译:本文提出了我们在算法和数据结构领域的研究。更具体地说,我们为与字符串,点和整数有关的以下问题提供了解决方案。结果保持在Word RAM上,我们以w位字为单位测量空间。字符串的Karp-Rabin指纹是一种有用的哈希值类型,由于其强大的属性而具有多种应用。给定长度为N的字符串S压缩为大小为n的直线程序(SLP),我们显示了一个O(n)空间数据结构,该结构支持指纹查询,检索S的任何子字符串的指纹。查询以O( lg N)时间。如果压缩是线性SLP(捕获LZ78压缩和变化),则得到O(lg lg N)查询时间,我们的结构与SLP中随机访问的最佳已知查询时间匹配,并且是常规(不平衡)的第一个查询时间可以在不解压缩任何文本的情况下回答指纹查询的SLP。我们还支持最长的公共扩展查询,返回S中两个给定位置的子串相等的长度ℓ。答案是正确的分别为SLP和线性SLP花费时间O(lg N lgℓ)和O(lg lg N + lgℓlg lgℓ)。在动态相对压缩方案中,我们将长度为N的字符串S压缩为给定长度为r的参考字符串的n个子字符串。我们提供的数据结构可在方案中保持渐近最优压缩,并支持对S的访问,替换,插入和删除操作。我们的解决方案支持O(lg n / lg lg n + lg lg r)时间和O(n + r)空间;或O(lg n / lg lg n)时间和O(n + r lge r)空间。它们可以自然地概括为压缩多个字符串。我们的解决方案获得了几乎最优的边界,并且是第一个在比熵压缩更好的压缩方案下动态维护字符串的解决方案。我们还为子字符串连接问题提供了改进的结果,并且我们结构的扩展可以用作黑盒,从而获得对先前研究的动态文本静态模式问题的改进解决方案。压缩模式匹配。在流模型中,输入数据一次仅流经一个客户端一项,但是对于客户端而言太大了。带注释的流模型通过引入功能强大但不受信任的注释器(表示“云”)扩展了模型,该注释器可以使用附加信息对输入元素进行注释,并以单向通信的形式发送给客户端。我们对带注释的流模型进行了概括,以能够解决字符串问题,并提出了一种数据结构,使我们可以权衡客户端空间和注释大小。在压缩模式匹配中,我们必须报告在压缩为n个短语的文本中出现长度为m的模式的情况(捕获LZ78压缩和变化)。在流模型中,任何解决问题的方法都需要Ω(n)空间。我们表明,使用带注释的O(lg n)时间和O(lg n)个单词每个短语,可以在带注释的流模型中的O(lg n)客户空间中解决该问题。我们的解决方案表明,注释器使我们能够解决以前无法解决的问题,这是从带注释的流模型中组合模式匹配中解决经典问题的第一个解决方案。从文本中提取重要模式的问题有许多不同的应用,例如数据挖掘,入侵检测和基因组分析。因此,模式提取问题有许多变化,其中模式的概念和重要性度量不同。我们研究了一种自然变异,其中模式必须1)最多包含k个,而不必关心每个匹配单个字符,并且2)至少出现q次。 k和q都是输入参数。我们展示了如何从O(nk + k3occ)时空中长度为n的文本中提取此类模式及其出现,其中occ是模式出现的总数。对于任何近似的模式提取问题变化,我们的界线都是第一个输出敏感解决方案,所有先前的解决方案每个报告的模式都需要Ω(n2)时间。我们的算法相对简单,但是需要一种新颖的分析技术,该技术可以在模式出现的次数上分摊创建索引的成本。在一组点上进行正交范围搜索是一个经典的几何数据结构问题。给定查询范围,解决方案必须计算或报告该范围内的点。这个问题的变体有很多经典的解决方案,通常将点存储在树中。我们证明,几乎任何这样的经典数据结构都可以压缩,而不会逐渐增加回答查询所花费的时间。如果点集包含几何重复(相等点集的副本相对于彼此平移),则这可以减少所需的空间使用。我们的结果捕获了最经典的数据结构,例如范围树,KD树,R树和四叉树。我们还展示了一种层次聚类算法,可确保几何重复被压缩。带颜色的点。有色正交范围搜索是正交范围搜索的自然概括,它使我们能够对点集进行统计分析。我们必须存储n个点,每个点都有一个颜色(有时称为类别),并支持对查询范围内点的不同颜色进行计数或报告的查询。我们显示了在亚线性时间中支持两种类型的查询的数据结构,线性空间中的二维点和几乎线性空间中的高维点。这些是具有次线性查询时间的第一个(几乎)线性空间解决方案。对于任何维度,我们还给出了具有亚线性查询时间的第一个动态解决方案。以前的解决方案可以更快地回答查询,但需要更多的空间。如果为每个点分配了权重,则自然会考虑阈值范围计数问题。数据结构必须存储这些点,并且能够计算查询范围内权重超过某个阈值的点数。该查询自然出现在Milestone Systems构建的软件系统中,并且可以检测来自监视摄像机的视频中的运动。我们实现了3维点索引的原型,该索引只占用很少的空间并有效地回答阈值查询。在实际数据集上的实验中,我们的原型显示出与以前的方法相比,速度提高了至少30倍,但占用了10%的额外空间。我们建议的索引的优化版本已在Milestone Systems软件系统的最新版本中实现.Finger前身。前辈问题是存储大小为N的Universe中的一组n个整数以支持前辈查询,并返回该集合中小于给定整数q的最大整数。我们研究了一种变体,其中查询从集合中另外接收到一个整数an来开始搜索。我们展示了线性空间数据结构,该结构在O(lg lg |ℓ-q |)时间内回答了此类手指的前任查询。这归纳并改进了标准前任问题的O(lg lgN)时间解。我们的数据结构是第一个具有查询时间的数据结构,该查询时间仅取决于手指与查询整数之间的数字距离。充分研究的部分和问题是存储n个整数的序列,并支持求和和搜索查询。就序列的长度无法更改而言,该序列是静态的,但可以使用更新操作将序列中整数的值更改给定值。有匹配的上下限,表明可以在线性空间和每个操作_(lg n = lg(w = _))时间的w位Word RAM上解决问题,其中_是允许的最大位数作为自然概括,我们考虑动态部分和,允许在序列中进行插入和删除。我们的解决方案需要线性空间,并在最佳最坏情况时间Θ(lg n / lg(w /δ))中支持所有操作,并为所有支持的操作匹配下限。我们的数据结构是第一个匹配下限的动态部分和解,也是第一个支持存储大于lg w位的整数的解决方案。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号