We show how to represent sets in a linear space data structure such thatexpressions involving unions and intersections of sets can be computed in aworst-case efficient way. This problem has applications in e.g. informationretrieval and database systems. We mainly consider the RAM model ofcomputation, and sets of machine words, but also state our results in the I/Omodel. On a RAM with word size $w$, a special case of our result is that theintersection of $m$ (preprocessed) sets, containing $n$ elements in total, canbe computed in expected time $O(n (\log w)^2 / w + km)$, where $k$ is thenumber of elements in the intersection. If the first of the two termsdominates, this is a factor $w^{1-o(1)}$ faster than the standard solution ofmerging sorted lists. We show a cell probe lower bound of time $\Omega(n/(w m\log m)+ (1-\tfrac{\log k}{w}) k)$, meaning that our upper bound is nearlyoptimal for small $m$. Our algorithm uses a novel combination of approximateset representations and word-level parallelism.
展开▼
机译:我们展示了如何在线性空间数据结构中表示集合,从而可以用最坏情况下有效的方式来计算涉及集合的并集和交集的表达式。这个问题在例如信息检索和数据库系统。我们主要考虑计算的RAM模型和机器字集,但也将结果记录在I / O模型中。在字大小为$ w $的RAM上,我们结果的一个特殊情况是,可以在预期的时间$ O(n(\ log w)中计算出总共包含$ n $个元素的$ m $(预处理)集合的交集^ 2 / w + km)$,其中$ k $是交点中的元素数量。如果两个术语中的第一个占主导,则这比合并排序列表的标准解决方案快$ w ^ {1-o(1)} $倍。我们显示了单元探针的时间下限$ \ Omega(n /(wm \ log m)+(1- \ tfrac {\ log k} {w})k)$,这意味着对于小$,我们的上限几乎是最优的m $。我们的算法使用了近似集表示和词级并行性的新颖组合。
展开▼