首页> 外文会议>International conference on database systems for advanced applications >Tree Contraction for Compressed Suffix Arrays on Modern Processors
【24h】

Tree Contraction for Compressed Suffix Arrays on Modern Processors

机译:在现代处理器上压缩后缀阵列的树收缩

获取原文

摘要

We propose a novel processor-aware compaction technique for pattern matching that is widely-used in databases, information retrieval, and text mining. As the amount of data increases, it is getting important to efficiently store data on memory. A compressed suffix array (CSA) is a compact data structure for in-memory pattern matching. However, CSA suffers from tremendous processor penalties, such as a flood of instructions and cache/TLB misses due to the lack of processor-aware design. To mitigate these penalties, we propose a novel compaction technique for CSA, called suffix trie contraction (STC). The frequently accessed suffixes of CSA are transformed to a trie (e.g., a suffix trie), and then inter-connected nodes in the trie are repeatedly 'contracted' to a single node, which enables lightweight sequential scans in a processor-friendly way. In detail, STC consists of two contraction techniques: fixed-length path contraction (FPC) and sub-tree contraction (SC). FPC is applied to the parts with a few branches in the trie, and SC is applied to the parts with many branches. Our experiment results indicate that FPC outperforms naive CSA by two orders of magnitude for short pattern queries and by three times for long pattern queries. As the number of branches inside the trie increases, SC gradually becomes superior to CSA and FPC for short pattern queries. Finally, the latency and throughput of STC are 7 times and 72 times better than those of CSA for the TREC test data set at the expense of additional 7.1 % space overhead.
机译:我们提出了一种新颖的处理器感知压实技术,用于模式匹配,这些压缩技术在数据库,信息检索和文本挖掘中广泛使用。随着数据量的增加,有效地存储在内存上的数据是很重要的。压缩后缀数组(CSA)是用于内存模式匹配的紧凑数据结构。然而,CSA由于缺乏处理器感知设计而遭受巨大的处理器处罚,例如洪水和缓存/ TLB未命中。为了减轻这些处罚,我们为CSA提出了一种新的压实技术,称为后缀Trie收缩(STC)。 CSA的频繁访问后缀被转换为TRIE(例如,后缀TRIE),然后TRIE中的连接的连接节点重复地“收缩”到单个节点,这使得以处理器友好的方式实现轻量级顺序扫描。详细地,STC由两种收缩技术组成:固定长度路径收缩(FPC)和子树收缩(SC)。 FPC应用于Trie中有几个分支的部件,并且SC被施加到具有许多分支的部件。我们的实验结果表明,FPC以短图案查询的两个数量级和长图案查询的三次优于幼稚的CSA。随着TRI内部的分支的数量增加,SC逐渐变得优于CSA和FPC,用于短图案查询。最后,STC的延迟和吞吐量比CSA为TREC测试数据的延迟和吞吐量为72倍,以额外的7.1%空间开销。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号