首页> 外文期刊>Information Systems >Practical compressed string dictionaries
【24h】

Practical compressed string dictionaries

机译:实用的压缩字符串字典

获取原文
获取原文并翻译 | 示例
       

摘要

The need to store and query a set of strings - a string dictionary - arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains. (C) 2015 Elsevier Ltd. All rights reserved.
机译:在许多类型的应用程序中,都需要存储和查询一组字符串-字符串字典。传统上,这些字符串字典只占总空间预算的一小部分(例如,在自然语言处理中或在为文本集合建立索引时),但Web引擎,语义Web(RDF)图,生物信息学以及许多其他应用中的最新应用大字符串字典,其大小占整个数据的很大一部分。在这些情况下,字符串字典管理本身就是可伸缩性问题。本文重点讨论在压缩的主内存空间中管理大型静态字符串字典的问题。我们将重新研究诸如散列,尝试和前端编码之类的字符串字典的经典解决方案,并通过使用压缩技术对其进行改进。我们还将介绍一些新颖的字符串字典表示形式,这些表示形式建立在简洁的数据结构和全文本索引的最新进展之上。在现实世界中的字符串字典形成的异构测试床上,对所有这些结构进行了经验比较。我们显示压缩的表示形式可能只使用原始字典大小的5%,而在几微秒内支持查找操作。在许多情况下,这些数字优于最新的时空权衡。此外,我们增强了一些表示形式,以提供基于前缀和子字符串的搜索,这些搜索也具有竞争力。结果表明,压缩字符串字典是不同领域中各种数据密集型应用程序的有用构建块。 (C)2015 Elsevier Ltd.保留所有权利。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号