首页> 外文OA文献 >Scalable succinct indexing for large text collections
【2h】

Scalable succinct indexing for large text collections

机译:可扩展的简洁索引,适用于大型文本集合

摘要

Self-indexes save space by emulating operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide full-text search functionality which is traditionally provided by suffix trees and suffix arrays for a given text, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search functionality over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to support larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects related to the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve the index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems.
机译:自索引通过使用对位向量的基本操作来模拟传统数据结构的操作来节省空间。简洁的文本索引提供了全文搜索功能,该功能通常由给定文本的后缀树和后缀数组提供,同时使用的空间等于文本的压缩表示形式。因此,简洁的文本索引可以在输入上提供全文搜索功能,其功能远大于使用传统的未压缩的基于后缀的数据结构所能提供的功能。诸如信息检索之类的字段涉及大量文本集合的处理。但是,在构造过程中,简洁文本索引在内存中的空间需求已阻碍了它们被大型文本集合采用。支持较大数据集的一种有前途的方法是避免通过使用替代索引表示来构造完整的后缀数组。本文着重于与文本索引可扩展到更大数据集有关的几个方面。我们在所有简洁文本索引算法的核心构建块中确定了实际的改进,并随后提高了对大型数据集的索引性能。我们使用几种标准的文本集来评估我们的发现,并证明:(1)我们改进的索引技术的实际应用; (2)对于各种排在前k位的文档检索问题,简洁的文本索引是倒排索引的一种实用替代方案。

著录项

  • 作者

    Petri M;

  • 作者单位
  • 年度 2013
  • 总页数
  • 原文格式 PDF
  • 正文语种
  • 中图分类

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号