Word-Based Self-Indexes for Natural Language Text

Abstract

The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
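The idea of treating a text as a sequence of words rather than characters, and of counting phrase occurrences directly on that sequence, can be illustrated with a small sketch. This is not the authors' compressed self-index: it is a plain, uncompressed suffix array built over integer word identifiers, and the names `tokenize` and `WordSuffixArray` are hypothetical. The whitespace tokenizer with case folding is only a stand-in for the normalization step (case folding, stemming, stopword removal) that the abstract says is kept separate from the searchable sequence.

```python
def tokenize(text):
    """Case-fold and split on whitespace (a stand-in for real normalization)."""
    return text.lower().split()


class WordSuffixArray:
    """Uncompressed word-level suffix array; illustrative only."""

    def __init__(self, text):
        words = tokenize(text)
        # The vocabulary plays the role of a (large) alphabet of integer symbols.
        self.vocab = {w: i for i, w in enumerate(sorted(set(words)))}
        self.seq = [self.vocab[w] for w in words]
        # Naive construction: sort word positions by the suffix starting there.
        self.sa = sorted(range(len(self.seq)), key=lambda i: self.seq[i:])

    def _bound(self, ids, upper):
        """Binary search for the first suffix whose m-word prefix is >= ids
        (or > ids when upper is True)."""
        lo, hi, m = 0, len(self.sa), len(ids)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = self.seq[self.sa[mid]:self.sa[mid] + m]
            if prefix < ids or (upper and prefix == ids):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def _range(self, phrase):
        words = tokenize(phrase)
        if not words or any(w not in self.vocab for w in words):
            return 0, 0  # empty query or unknown word: no occurrences
        ids = [self.vocab[w] for w in words]
        return self._bound(ids, False), self._bound(ids, True)

    def count(self, phrase):
        """Number of occurrences, found without scanning the text."""
        lo, hi = self._range(phrase)
        return hi - lo

    def locate(self, phrase):
        """Word positions (0-based) where the phrase starts."""
        lo, hi = self._range(phrase)
        return sorted(self.sa[lo:hi])


if __name__ == "__main__":
    index = WordSuffixArray("to be or not to be that is the question to be")
    print(index.count("to be"))   # 3
    print(index.locate("to be"))  # [0, 4, 10]
    print(index.count("To Be"))   # 3, because queries are case-folded too
```

In this sketch a phrase query is a single binary search over the suffix array, so counting needs no list intersection, which is the contrast with inverted indexes that the abstract draws. The space savings and the ability to replace the text come from the compressed structures described in the article, which this example does not attempt to reproduce.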

Bibliographic details

  • Source
    ACM Transactions on Information Systems | 2012, Issue 1 | pp. 1.1-1.34 | 34 pages
  • Author affiliations

    Department of Computer Science, University of A Coruna, Facultade de Informatica, Campus de Elvina s/n, 15071 A Coruna, Spain;

    Department of Computer Science, University of A Coruna, Facultade de Informatica, Campus de Elvina s/n, 15071 A Coruna, Spain;

    Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile;

    School of Computer Science, University of Waterloo, Waterloo, ON, Canada;

    Department of Computer Science, University of A Coruna, Facultade de Informatica, Campus de Elvina s/n, 15071 A Coruna, Spain;

    Department of Computer Science, University of A Coruna, Facultade de Informatica, Campus de Elvina s/n, 15071 A Coruna, Spain;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Text language: English
  • Keywords

    self-indexes; compressed data structures; inverted indexes;