Venue: International Symposium on String Processing and Information Retrieval
SCM: Structural Contexts Model for Improving Compression in Semistructured Text Databases

Abstract

We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate semiadaptive model to compress the text that lies inside each different structure type (e.g., each distinct XML tag). The intuition is that the texts belonging to a given structure type should have similar distributions, and that these should differ from the distributions of other structure types. We test the idea using word-based Huffman coding, which is the standard for compressing large natural language text databases, and show that our compression method obtains significant improvements in compression ratios. We also analyze the possibility that storing separate models may not pay off if the distributions of different structure types are not different enough, and present a heuristic that merges models with the aim of minimizing the total size of the compressed database. This merging gives an additional improvement over the plain technique. A comparison against existing prototypes shows that our method is a competitive choice for compressed text databases. Finally, we show how to apply SCM over text chunks, which allows the word frequencies to be adjusted as they change across the text collection.
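
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy, well-formed XML string, charges a crude dictionary cost of 8 bits per character for each stored vocabulary word, and uses a plain greedy pairwise merge in place of the paper's heuristic, whose details the abstract does not give. All function names (`huffman_code_lengths`, `words_by_tag`, `total_cost`, `greedy_merge`) are hypothetical.

```python
import heapq
import re
from collections import Counter, defaultdict
from itertools import count


def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {word: 1 for word in freqs}  # a lone symbol still needs one bit
    tie = count()  # unique tie-breaker so the heap never compares lists
    heap = [(f, next(tie), [word]) for word, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for word in group1 + group2:
            lengths[word] += 1  # every merge adds one bit to the merged words
        heapq.heappush(heap, (f1 + f2, next(tie), group1 + group2))
    return lengths


def total_cost(freqs):
    """Bits for the Huffman-coded text plus an assumed cost for storing the model."""
    lengths = huffman_code_lengths(freqs)
    text_bits = sum(freqs[w] * lengths[w] for w in freqs)
    model_bits = sum(8 * (len(w) + 1) for w in freqs)  # assumption: raw vocabulary
    return text_bits + model_bits


def words_by_tag(doc):
    """Count words per innermost enclosing tag (one semiadaptive model per tag)."""
    per_tag = defaultdict(Counter)
    stack = ["#root"]
    for closing, tag, text in re.findall(r"<(/?)(\w+)>|([^<]+)", doc):
        if tag and closing:
            stack.pop()
        elif tag:
            stack.append(tag)
        elif text.split():
            per_tag[stack[-1]].update(text.split())
    return per_tag


def greedy_merge(models):
    """Repeatedly merge the pair of models that saves the most bits, if any."""
    models = {(tag,): Counter(freqs) for tag, freqs in models.items()}
    while True:
        best = None
        keys = list(models)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                merged = models[a] + models[b]
                gain = (total_cost(models[a]) + total_cost(models[b])
                        - total_cost(merged))
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, a, b, merged)
        if best is None:
            return models
        _, a, b, merged = best
        del models[a], models[b]
        models[a + b] = merged


if __name__ == "__main__":
    doc = ("<doc><title>text compression</title>"
           "<body>text text compression models models</body>"
           "<date>2003 10 08</date></doc>")
    separate = words_by_tag(doc)
    merged = greedy_merge(separate)
    print("separate models:", sum(total_cost(f) for f in separate.values()), "bits")
    print("after merging:  ", sum(total_cost(f) for f in merged.values()), "bits")
    print("model groups:   ", sorted(merged))
```

On a document this small the model storage cost dominates, so the sketch mainly shows the mechanics: one frequency table per tag, a cost that includes both coded text and model, and merging only when it lowers that total.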