Proceedings of the 24th Australasian conference on Computer science

Efficiency of data structures for detecting overlaps in digital documents

Abstract

This paper analyses the efficiency of different data structures for detecting overlap in digital documents. Most existing approaches use some hash function to reduce the space requirements for their indices of chunks. Since a hash function can produce the same value for different chunks, false matches are possible. In this paper we propose an algorithm that can be used for eliminating those false matches. This algorithm uses a suffix tree structure, which is space consuming. We define a modified suffix tree that only considers chunks starting at the beginning of words and we show how the algorithm can work on this structure. We can alternatively reduce space requirements of a suffix tree by converting it to a directed acyclic graph. We show that suffix link information can be preserved in this new structure and the matching statistics algorithm still works with those modifications that we propose.
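
The false-match problem described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm. All function names, the deliberately weak hash, and the chunk size are assumptions made for illustration, and a naive scan stands in for the suffix tree. The sketch indexes word-aligned chunks by hash, filters the candidate matches by comparing the actual words, and computes word-level matching statistics, that is, for each word position in the query, the length of the longest run of words that also occurs in the document.

```python
from collections import defaultdict


def chunks(words, k):
    """Yield (start, chunk) for every chunk of k words starting at a word boundary."""
    for i in range(len(words) - k + 1):
        yield i, tuple(words[i:i + k])


def weak_hash(chunk, buckets=8):
    # Hypothetical hash with a tiny range, chosen so that collisions are easy to see.
    return sum(len(w) for w in chunk) % buckets


def build_index(words, k):
    """Map each hash value to the start positions of the chunks that produce it."""
    index = defaultdict(list)
    for i, chunk in chunks(words, k):
        index[weak_hash(chunk)].append(i)
    return index


def verified_matches(doc, query, k):
    """Return (candidates, true_matches) as lists of (doc_pos, query_pos) pairs."""
    index = build_index(doc, k)
    candidates = [
        (i, j)
        for j, chunk in chunks(query, k)
        for i in index.get(weak_hash(chunk), [])
    ]
    # Equal hashes only suggest a match; comparing the words removes false matches.
    true_matches = [(i, j) for i, j in candidates if doc[i:i + k] == query[j:j + k]]
    return candidates, true_matches


def matching_statistics(doc, query):
    """Naive word-level matching statistics (quadratic scan, no suffix tree)."""
    stats = []
    for j in range(len(query)):
        best = 0
        for i in range(len(doc)):
            length = 0
            while (i + length < len(doc) and j + length < len(query)
                   and doc[i + length] == query[j + length]):
                length += 1
            best = max(best, length)
        stats.append(best)
    return stats


doc = "the cat sat on the mat".split()
query = "a dog sat on the mat".split()
candidates, true_matches = verified_matches(doc, query, k=3)
print(len(candidates), true_matches)
print(matching_statistics(doc, query))
```

With this toy hash, most of the candidate pairs are collisions rather than real overlaps, which is why a verification step is needed. A suffix tree, as proposed in the paper, would answer the same queries without the quadratic scan used here.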