Venue: International World Wide Web Conference (WWW 09)

Detecting the Origin of Text Segments Efficiently

Abstract

In the origin detection problem, an algorithm is given a set S of documents, ordered by creation time, and a query document D. For every consecutive sequence of k alphanumeric terms in D, it must output the earliest document in S in which the sequence appeared (if such a document exists). Algorithms for the origin detection problem can, for example, be used to detect the "origin" of text segments in D and thus to detect novel content in D. They can also find the document from which the author of D has copied the most (or show that D is mostly original).

We propose novel algorithms for this problem and evaluate them together with a large number of previously published algorithms. Our results show that (1) detecting the origin of text segments efficiently can be done with very high accuracy even when the space used is less than 1% of the size of the documents in S, (2) the precision degrades smoothly with the amount of available space, and (3) various estimation techniques can be used to increase the performance of the algorithms.
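The problem statement can be made concrete with a minimal exact baseline. This sketch is not one of the algorithms proposed or evaluated in the paper: it stores every k-term sequence of S, so it ignores the space constraint that the paper's results are about. The function names (`terms`, `shingles`, `build_index`, `detect_origins`, `most_copied_source`) and the choice to lowercase terms are assumptions made for illustration only.

```python
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple

Shingle = Tuple[str, ...]


def terms(text: str) -> List[str]:
    """Split text into alphanumeric terms (lowercasing is an assumption)."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def shingles(text: str, k: int) -> List[Shingle]:
    """Return every consecutive sequence of k terms, in document order."""
    t = terms(text)
    return [tuple(t[i:i + k]) for i in range(len(t) - k + 1)]


def build_index(docs: List[str], k: int) -> Dict[Shingle, int]:
    """Map each k-term sequence to the earliest document containing it.

    `docs` must be ordered by creation time, so the first document that
    inserts a sequence is its origin; later documents never overwrite it.
    """
    index: Dict[Shingle, int] = {}
    for doc_id, text in enumerate(docs):
        for s in shingles(text, k):
            index.setdefault(s, doc_id)
    return index


def detect_origins(index: Dict[Shingle, int], query: str, k: int) -> List[Optional[int]]:
    """For each k-term sequence in the query, return its origin or None."""
    return [index.get(s) for s in shingles(query, k)]


def most_copied_source(origins: List[Optional[int]]) -> Optional[int]:
    """Return the document that is the origin of the most sequences, if any."""
    counts = Counter(o for o in origins if o is not None)
    return counts.most_common(1)[0][0] if counts else None


# Toy data, invented for this example.
S = [
    "the quick brown fox jumps",
    "a quick brown fox sleeps",
    "lazy dogs sleep all day",
]
D = "The quick brown fox sleeps all day"

origins = detect_origins(build_index(S, k=3), D, k=3)
print(origins)                      # [0, 0, 1, None, None]
print(most_copied_source(origins))  # 0
```

Sequences mapped to `None` appear in no document of S and therefore count as novel content in D. An approach that fits in a small fraction of the size of S would have to keep only a subset of these sequences, which is where accuracy and available space trade off.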