Journal: Bioinformatics

Identifying duplicate content using statistically improbable phrases

Abstract

Motivation: Document similarity metrics such as PubMed's 'Find related articles' feature, which have been used primarily to identify studies on similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication's total text. Extending searches to include text archived by online search engines would drastically increase the amount of text available for comparison. For large-scale studies, submitting short phrases enclosed in quotation marks to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) to assist in identifying duplicate content.

Results: When applied to MEDLINE citations, this method substantially improves upon previous algorithms in the detection of duplicate citations, yielding a precision of 78.9% (versus 50.3% for eTBLAST) and a recall of 99.6% (versus 99.8% for eTBLAST).

Availability: Similar citations identified by this work are freely accessible in the Déjà vu database, under the SIP discovery method category.
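The abstract does not say how SIPs are scored or how long the phrases are, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a phrase counts as "improbable" when few texts in a background collection contain it (plain document frequency), and it uses three-word phrases purely for illustration. The function names `improbable_phrases` and `quoted_queries` are hypothetical, and no search engine is actually contacted; the sketch only builds the quoted query strings.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def improbable_phrases(document, background, n=3, top_k=2):
    """Pick the top_k phrases of `document` that are rarest in `background`.

    Rarity is measured as document frequency (an illustrative assumption):
    the number of background texts that contain the phrase.
    """
    doc_freq = Counter()
    for text in background:
        # Use a set so each background text counts a phrase at most once.
        doc_freq.update(set(ngrams(tokenize(text), n)))

    # Deduplicate while keeping first-occurrence order.
    candidates = list(dict.fromkeys(ngrams(tokenize(document), n)))
    # sorted() is stable, so equally rare phrases stay in document order.
    ranked = sorted(candidates, key=lambda phrase: doc_freq[phrase])
    return ranked[:top_k]


def quoted_queries(phrases):
    """Wrap each phrase in double quotes for an exact-match search."""
    return ['"{}"'.format(phrase) for phrase in phrases]


if __name__ == "__main__":
    # Toy data, invented for this sketch.
    background = [
        "The patients were randomly assigned to two groups",
        "The patients were followed for two years",
    ]
    document = (
        "The patients were randomly assigned to receive "
        "intravenous zoledronic acid"
    )
    for query in quoted_queries(improbable_phrases(document, background)):
        print(query)
```

Common openings such as "the patients were" appear in every background text and rank last, while phrases absent from the background rank first. A real study would need a far larger background collection and a principled choice of phrase length and cutoff.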
