Source: Journal of Information Science

Detecting near-duplicate text documents with a hybrid approach

Abstract

Near-duplicate data not only increase the cost of information processing in big data but also lengthen decision time. Detecting and eliminating nearly identical information is therefore vital to improving overall business decisions. The shingling algorithm has been widely used to identify near-duplicates in large-scale text data. It is based on occurrences of contiguous subsequences of tokens in two or more sets of information, such as documents. Consequently, even slight variations among documents degrade the algorithm's overall performance. To increase the efficiency and accuracy of the shingling algorithm, we propose a hybrid approach that embeds Jaro distance and statistical results of word usage frequency to fix ill-defined data. On a real text dataset, the proposed hybrid approach improved the shingling algorithm's accuracy by 27% on average and achieved above 90% common shingles.
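
To make the idea concrete, the following is a minimal sketch of word-level shingling combined with a Jaro-based token repair step. It is not the authors' implementation: the abstract does not say how Jaro distance and word frequency are combined, so the `repair` helper, the shingle width `w=3`, the `0.85` threshold, and the frequency tie-break are all illustrative assumptions. The code computes Jaro similarity; Jaro distance is often defined as one minus this value.

```python
from collections import Counter


def shingles(text, w=3):
    """Return the set of contiguous w-token subsequences (w-shingles)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard overlap of the two documents' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as
    # half a transposition each.
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(x != y for x, y in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def repair(token, freq, threshold=0.85):
    """Hypothetical repair step: map an unknown token to the most similar
    vocabulary word, preferring the more frequent word on ties."""
    if token in freq:
        return token
    best = max(freq, key=lambda word: (jaro(token, word), freq[word]))
    return best if jaro(token, best) >= threshold else token


# Usage: one transposed word breaks three of the seven 3-shingles.
original = "the quick brown fox jumps over the lazy dog"
noisy = "the quick brwon fox jumps over the lazy dog"
freq = Counter(original.split())  # stand-in for corpus word frequencies
repaired = " ".join(repair(tok, freq) for tok in noisy.split())

print(resemblance(original, noisy))
print(resemblance(original, repaired))
```

In this toy case a single misspelled token leaves only four of ten distinct shingles in common, which illustrates why small variations hurt plain shingling; after the repair step the two token sequences coincide.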