Detecting near-duplicate text documents with a hybrid approach

Cihan Varol; Sairam Hari

Abstract

Near-duplicate data not only increase the cost of information processing in big data but also lengthen decision time. Detecting and eliminating nearly identical information is therefore vital to improving overall business decisions. The shingling algorithm has been widely used to identify near-duplicates in large-scale text data. It is based on occurrences of contiguous subsequences of tokens in two or more sets of information, such as documents. As a result, even slight variation among documents degrades the algorithm's overall performance. To increase the efficiency and accuracy of the shingling algorithm, we propose a hybrid approach that embeds Jaro distance and statistics on word usage frequency to fix ill-defined data. On a real text dataset, the proposed hybrid approach improved the shingling algorithm's accuracy by 27% on average and achieved above 90% common shingles.
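
The abstract gives no implementation details, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes word-level shingles of size k = 3, Jaccard overlap as the resemblance measure, and a simple correction step that replaces each unknown token with the known word of highest Jaro similarity, breaking ties by usage frequency. The function names, the 0.85 threshold, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter


def build_word_freq(corpus):
    """Count word usage over a list of reference documents (assumed clean)."""
    return Counter(tok for doc in corpus for tok in doc.lower().split())


def shingles(text, k=3):
    """Return the set of contiguous k-token subsequences of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard overlap of the two documents' shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(x != y for x, y in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def normalize(text, word_freq, threshold=0.85):
    """Replace unknown tokens with the closest known word (hypothetical rule)."""
    fixed = []
    for tok in text.lower().split():
        if tok in word_freq:
            fixed.append(tok)
            continue
        # Highest Jaro similarity wins; usage frequency breaks ties.
        best = max(word_freq, key=lambda w: (jaro(tok, w), word_freq[w]))
        fixed.append(best if jaro(tok, best) >= threshold else tok)
    return " ".join(fixed)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brwon fox jumps over the lazy dog"
    freq = build_word_freq([doc_a])
    print(resemblance(doc_a, doc_b))                   # before correction
    print(resemblance(doc_a, normalize(doc_b, freq)))  # after correction
```

In this toy case a single transposed pair of letters ("brwon") changes three of the seven shingles, which is the sensitivity the abstract describes; correcting the token before shingling restores the overlap.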

Bibliographic details

Source: Journal of Information Science, 2015, Issue 4, pp. 405-414 (10 pages)
Authors: Cihan Varol; Sairam Hari
Affiliations: Computer Science Department, Sam Houston State University, Huntsville, Texas, USA; Sam Houston State University, Texas, USA
Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords: data cleansing; data quality; duplicate detection; Jaro distance; shingling

Similar literature

1. Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash [J]. N. Rezaeian, G.M. Novikova. Procedia Computer Science, 2017, Issue 1.
2. Detecting near-duplicate documents using sentence-level features and supervised learning [J]. Yung-Shen Lin, Ting-Yi Liao, Shie-Jue Lee. Expert Systems with Applications, 2013, Issue 5.
3. A hybrid approach for text document clustering using Jaya optimization algorithm [J]. Thirumoorthy Karpagalingam, Muneeswaran Karuppaiah. Expert Systems with Applications, 2021, Sep. issue.
4. A Novel Approach for Detecting Near-Duplicate Web Documents by Considering Images, Text, Size of the Document and Domain [C]. M. Bhavani, V. A. Narayana, Gaddameedi Sreevani. International Conference on Communications and Cyber Physical Engineering, 2020.
5. A hybrid approach to retrieving Web documents and semantic Web data [D]. Immaneni, Trivikram. 2007.
6. Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents [O]. Stéphane M Meystre, Julien Thibault, Shuying Shen. 2010.
7. Detecting near-duplicates in large-scale short text databases [O]. Caichun Gong, Yulan Huang, Xueqi Cheng, Shuo Bai. 2015.
