Source: International conference on big data analytics

Deep Learning in the Domain of Near-Duplicate Document Detection

Abstract

The growing number of web users, driven by the popularity of the internet, has increased the volume of digital documents on the web, many of which are duplicates or near-duplicates. Identifying duplicate and near-duplicate documents in a large collection is a significant problem with widespread applications, so detecting and eliminating such documents is a pressing need. This paper proposes a technique for detecting near-duplicate documents on the web that has four main aspects. The first is the selection of important terms from a corpus of documents through a new correlation-based feature selection (CBFS) mechanism, which enhances the performance of the classifier. The second is the computation of similarity scores between each pair of documents in the corpus. The third is the combination of these similarity scores with the class label of each document pair to generate the feature vectors used to train the Multilayer ELM (a deep learning architecture) and other established classifiers. The fourth and final aspect is a heuristic method that ranks the near-duplicate documents by their similarity scores. Empirical results on DUC datasets show that the proposed approach using Multilayer ELM is highly effective compared with other state-of-the-art classifiers, including deep learning classifiers.
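The second through fourth aspects (pairwise similarity scores, per-pair feature vectors, and ranking) can be illustrated with a minimal sketch. The abstract does not say which similarity measures or ranking heuristic the paper uses, so everything here is an assumption made for illustration: cosine and Jaccard similarity as the two scores, the mean of the scores as the ranking key, and the helper names `pair_features` and `rank_pairs`. CBFS term selection and Multilayer ELM training are not shown.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between the term sets of two Counters."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def pair_features(docs):
    """Return {(i, j): [cosine, jaccard]} for every unordered document pair.

    Assumption: these two measures stand in for the paper's similarity scores.
    A class label per pair would be appended before training a classifier.
    """
    tf = [Counter(tokenize(d)) for d in docs]
    return {
        (i, j): [cosine_similarity(tf[i], tf[j]), jaccard_similarity(tf[i], tf[j])]
        for i, j in combinations(range(len(docs)), 2)
    }


def rank_pairs(features):
    """Sort pairs by mean similarity, highest first (an assumed heuristic)."""
    return sorted(
        features,
        key=lambda pair: sum(features[pair]) / len(features[pair]),
        reverse=True,
    )


docs = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply today",
]
features = pair_features(docs)
ranking = rank_pairs(features)
print(ranking[0], features[ranking[0]])
```

In this toy corpus the first two documents differ by a single word, so their pair ranks first, while pairs that share no terms score zero on both measures.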
