Detecting near-duplicate documents using sentence-level features and supervised learning

Yung-Shen Lin; Ting-Yi Liao; Shie-Jue Lee

Abstract

We present a novel method for detecting near-duplicates in a large collection of documents. The method has three major parts: feature selection, similarity measure, and discriminant derivation. To find near-duplicates of an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected as the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied to compute the similarity degree between the input document and each document in the given collection. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set; this function is then used to determine whether a document is a near-duplicate of the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. In addition, learning the discriminant function with an SVM avoids the trial-and-error effort required by conventional methods. Experimental results show that our method is effective in near-duplicate document detection.
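The first two parts of the pipeline described in the abstract (sentence-level feature selection and a similarity degree between two documents) can be sketched roughly as follows. The abstract does not state the term-weighting scheme or the similarity measure, so this sketch assumes term frequency times inverse sentence frequency for the weights and Jaccard overlap of feature sets for the similarity; the function names, the `top_k` parameter, and the naive sentence splitter are all hypothetical. The SVM step is not implemented here: in the described method, a learned discriminant function would take the similarity degree as input in place of a hand-picked threshold.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption for illustration)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def sentence_features(text, top_k=3):
    """Turn a document into a set of sentence-level features.

    Each feature is a frozenset holding the top_k heaviest terms of one
    sentence. The weight is term frequency times a smoothed inverse
    sentence frequency, an assumed stand-in for the paper's weighting.
    """
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    n = len(sentences)
    # Number of sentences in this document that contain each term.
    sent_freq = Counter(t for s in sentences for t in set(s))
    features = set()
    for tokens in sentences:
        tf = Counter(tokens)
        weights = {t: c * (1.0 + math.log(n / sent_freq[t])) for t, c in tf.items()}
        # Heaviest terms first; ties broken alphabetically for determinism.
        top = sorted(weights, key=lambda t: (-weights[t], t))[:top_k]
        features.add(frozenset(top))
    return features


def similarity(doc_a, doc_b, top_k=3):
    """Similarity degree in [0, 1]: Jaccard overlap of the two feature sets."""
    fa = sentence_features(doc_a, top_k)
    fb = sentence_features(doc_b, top_k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

Under these assumptions, a document compared with itself scores 1.0, two documents that share no terms score 0.0, and a lightly edited copy falls in between. Pairs of such scores with known near-duplicate labels would form the training pattern set from which the discriminant function is learned.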

Bibliographic details

Source
Expert Systems with Applications | 2013, Issue 5 | pp. 1467-1476 | 10 pages
Authors
Yung-Shen Lin; Ting-Yi Liao; Shie-Jue Lee
Author affiliation
Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan (all three authors)
Original format
PDF
Language
English
Keywords
near-duplicate; feature selection; similarity function; training data; support vector machine; discriminant function