Journal of Software

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance



Abstract

Redundant pages increase a search engine's indexing and retrieval burden on one hand and degrade the user experience on the other, so near-duplicate pages need to be detected and removed. Most current near-duplicate detection algorithms depend on web page content extraction, but extraction is costly and difficult, and extracting page content correctly is becoming harder still. This paper addresses these issues as follows: it defines the largest number of common characters as the complementary notion of edit distance; it builds the feature string of a web page from the Chinese character that precedes each period in lightly processed text; and it uses the largest number of common characters to compute an overlap factor between the feature strings of two pages. In this way, the paper achieves near-duplicate detection in high-noise environments while avoiding web content extraction. Experiments show the algorithm to be effective: recall reaches 96.7% and precision reaches 97.8%.
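The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of longest-common-subsequence length as the "largest number of common characters" (the complement of edit distance), and the normalization by the shorter feature string are all assumptions made for the sake of the example.

```python
def feature_string(text: str) -> str:
    """Build a page's feature string from the character immediately
    preceding each Chinese period in lightly processed text."""
    return "".join(
        text[i - 1] for i, ch in enumerate(text) if ch == "\u3002" and i > 0
    )

def lcs_length(a: str, b: str) -> int:
    """Longest-common-subsequence length, read here as the 'largest
    number of common characters' complementary to edit distance."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def overlap_factor(page_a: str, page_b: str) -> float:
    """Overlap factor between two pages' feature strings; pages whose
    factor exceeds a threshold (value not given in the abstract) would
    be flagged as near-duplicates."""
    fa, fb = feature_string(page_a), feature_string(page_b)
    if not fa or not fb:
        return 0.0
    return lcs_length(fa, fb) / min(len(fa), len(fb))
```

Because only the characters adjacent to sentence-ending periods are kept, the comparison ignores most of the noise (navigation, ads, templates) that full-content extraction would otherwise have to strip out.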


