...
Journal of Software

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance



Abstract

On one hand, redundant pages increase the search burden of a search engine; on the other hand, they lower the user experience, so it is necessary to deal with such pages. To achieve near-replica detection, most current algorithms depend on web page content extraction. However, content extraction is costly and difficult, and it is becoming ever harder to extract web content properly. This paper addresses these issues in the following ways: it defines the largest number of common characters by taking the antisense concept of edit distance; it proposes building the feature string of a web page from the Chinese character preceding each period in the lightly processed text; and it uses the largest number of common characters to calculate the overlap factor between the feature strings of web pages. In this way, the paper aims to achieve near-replica detection in a high-noise environment while avoiding extraction of web page content. The algorithm is shown to be efficient in our experiments: the recall rate of web pages reaches 96.7%, and the precision rate reaches 97.8%.
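The abstract does not give the exact formulas, so the following Python sketch only illustrates one plausible reading of the three steps: the feature string is formed from the Chinese character immediately before each full stop, the "largest number of common characters" is taken here as the longest common subsequence length (the counting counterpart of edit distance), and the overlap factor normalises that count by the longer feature string. The function names, the preprocessing regex, the normalisation, and the 0.8 threshold are illustrative assumptions, not details taken from the paper.

```python
import re

def feature_string(page_text: str) -> str:
    """Build a feature string by taking the Chinese character that
    immediately precedes each full stop (。) in lightly processed text.
    The cleaning step below is an assumed stand-in for the paper's
    "simple processing"."""
    # Keep only CJK characters and full stops.
    cleaned = re.sub(r"[^\u4e00-\u9fff。]", "", page_text)
    # Collect the character right before each full stop.
    return "".join(m.group(1) for m in re.finditer(r"([\u4e00-\u9fff])。", cleaned))

def lcs_length(a: str, b: str) -> int:
    """Largest number of common characters, interpreted here as the
    longest common subsequence length -- the "antisense" of edit
    distance in that it counts matches instead of edit operations."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def overlap_factor(a: str, b: str) -> float:
    """Overlap factor between two feature strings; normalising by the
    longer string is an assumption and keeps the value in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def is_near_replica(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as near-replicas when their feature strings
    overlap above an illustrative threshold (not from the paper)."""
    return overlap_factor(feature_string(page_a), feature_string(page_b)) >= threshold
```

A symmetric normalisation such as 2·LCS / (|a| + |b|) would be another reasonable guess for the overlap factor; the choice only affects how the detection threshold is calibrated.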

