...
Journal of Software

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance



Abstract

On one hand, redundant pages increase the search burden of a search engine; on the other hand, they lower the user experience, so it is necessary to deal with such pages. To achieve near-replica detection, most current algorithms depend on web page content extraction. However, content extraction is costly and difficult, and it is becoming ever harder to extract web content properly. This paper addresses these issues in the following ways: it defines the largest number of common characters by taking the antisense concept of edit distance; it proposes building the feature string of a web page from the Chinese character preceding each period in the lightly processed text; and it uses the largest number of common characters to calculate the overlap factor between the feature strings of web pages. In this way, the paper aims to achieve near-replica detection in a high-noise environment while avoiding extraction of web page content. The algorithm is shown to be efficient in our experiments: the recall rate of web pages reaches 96.7%, and the precision rate reaches 97.8%.
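The abstract does not give the exact formulas, so the following Python sketch only illustrates one plausible reading of the three steps: the feature string is formed from the Chinese character immediately before each full stop, the "largest number of common characters" is taken here as the longest common subsequence length (the counting counterpart of edit distance), and the overlap factor normalises that count by the longer feature string. The function names, the preprocessing regex, the normalisation, and the 0.8 threshold are illustrative assumptions, not details taken from the paper.

```python
import re

def feature_string(page_text: str) -> str:
    """Build a feature string by taking the Chinese character that
    immediately precedes each full stop (。) in lightly processed text.
    The cleaning step below is an assumed stand-in for the paper's
    "simple processing"."""
    # Keep only CJK characters and full stops.
    cleaned = re.sub(r"[^\u4e00-\u9fff。]", "", page_text)
    # Collect the character right before each full stop.
    return "".join(m.group(1) for m in re.finditer(r"([\u4e00-\u9fff])。", cleaned))

def lcs_length(a: str, b: str) -> int:
    """Largest number of common characters, interpreted here as the
    longest common subsequence length -- the "antisense" of edit
    distance in that it counts matches instead of edit operations."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def overlap_factor(a: str, b: str) -> float:
    """Overlap factor between two feature strings; normalising by the
    longer string is an assumption and keeps the value in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def is_near_replica(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as near-replicas when their feature strings
    overlap above an illustrative threshold (not from the paper)."""
    return overlap_factor(feature_string(page_a), feature_string(page_b)) >= threshold
```

A symmetric normalisation such as 2·LCS / (|a| + |b|) would be another reasonable guess for the overlap factor; the choice only affects how the detection threshold is calibrated.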

