Journal of Software

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance



Abstract

Redundant pages increase a search engine's indexing and retrieval burden on one hand and degrade the user experience on the other, so near-duplicate pages need to be detected and removed. Most current near-duplicate detection algorithms depend on web page content extraction, but extraction is costly and difficult, and extracting page content correctly is becoming harder still. This paper addresses these issues as follows: it defines the largest number of common characters as the complementary notion of edit distance; it builds the feature string of a web page from the Chinese character that precedes each period in lightly processed text; and it uses the largest number of common characters to compute an overlap factor between the feature strings of two pages. In this way, the paper achieves near-duplicate detection in high-noise environments while avoiding web content extraction. Experiments show the algorithm to be effective: recall reaches 96.7% and precision reaches 97.8%.
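The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of longest-common-subsequence length as the "largest number of common characters" (the complement of edit distance), and the normalization by the shorter feature string are all assumptions made for the sake of the example.

```python
def feature_string(text: str) -> str:
    """Build a page's feature string from the character immediately
    preceding each Chinese period in lightly processed text."""
    return "".join(
        text[i - 1] for i, ch in enumerate(text) if ch == "\u3002" and i > 0
    )

def lcs_length(a: str, b: str) -> int:
    """Longest-common-subsequence length, read here as the 'largest
    number of common characters' complementary to edit distance."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def overlap_factor(page_a: str, page_b: str) -> float:
    """Overlap factor between two pages' feature strings; pages whose
    factor exceeds a threshold (value not given in the abstract) would
    be flagged as near-duplicates."""
    fa, fb = feature_string(page_a), feature_string(page_b)
    if not fa or not fb:
        return 0.0
    return lcs_length(fa, fb) / min(len(fa), len(fb))
```

Because only the characters adjacent to sentence-ending periods are kept, the comparison ignores most of the noise (navigation, ads, templates) that full-content extraction would otherwise have to strip out.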


