Journal: 计算机技术与发展 (Computer Technology and Development)

Research on Anti-Spam Technology for Massive Document Collections Based on the Simhash Algorithm

     

Abstract

Motivated by the need to detect duplicated documents for anti-spam purposes on the Internet, this paper studies an anti-spam technique for massive document collections based on Simhash. With Simhash as the core duplicate-detection algorithm, the step that extracts document features is improved so that the meaning of a word is taken into account when its weight is measured. On top of the 64-bit Simhash signature of each document, the system provides duplicate-detection services along three dimensions (user, full text, and blacklist library), and documents can be compared for similarity at two granularities: full text and paragraph. Test data and analysis show that the system runs stably: each instance can store 100 million documents, and the average request latency stays at about 20 ms. Latency increases during peak hours but generally does not exceed 100 ms.
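
The abstract does not give implementation details, so the following is only a minimal sketch of a generic 64-bit Simhash signature with a Hamming-distance comparison. Everything specific in it is an assumption rather than the paper's method: the function names are hypothetical, tokens are hashed with the first 8 bytes of MD5, plain term frequency stands in for the paper's meaning-aware word weights, and the default threshold of 3 differing bits is an illustrative choice.

```python
import hashlib
import re
from collections import Counter
from typing import Dict


def token_hash64(token: str) -> int:
    """Stable 64-bit hash of a token (assumption: first 8 bytes of MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def term_frequency_weights(text: str) -> Dict[str, float]:
    """Placeholder weighting: raw term frequency.

    The paper also factors word meaning into the weight; that scheme is
    not described in the abstract, so it is not reproduced here.
    """
    return dict(Counter(re.findall(r"\w+", text.lower())))


def simhash64(weights: Dict[str, float]) -> int:
    """Fold weighted token hashes into a single 64-bit signature."""
    acc = [0.0] * 64
    for token, weight in weights.items():
        h = token_hash64(token)
        for bit in range(64):
            # A set bit votes +weight, a clear bit votes -weight.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    signature = 0
    for bit in range(64):
        if acc[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two signatures as near-duplicates if few bits differ.

    max_distance=3 is an assumed threshold, not a value from the paper.
    """
    return hamming_distance(a, b) <= max_distance
```

To compare at paragraph granularity, the same functions could be applied to each paragraph separately, for example `[simhash64(term_frequency_weights(p)) for p in text.split("\n\n")]`, with the resulting signatures compared pairwise.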
