Journal: 《计算机工程》 (Computer Engineering)

Incremental Frequent Pattern Discovery Algorithm for Large-Scale Corpora

Abstract

This paper presents a memory-based incremental frequent pattern discovery algorithm for large-scale corpora. The algorithm extracts strings from local regions of the corpus and counts their frequencies, then prunes the strings whose local frequency is relatively low. It then uses a multi-pattern string matching algorithm to count the remaining, locally high-frequency strings across the whole corpus, and keeps as frequent patterns those whose frequency exceeds the threshold. Experimental results show that the algorithm has lower space and time complexity, and that its peak memory consumption during frequent pattern discovery is about 20% of that of the suffix-array-based frequent pattern discovery algorithm.
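The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how strings are extracted, how local regions are delimited, what the pruning criterion is, or which multi-pattern matcher is used. The sketch assumes fixed-size blocks as local regions, all substrings up to `max_len` characters as the extracted strings, a simple count cutoff `local_min` for pruning, and Aho-Corasick as the multi-pattern matcher. All function and parameter names are hypothetical.

```python
from collections import Counter, deque


def local_candidates(block, max_len, local_min):
    """Count all substrings of length 1..max_len in one block and keep
    those seen at least local_min times (assumed pruning rule)."""
    counts = Counter(
        block[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(block) - n + 1)
    )
    return {s for s, c in counts.items() if c >= local_min}


def build_automaton(patterns):
    """Build an Aho-Corasick automaton (assumed matcher): trie
    transitions, failure links, and the patterns ending at each node."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(p)
    # Breadth-first pass: depth-1 nodes keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_all(text, patterns):
    """Count occurrences (overlaps included) of every pattern in one pass."""
    goto, fail, out = build_automaton(patterns)
    counts = Counter()
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        counts.update(out[node])
    return counts


def frequent_patterns(corpus, block_size, max_len, local_min, threshold):
    """Collect locally frequent strings block by block, then count them
    over the whole corpus and keep those with frequency > threshold."""
    candidates = set()
    for start in range(0, len(corpus), block_size):
        block = corpus[start:start + block_size]
        candidates |= local_candidates(block, max_len, local_min)
    totals = count_all(corpus, candidates)
    return {s: c for s, c in totals.items() if c > threshold}
```

For example, `frequent_patterns("ababababcdcd", block_size=6, max_len=2, local_min=2, threshold=2)` returns `{'a': 4, 'b': 4, 'ab': 4, 'ba': 3}`. The strings `c`, `d`, and `cd` survive local pruning in the second block but occur only twice globally, so they fall below the threshold. The count for `ba` includes one occurrence that straddles the block boundary, which only the whole-corpus matching pass can see. In this sketch only one block's substring counts are held in memory at a time, alongside the candidate set, which is the kind of saving the abstract attributes to local pruning.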
