首页> 中文期刊>计算机科学 >基于大规模语料划分的频繁模式查找算法

基于大规模语料划分的频繁模式查找算法

     

摘要

频繁模式查找对新词识别、网络舆情监测、生物信息序列检测等领域有很高的应用价值.为处理规模远超出内存的语料,提出了一种实用的频繁模式查找算法.先将语料按后缀首字符划分为多个集合,通过逐条扫描集合数据,搜索出最大化最长公共前缀区间(MLCPI)来完成查找.另外在此基础上提出逐层归并算法,实现查找的同时归并子串.由于进行查找时无需将全部数据导入内存,因此资源消耗较少;各集合间频繁模式查找互不干扰,可采用并行处理加快运行速度.使用4.61G纯文本语料进行了试验,结果表明其内存消耗小于30M,查找速度最快达1.08M/s,能高效地进行子串归并.%Frequent patterns finding is useful for some areas,such as new word recognition,internet public opinion monitoring, bio-information series detection, etc. Considering that corpus size is much larger than memory capacity,we put forward a pragmatic algorithm to find frequent patterns. Firstly, corpus was partitioned into multiple sets based on first character of suffix,and then a concept of maximized longest common prefix interval (MLCPI) was introduced,and by means of searching it while scanning data in sets item by item, we accomplished the finding task. Besides, we proposed hierarchical reduction algorithm (HRA) to reduce sub-string during the finding process on that basis. There is no need to import all data into memory, so it will decrease resource consumption. Moreover,it is found that frequent patterns a-mong sets do not interfere with each other, which will improve the speed while processing paralleled. We used 4. 61 gigabytes plain text as experiment data. The results show that the memory usage is lower than 30 megabytes, and the speed is up to 1.08 megabytes per seconds,and it is able to reduce sub-string efficiently.

著录项

相似文献

  • 中文文献
  • 外文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号