A Frequent Pattern Finding Algorithm Based on Large-Scale Corpus Partitioning

Ding Xiyuan; Huang Heyan; Zhang Haijun; Wang Shumei


Abstract

Frequent pattern finding is valuable in areas such as new word recognition, online public opinion monitoring, and biological sequence detection. To handle corpora far larger than available memory, we propose a practical frequent pattern finding algorithm. The corpus is first partitioned into multiple sets according to the first character of each suffix. Each set is then scanned item by item, and the search is completed by locating maximized longest common prefix intervals (MLCPI). On this basis, we further propose a hierarchical reduction algorithm (HRA) that merges substrings while the search is running. Because the search never needs to load all of the data into memory, resource consumption is low. Because frequent pattern finding in one set does not interfere with that in any other set, the sets can be processed in parallel to increase speed. Experiments on a 4.61 GB plain-text corpus show memory consumption below 30 MB and a peak finding speed of 1.08 MB/s, with efficient substring reduction.
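The pipeline outlined in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: the abstract does not define MLCPI or HRA precisely, so the code below substitutes the standard stack-based enumeration of longest-common-prefix intervals over sorted suffixes, followed by a simple filter that drops any pattern contained in a longer pattern of equal frequency. All function names and the `min_freq` and `max_len` parameters are hypothetical, and the on-disk partition files that make the real algorithm memory-efficient are replaced here by in-memory lists.

```python
from collections import defaultdict


def partition_by_first_char(text):
    """Group suffix start offsets by the first character of each suffix."""
    buckets = defaultdict(list)
    for offset, ch in enumerate(text):
        buckets[ch].append(offset)
    return buckets


def lcp_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def patterns_in_bucket(text, offsets, min_freq, max_len):
    """Scan one bucket's sorted suffixes and report LCP intervals.

    Each interval of h shared leading characters covering `freq`
    adjacent suffixes yields one pattern of length h with that frequency.
    """
    # Assumption: suffixes are truncated to max_len to bound the work.
    suffixes = sorted(text[i:i + max_len] for i in offsets)
    n = len(suffixes)
    found = {}
    stack = []  # open intervals as (lcp_value, left_index)
    for k in range(1, n + 1):
        # LCP between neighbours k-1 and k; 0 at the end flushes the stack.
        cur = lcp_len(suffixes[k - 1], suffixes[k]) if k < n else 0
        left = k - 1
        while stack and stack[-1][0] > cur:
            h, left = stack.pop()
            freq = k - left  # suffixes left..k-1 share a prefix of length h
            if freq >= min_freq:
                found[suffixes[left][:h]] = freq
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))
    return found


def find_frequent_patterns(text, min_freq=2, max_len=8):
    """Run the bucket scan independently for every first character."""
    result = {}
    for offsets in partition_by_first_char(text).values():
        result.update(patterns_in_bucket(text, offsets, min_freq, max_len))
    return result


def reduce_substrings(patterns):
    """Drop a pattern if a longer pattern with equal frequency contains it."""
    return {
        p: f for p, f in patterns.items()
        if not any(f == g and len(q) > len(p) and p in q
                   for q, g in patterns.items())
    }


if __name__ == "__main__":
    raw = find_frequent_patterns("abcabcabd", min_freq=2, max_len=4)
    print(raw)
    print(reduce_substrings(raw))
```

Because each bucket is scanned without reference to any other, the loop in `find_frequent_patterns` could be handed to a pool of worker processes, which mirrors the parallelism the abstract describes. The reduction step here runs after the scan for simplicity, whereas the abstract states that the paper's algorithm merges substrings during the search.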

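Bibliographic Information

Source
Computer Science (《计算机科学》) | 2012, No. 3 | pp. 149-152, 169 | 5 pages
Authors
Ding Xiyuan; Huang Heyan; Zhang Haijun; Wang Shumei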
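Author affiliations

School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094;

Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences, Beijing 100097;

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081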

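Original format: PDF
Language of text: Chinese
Chinese Library Classification: Information processing
Keywords
frequent pattern; repeated string; corpus partitioning; substring reduction
Date indexed: 2022-08-18 04:37:53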

