首页> 外文期刊>Journal of The Institution of Engineers (India): Series B >GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents
【24h】

GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

机译:基于GPU的带分数表方法的N-Gram字符串匹配算法,用于在许多文档中进行字符串搜索

获取原文
获取原文并翻译 | 示例
       

摘要

String searching in documents has become a tedious task with the evolution of Big Data. Generation of large data sets demand for a high performance search algorithm in areas such as text mining, information retrieval and many others. The popularity of GPU’s for general purpose computing has been increasing for various applications. Therefore it is of great interest to exploit the thread feature of a GPU to provide a high performance search algorithm. This paper proposes an optimized new approach to N-gram model for string search in a number of lengthy documents and its GPU implementation. The algorithm exploits GPGPUs for searching strings in many documents employing character level N-gram matching with parallel Score Table approach and search using CUDA API. The new approach of Score table used for frequency storage of N-grams in a document, makes the search independent of the document’s length and allows faster access to the frequency values, thus decreasing the search complexity. The extensive thread feature in a GPU has been exploited to enable parallel pre-processing of trigrams in a document for Score Table creation and parallel search in huge number of documents, thus speeding up the whole search process even for a large pattern size. Experiments were carried out for many documents of varied length and search strings from the standard Lorem Ipsum text on NVIDIA’s GeForce GT 540M GPU with 96 cores. Results prove that the parallel approach for Score Table creation and searching gives a good speed up than the same approach executed serially.
机译:随着大数据的发展,在文档中进行字符串搜索已成为一项繁琐的任务。大数据集的产生要求在文本挖掘,信息检索等许多领域需要高性能的搜索算法。对于各种应用,用于通用计算的GPU的普及程度正在不断提高。因此,利用GPU的线程功能来提供高性能的搜索算法非常重要。本文提出了一种优化的N-gram模型新方法,用于在许多冗长的文档中进行字符串搜索及其GPU实现。该算法利用GPGPU在许多文档中使用字符级N-gram匹配和并行计分表方法搜索字符串,并使用CUDA API进行搜索。分数表的新方法用于在文档中以N-gram的频率存储,使搜索独立于文档的长度,并允许更快地访问频率值,从而降低了搜索复杂度。利用GPU中广泛的线程功能,可以对文档中的三字母组进行并行预处理,以创建分数表并在大量文档中进行并行搜索,从而即使在模式较大的情况下也可以加快整个搜索过程。在NVIDIA具有96个核心的GeForce GT 540M GPU上,对来自Lorem Ipsum标准文本的各种长度不同的文档和搜索字符串进行了实验。结果证明,与串行执行相同的方法相比,并行创建和查找表的方法可以提供更快的速度。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号