Conference: International Conference on Field Programmable Logic and Applications

Token-based dictionary pattern matching for text analytics

Abstract

When performing queries for text analytics on unstructured text data, a large amount of the processing time is spent on regular expressions and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for token pattern sequence detection. The architecture is capable of matching against several hundred dictionaries, each containing thousands of elements, at high throughput. A programmable state machine is used as the pattern detection engine to achieve deterministic performance while maintaining low storage requirements. For the detection of token sequences, dedicated circuitry is compiled based on a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented the architecture on an Altera Stratix IV GX530 and were able to process up to 16 documents in parallel at a peak throughput rate of 9.7 Gb/s.
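
As a rough software analogy only, the idea of token-bound dictionary matching with multi-token elements and multiple dictionary hits can be sketched in Python. This is not the hardware architecture from the paper: the trie layout, the whitespace tokenizer, and the names `build_token_trie` and `match_tokens` are all assumptions made for illustration.

```python
def build_token_trie(dictionaries):
    """Build a trie keyed by whole tokens (hypothetical helper).

    dictionaries: {dictionary_name: [entry, ...]}, where an entry may
    consist of several whitespace-separated tokens.
    """
    root = {"next": {}, "hits": set()}
    for name, entries in dictionaries.items():
        for entry in entries:
            node = root
            for token in entry.lower().split():
                node = node["next"].setdefault(
                    token, {"next": {}, "hits": set()}
                )
            # One entry may belong to several dictionaries at once.
            node["hits"].add(name)
    return root


def match_tokens(text, root):
    """Return (start_token, end_token, [dictionary names]) for every hit.

    Matching is token-bound: an entry only matches whole tokens, so
    "york" does not match inside "yorkshire".
    """
    tokens = text.lower().split()  # assumed tokenizer: split on whitespace
    results = []
    for start in range(len(tokens)):
        node = root
        for end in range(start, len(tokens)):
            node = node["next"].get(tokens[end])
            if node is None:
                break
            if node["hits"]:
                results.append((start, end, sorted(node["hits"])))
    return results


if __name__ == "__main__":
    demo = {"cities": ["New York", "York"], "states": ["New York"]}
    trie = build_token_trie(demo)
    for hit in match_tokens("I love New York", trie):
        print(hit)
```

Unlike the programmable state machine described in the abstract, this sketch restarts the trie walk at every token position, so its running time is not deterministic per input character; it only illustrates what is being detected, not how the hardware detects it.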
