
Highly Scalable Text Mining - Parallel Tagging Application


Abstract

There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle the exponential growth in text data. Problem sizes increase daily as new text documents are added. Labelling sequence data, as in part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition, is one of the most important tasks in text mining. Genia is a POS tagger specifically tuned for biomedical text. It is built with maximum entropy modelling and a state-of-the-art tagging algorithm. A parallel version of the Genia tagger application has been implemented, and its performance has been compared on a number of different architectures. The focus has been particularly on the scalability of the application. Scaling to 512 processors has been achieved, and a method to scale to 10000 processors is proposed for massively parallel text mining applications. The parallel implementation of the Genia tagger uses MPI to achieve portable code.
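The abstract does not say how documents are divided among processors. As an illustration only, the sketch below assumes a static block partition, a common choice when each document can be tagged independently: every rank computes its own contiguous slice of the document list from its rank number and the total rank count. The function name `block_partition` and its parameters are hypothetical and are not taken from the paper.

```python
def block_partition(num_docs: int, num_ranks: int, rank: int) -> range:
    """Return the contiguous range of document indices owned by `rank`.

    Assumption (not from the paper): documents are tagged independently,
    so a static, nearly even split needs no communication between ranks.
    """
    if not 0 <= rank < num_ranks:
        raise ValueError("rank must satisfy 0 <= rank < num_ranks")
    # Every rank gets `base` documents; the first `extra` ranks get one more.
    base, extra = divmod(num_docs, num_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Show how 10 documents would be spread over 4 ranks.
    for r in range(4):
        print(r, list(block_partition(10, 4, r)))
```

In an MPI program each process would call such a function with its own rank and the communicator size, then tag only the documents in the returned range. Whether the paper's implementation uses static or dynamic distribution is not stated in the abstract.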
