Conference: 2014 3rd International Conference on Parallel Distributed and Grid Computing

Distributed pattern matching and document analysis in big data using Hadoop MapReduce model

Abstract

Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a specific framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and runs on commodity machines with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System, as an efficient way to mine sequential patterns. The paper also investigates the feasibility of partitioning and clustering text document datasets for document comparison. This approach simplifies the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many Map tasks and distributed to many Task Trackers. The Map tasks compute intermediate results and send them to a Reduce task, which consolidates them into the final result. Both theoretical analysis and experimental results, with data and clusters of varying size, show the effectiveness of the MapReduce model, measured primarily by time requirements.
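The abstract does not give the implementation, so the following is only a minimal single-process sketch of the general idea: Knuth-Morris-Pratt matching runs inside each map task, and a reduce step consolidates the per-document matches. The names `build_failure_table`, `kmp_search`, `map_task`, `reduce_task`, and `run_job` are hypothetical, the input splits are plain in-memory lists instead of HDFS blocks, and no Hadoop API is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Position = Tuple[int, int]  # (line number, column) of a match; an assumed output format


def build_failure_table(pattern: str) -> List[int]:
    """For each i, length of the longest proper prefix of pattern[:i+1] that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return start indices of all (possibly overlapping) occurrences of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches: List[int] = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep scanning for overlapping matches
    return matches


def map_task(doc_id: str, lines: Iterable[str], pattern: str) -> Iterator[Tuple[str, Position]]:
    """Map step (sketch): emit (doc_id, position) for every match in one input split."""
    for line_no, line in enumerate(lines):
        for col in kmp_search(line, pattern):
            yield doc_id, (line_no, col)


def reduce_task(pairs: Iterable[Tuple[str, Position]]) -> Dict[str, List[Position]]:
    """Reduce step (sketch): group the intermediate pairs by document."""
    grouped: Dict[str, List[Position]] = defaultdict(list)
    for doc_id, position in pairs:
        grouped[doc_id].append(position)
    return {doc_id: sorted(positions) for doc_id, positions in grouped.items()}


def run_job(splits: Dict[str, List[str]], pattern: str) -> Dict[str, List[Position]]:
    """Run every map task in sequence, then reduce; a real cluster would run the maps in parallel."""
    intermediate: List[Tuple[str, Position]] = []
    for doc_id, lines in splits.items():
        intermediate.extend(map_task(doc_id, lines, pattern))
    return reduce_task(intermediate)
```

One limitation of this sketch is that it matches line by line, so a pattern that spans a line or split boundary is not found. The abstract does not say how the paper handles that case.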