International Conference on Circuits, Power and Computing Technologies

Locality Sensitive Hashing based Incremental Clustering for Creating Affinity Groups in Hadoop - HDFS - An Infrastructure Extension


Abstract

Apache's Hadoop is an open source framework for large scale data analysis and storage. It is an open source implementation of Google's Map/Reduce framework. It enables distributed, data intensive and parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. Hadoop uses the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System (GFS), for storing data. Map/Reduce applications mainly use HDFS for storing data. HDFS is a very large distributed file system that assumes commodity hardware and provides high throughput and fault tolerance. HDFS stores files as a series of blocks, which are replicated for fault tolerance. The default block placement strategy does not consider the data characteristics and places the data blocks randomly. Customized strategies can improve the performance of HDFS to a great extent. Applications using HDFS require streaming access to files, and performance can be increased if related files are placed on the same set of data nodes. This paper discusses a method for clustering streaming data onto the same set of data nodes using the technique of Locality Sensitive Hashing. The method uses the compact bitwise representation of document vectors, called fingerprints, created using Locality Sensitive Hashing to increase data processing speed and performance. The process does not affect the default fault tolerance properties of Hadoop and requires only minimal changes to the Hadoop framework.
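
To make the approach concrete, the following is a minimal, self-contained Java sketch of the general technique the abstract describes, not the authors' implementation: random-hyperplane Locality Sensitive Hashing maps each document vector to a compact bitwise fingerprint, and a simple incremental clusterer groups incoming fingerprints by Hamming distance. The class and method names, the 64-bit fingerprint width, and the Hamming threshold are illustrative assumptions; in the paper's setting, each resulting cluster would correspond to one affinity group of HDFS data nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch (not the paper's exact algorithm): random-hyperplane LSH
 * turns a document vector into a compact bitwise fingerprint, and an incremental
 * clusterer groups fingerprints by Hamming distance. Each cluster id would name
 * an affinity group, i.e. a set of data nodes that stores the related blocks.
 */
public class LshAffinityGrouping {

    private final double[][] hyperplanes;                     // one random hyperplane per fingerprint bit
    private final List<Long> clusterReps = new ArrayList<>(); // representative fingerprint per cluster
    private final int hammingThreshold;                       // assumed threshold for joining an existing cluster

    public LshAffinityGrouping(int numBits, int dimensions, int hammingThreshold, long seed) {
        Random rng = new Random(seed);
        this.hammingThreshold = hammingThreshold;
        hyperplanes = new double[numBits][dimensions];
        for (int b = 0; b < numBits; b++)
            for (int d = 0; d < dimensions; d++)
                hyperplanes[b][d] = rng.nextGaussian();
    }

    /** Each fingerprint bit records which side of a random hyperplane the vector falls on. */
    public long fingerprint(double[] docVector) {
        long bits = 0L;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < docVector.length; d++)
                dot += hyperplanes[b][d] * docVector[d];
            if (dot >= 0) bits |= 1L << b;
        }
        return bits;
    }

    /** Hamming distance between fingerprints approximates the angular distance of the vectors. */
    private static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Incremental clustering: join the nearest existing cluster if it is within the
     * Hamming threshold, otherwise open a new cluster. The returned id identifies the
     * affinity group whose data nodes would receive the document's blocks.
     */
    public int assignAffinityGroup(double[] docVector) {
        long fp = fingerprint(docVector);
        int best = -1, bestDist = Integer.MAX_VALUE;
        for (int i = 0; i < clusterReps.size(); i++) {
            int d = hamming(fp, clusterReps.get(i));
            if (d < bestDist) { bestDist = d; best = i; }
        }
        if (best >= 0 && bestDist <= hammingThreshold) return best;
        clusterReps.add(fp);
        return clusterReps.size() - 1;
    }

    public static void main(String[] args) {
        LshAffinityGrouping grouper = new LshAffinityGrouping(64, 4, 16, 42L);
        System.out.println(grouper.assignAffinityGroup(new double[]{1.0, 0.9, 0.1, 0.0})); // first document -> group 0
        System.out.println(grouper.assignAffinityGroup(new double[]{0.9, 1.0, 0.0, 0.1})); // similar document -> group 0
        System.out.println(grouper.assignAffinityGroup(new double[]{0.0, 0.1, 1.0, 0.9})); // dissimilar -> likely a new group
    }
}
```

Because each incoming document is compared only against one representative fingerprint per cluster, the assignment cost grows with the number of clusters rather than the number of stored documents, which is what makes the clustering incremental and suitable for streaming data.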
