IEEE International Conference on Big Data Computing Service and Applications

LHF: A New Archive Based Approach to Accelerate Massive Small Files Access Performance in HDFS

Abstract

As one of the most popular open source projects, Hadoop is nowadays considered the de facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop framework for storing big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines, but its performance degrades severely when dealing with massive numbers of small files. Although considerable effort has been spent investigating this well-known issue, existing approaches such as HAR, SequenceFile, and MapFile are limited in their ability to reduce the memory consumption of the NameNode while also optimizing access performance. In this paper, we present LHF, a solution that handles massive small files in HDFS by merging small files into big files and building a linear hashing based extendable index to speed up the process of locating a small file. The advantages of our approach are that (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards, and (4) it achieves good access performance. A series of experiments demonstrates the effectiveness and efficiency of LHF, which takes less time to access files than other methods.
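
To make the indexing idea concrete, the sketch below shows a minimal in-memory linear hashing index in Java that maps a small-file name to its (offset, length) location inside a merged file. It illustrates the general linear hashing technique only, under assumed names such as LinearHashIndex and Entry; the paper's actual on-disk index format and split policy are not reproduced here.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a linear hashing index: file name -> (offset, length) in a merged file.
// Class, field, and constant names are illustrative, not the paper's implementation.
public class LinearHashIndex {

    // Hypothetical record describing where a small file lives inside the merged file.
    public record Entry(String fileName, long offset, long length) {}

    private static final int INITIAL_BUCKETS = 4;  // N: number of buckets at level 0
    private static final int BUCKET_CAPACITY = 4;  // average load that triggers a split (assumption)

    private final List<List<Entry>> buckets = new ArrayList<>();
    private int level = 0;        // current splitting round
    private int splitPointer = 0; // next bucket to split in this round
    private int size = 0;

    public LinearHashIndex() {
        for (int i = 0; i < INITIAL_BUCKETS; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // h_level(key) = hash mod (N * 2^level); buckets already split this round use h_{level+1}.
    private int bucketFor(String key) {
        int h = key.hashCode() & 0x7fffffff;
        int b = h % (INITIAL_BUCKETS << level);
        return b < splitPointer ? h % (INITIAL_BUCKETS << (level + 1)) : b;
    }

    // Locate a small file by name; returns null if it is not indexed.
    public Entry get(String fileName) {
        for (Entry e : buckets.get(bucketFor(fileName))) {
            if (e.fileName().equals(fileName)) {
                return e;
            }
        }
        return null;
    }

    // Register a newly appended small file; the index grows one bucket at a time.
    public void put(String fileName, long offset, long length) {
        buckets.get(bucketFor(fileName)).add(new Entry(fileName, offset, length));
        size++;
        if (size > (long) buckets.size() * BUCKET_CAPACITY) {
            splitOneBucket();
        }
    }

    // Split the bucket at splitPointer by rehashing its entries with h_{level+1}.
    private void splitOneBucket() {
        buckets.add(new ArrayList<>());               // the new "image" bucket
        List<Entry> old = buckets.get(splitPointer);
        List<Entry> keep = new ArrayList<>();
        for (Entry e : old) {
            int b = (e.fileName().hashCode() & 0x7fffffff) % (INITIAL_BUCKETS << (level + 1));
            if (b == splitPointer) {
                keep.add(e);
            } else {
                buckets.get(b).add(e);                // moves to bucket splitPointer + N * 2^level
            }
        }
        buckets.set(splitPointer, keep);
        splitPointer++;
        if (splitPointer == (INITIAL_BUCKETS << level)) { // round complete: address space doubled
            level++;
            splitPointer = 0;
        }
    }
}

Because a lookup touches exactly one bucket and the table grows one bucket at a time, such an index can keep absorbing newly appended small files without re-sorting or rebuilding the whole structure, which matches the appendability and access-performance claims in the abstract.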