The Hadoop Distributed File System (HDFS) is widely used to manage large-scale data because of its high scalability. HDFS natively supports sequential queries, which are the most common queries in applications. However, many applications also need to run random queries over large-scale data, and such queries are becoming increasingly important. Because HDFS is not optimized for random reads, random access to HDFS suffers from several drawbacks. In this paper, we present three methods that address these issues by optimizing random access to HDFS while preserving sequential access performance: 1) dynamically setting the size of the data packets used in transmission, 2) reusing TCP connections for localized random accesses, and 3) directing random accesses to the same server to make full use of existing TCP connections. Experimental evaluations on real-world data show that our methods are effective and that, compared with the original methods, they efficiently support both sequential and random access.
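
The three methods can be illustrated with a minimal sketch. It is not the authors' implementation and does not use any HDFS API: the names `choose_packet_size`, `ConnectionCache`, and `pick_server`, the 512-byte and 64 KiB bounds, and the doubling rule are all assumptions made for illustration, and connections are simulated with plain strings rather than real sockets.

```python
# Hypothetical sketch of the three ideas; no real HDFS or socket calls are made.

MIN_PACKET_BYTES = 512        # assumed lower bound, chosen for illustration
MAX_PACKET_BYTES = 64 * 1024  # assumed upper bound, chosen for illustration


def choose_packet_size(request_bytes):
    """Method 1: size the packet to the request.

    Short random reads get a small packet; long sequential reads
    get the largest packet so their throughput is unchanged.
    """
    size = MIN_PACKET_BYTES
    while size < request_bytes and size < MAX_PACKET_BYTES:
        size *= 2
    return size


class ConnectionCache:
    """Method 2: keep one open connection per server and reuse it."""

    def __init__(self):
        self._open = {}   # server name -> simulated connection
        self.opened = 0   # how many connections were actually created

    def has(self, server):
        return server in self._open

    def get(self, server):
        conn = self._open.get(server)
        if conn is None:
            # A real client would perform the TCP handshake here.
            self.opened += 1
            conn = f"conn-{self.opened}-to-{server}"
            self._open[server] = conn
        return conn


def pick_server(replicas, cache):
    """Method 3: prefer a replica we already hold a connection to."""
    for server in replicas:
        if cache.has(server):
            return server
    return replicas[0]


def random_read(replicas, length, cache):
    """Plan one random read: which server, which connection, which packet size."""
    server = pick_server(replicas, cache)
    return server, cache.get(server), choose_packet_size(length)


if __name__ == "__main__":
    cache = ConnectionCache()
    for replicas, length in [(("dn1", "dn2"), 100),
                             (("dn2", "dn3"), 4000),
                             (("dn3", "dn2"), 10**6)]:
        print(random_read(replicas, length, cache))
    print("connections opened:", cache.opened)
```

In this sketch, the third read is sent to `dn2` even though `dn3` is listed first, because a connection to `dn2` is already open. That shows how routing and connection reuse work together to avoid a new TCP handshake.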