Stateless clusters such as Hadoop clusters are widely deployed to drive the business data analysis. When a cluster needs to be restarted for cluster-wide maintenance, it is desired for the administrators to choose a maintenance window that results in: (1) least disturbance to the cluster operation; and (2) maximized job processing throughput. A straightforward but naive approach is to choose maintenance time that has the least number of running jobs, but such an approach is suboptimal. In this work, we use Hadoop as an use case and propose to determine the optimal cluster maintenance time based on the accumulated job progress, as opposed the number of running jobs. The approach can maximize the job throughput of a stateless cluster by minimizing the amount of lost works due to maintenance. Compared to the straightforward approach, the proposed approach can save up to 50% of wasted cluster resources caused by maintenance according to production cluster traces.
展开▼