The increasing computation and data requirements of scientific applications, especially in bioinformatics, astronomy, high energy physics, and earth sciences, have necessitated the use of distributed resources owned by collaborating parties. While existing distributed systems work well for compute-intensive applications that require limited data movement, they fail in unexpected ways when an application accesses, creates, and moves large amounts of data over wide-area networks. Existing systems tightly couple data movement and computation, and treat data movement as a side effect of computation. In this chapter, we propose a framework that decouples data movement from computation, allows data movement to be queued and scheduled separately from computation, and acts as an I/O subsystem for distributed systems. This system provides a uniform interface to heterogeneous storage systems and data transfer protocols; permits policy support and higher-level optimization; and enables reliable, efficient scheduling of compute and data resources.