SCALABLE RESOURCE MANAGEMENT IN CLOUD COMPUTING
MetadataShow full item record
The exponential growth of data and application complexity has brought new challenges in the distributed computing field. Scientific applications are growing more diverse with various workloads, including traditional MPI high performance computing (HPC) to fine-grained loosely coupled many-task computing (MTC). Traditionally, these workloads have been shown to run well on supercomputers and highly-tuned HPC Clusters. The advent of Cloud computing has brought the attention of scientists to exploit these resources for scientific applications at a potentially lower cost. We investigate the nature of the cloud and its ability to run scientific applications efficiently. Delivering high throughput and low latency for the various types of workloads at large scales has driven us to design and implement new job scheduling and execution systems that are fully distributed and have the ability to run in public clouds. We discuss the design and implementation of a job scheduling and execution system (CloudKon). CloudKon is optimized to exploit the cloud resources efficiently through a variety of cloud services (Amazon SQS and DynamoDB) in order to get the best performance and utilization. It also supports various workloads including MTC and HPC applications concurrently. To further improve the performance and the flexibility of CloudKon, we designed and implemented a fully distributed message queue (Fabriq) that delivers an order of magnitude better performance than the Amazon Simple Queuing System (SQS). Designing Fabriq helped us expand our scheduling system to many other distributed system including non-Amazon clouds. Having Fabriq as a building block, we were able to design and implement a multipurpose task scheduling and execution framework (Albatross) that is able to efficiently run various types workloads at larger scales. Albatross provides data locality and task execution dependency. Those features enable Albatross to natively run MapReduce workloads. We evaluated CloudKon with synthetic MTC workloads, synthetic HPC workloads, and synthetic MapReduce applications on the Amazon AWS cloud with up to 1K instances. Fabriq was also evaluated with synthetic workloads on Amazon AWS cloud with up to 128 instances. Performance evaluations of Albatross show its ability to outperform Spark and Hadoop on different scenarios.