Recently I had a few discussions with people looking at leveraging the cloud. They were looking at extending their own compute farm by establishing a VPN to a public cloud and borrowing computing resources as needed.
With a compute farm, you have a fixed amount of computing resources. You rely on an engine like LSF (Load Sharing Facility) or SGE (Sun Grid Engine) to schedule and prioritize jobs to best use these fixed computing resources. The rule of the game is to keep the queue as short as possible, or better, to keep the relative increase in processing time as small as possible.
Clearly, having access to hundreds of compute nodes (or instances, to use the cloud terminology) on demand changes the game of parallel computing entirely. In the world of cloud computing, using 100 instances for 1 hour costs the same as using 1 instance for 100 hours (a bit more than 4 days). Assuming you can borrow as much as you want and that you can keep all the instances busy, there is no point in limiting the computing resources: the cost will be the same, but the wall time will be reduced.
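To make the arithmetic concrete, here is a minimal Python sketch. It assumes perfectly parallel work and a flat hourly price; the function name and the rate of 0.50 per instance-hour are made up for illustration.

```python
def cost_and_wall_time(total_work_hours, instances, hourly_rate):
    """Ideal case: work splits evenly and every instance stays busy."""
    wall_time = total_work_hours / instances        # hours until the result is ready
    cost = instances * wall_time * hourly_rate      # instance-hours times price
    return cost, wall_time


# Hypothetical rate of 0.50 per instance-hour, 100 hours of total work.
for n in (1, 10, 100):
    cost, wall = cost_and_wall_time(100, n, 0.50)
    print(f"{n:>3} instances: cost {cost:.2f}, wall time {wall:.1f} h")
```

Under these assumptions the cost column stays constant while the wall time shrinks in proportion to the number of instances.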
Of course, this is more complicated in practice. You are billed by the hour. Thus starting a new instance to process jobs is wasteful if that instance is not fully utilized during a whole hour. Also, starting or shutting down an instance takes a few minutes, during which that instance is unavailable. Thus you must be able to anticipate the upcoming job distribution and have a clear picture of the instances’ load and how long they have before their hour expires to decide whether you should:
- Start a new instance to process jobs;
- Shut down an instance before being charged another hour;
- Queue a job.
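The three choices above could be sketched as a simple rule of thumb. This is only an illustration under stated assumptions, not how LSF, SGE, or any cloud provider actually works: the `Instance` fields, the 5-minute startup time, and the 10-minute acceptable wait are all hypothetical values.

```python
import math
from dataclasses import dataclass

BILLING_PERIOD_MIN = 60   # billed per started hour, as described in the post
STARTUP_MIN = 5           # assumption: "a few minutes" to start or stop an instance


@dataclass
class Instance:
    minutes_into_hour: float   # minutes elapsed in the current billed hour
    busy_minutes_left: float   # work already assigned to this instance


def billed_cost(minutes_used, hourly_rate):
    """Every started hour is charged in full."""
    return math.ceil(minutes_used / BILLING_PERIOD_MIN) * hourly_rate


def place_job(instances, max_wait_min=10):
    """Return 'run', 'queue', or 'start' for an incoming job."""
    if any(i.busy_minutes_left == 0 for i in instances):
        return "run"      # an idle instance is already paid for
    if instances and min(i.busy_minutes_left for i in instances) <= max_wait_min:
        return "queue"    # an instance frees up soon enough
    return "start"        # nothing available in time: borrow a new instance


def should_shut_down(inst, queued_minutes, margin_min=STARTUP_MIN):
    """Stop an idle instance just before its next billed hour begins."""
    idle = inst.busy_minutes_left == 0
    near_boundary = BILLING_PERIOD_MIN - inst.minutes_into_hour <= margin_min
    return idle and near_boundary and queued_minutes == 0
```

For example, `billed_cost(61, 0.50)` is charged as two full hours, which is why an instance that is idle at minute 57 with an empty queue is a candidate for shutdown. A real policy would also have to forecast upcoming jobs, which this sketch ignores.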
You also want to account for hardware failures, which are more likely to happen if you have many instances for a long time. Also, inter-instance communication, if needed, can become a major bottleneck in scaling up, unless there is a 10Gb network available.
The cloud has an appealing message: borrow when you need it. It is a paradigm shift for parallel computing. This means moving from “managing resources” in a compute farm to “managing cost” in the cloud. As explained above, managing cost in the cloud is substantially more complicated than scheduling jobs on a compute farm. Yet cloud computing gives the flexibility to design strategies that reduce wall time and keep costs low. It will only benefit customers: borrow more for less time, pay the same, and get the result faster.