EC2 and S3 are not the only AWS services of interest to the geospatial community. Amazon SQS (Simple Queue Service) is also quite interesting. I haven’t looked into it too far, but unlimited locking message queues combined with large instance arrays are essentially a poor man’s supercomputer. Parallel computing techniques have often been used for problems of a certain scale that can be divided recursively into multiple subsets. Numerous distributed computing projects come to mind; see the list of Active Distributed Computing Projects.
Perhaps AWS can be configured for short burst supercomputer problems in an economical fashion. By breaking a problem into enough small chunks and adding them to a set of SQS queues pointed at a configurable array of AMI instances, voila, we have an AWS supercomputer! The EC2 instance array would pull data chunks out of a queue, process them, and queue the results back to an aggregator instance. An interesting problem might be to determine whether such a scenario would be queue constrained or processing instance constrained. Amazon resources are not infinite: “If you wish to run more than 20 instances, please contact us at aws@amazon.com” However, let’s imagine a utility computing environment of the future.
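The chunk, queue, and aggregate flow described above can be tried out locally before any AWS account is involved. The following is only a minimal sketch: Python's standard `queue.Queue` stands in for the SQS queues, threads stand in for the EC2 instance array, and the names `run_burst`, `work_q`, and `result_q` are hypothetical, not part of any AWS API.

```python
import queue
import threading


def run_burst(chunks, worker_count, process):
    """Fan chunks out to workers through one queue, collect results through another."""
    work_q = queue.Queue()    # local stand-in for the SQS work queue (assumption)
    result_q = queue.Queue()  # local stand-in for the queue back to the aggregator

    # Tag each chunk with its position so the aggregator can restore order.
    for index, chunk in enumerate(chunks):
        work_q.put((index, chunk))

    def worker():
        # Each worker plays the role of one instance in the array.
        while True:
            try:
                index, chunk = work_q.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker shuts down
            result_q.put((index, process(chunk)))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Aggregator: reassemble the results in the original chunk order.
    results = [None] * len(chunks)
    while not result_q.empty():
        index, value = result_q.get()
        results[index] = value
    return results
```

Calling `run_burst([[1, 2], [3, 4], [5]], worker_count=2, process=sum)` hands three chunks to two workers and returns one result per chunk. Timing the same call with different `worker_count` values is one crude way to see whether a workload is limited by the queue or by the workers.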
In the AWS of the future an instance array can be more like Deep Blue. A modest 32×32 array provides 1024 discrete process instances, which is possibly within current limits, but a more ambitious 256×256 array of 65536 distinct instances would not be out of the question on the five-year horizon.
In the geospatial arena there are numerous problems amenable to distributed processing. With the massive collection of geospatial imagery presently underway, collection and storage are already a large problem for NASA, NOAA, JPL, USGS etc. Add to this problem the issue of scientific exploration of these massive data sets and distributed computing may have a large role to play within the same 5 year horizon.
This week OGC announced the final release of the Web Processing Service, WPS (see the OGC WPS press release). The Web Processing Service spec provides a blueprint for services to ask higher level questions like why?, how much?, and what if? The goal is to provide interchangeable service process algorithms that can potentially be chained into answers to these types of higher level questions. For example, a lidar scene can be processed into a roughness measure using a convolution kernel. When the result is compared with other bands from hyperspectral sensors in some boolean operation, the output could be used to answer the question: “how many acres of drought tolerant grassland lie within Kit Carson County?” There are at least two distinct functions here, 1) roughness calculation and 2) boolean combination, and possibly a third to sum all pixels in the expected range for a final area measure.
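To make the three functions in that example concrete, here is a rough pure-Python sketch of the chain. Everything in it is an assumption made for illustration: the Laplacian kernel as the roughness measure, the threshold parameters, the `drought_index` band, and all function names. The WPS spec does not define any of these.

```python
SQ_METERS_PER_ACRE = 4046.8564224  # one international acre in square meters

# A 3x3 Laplacian kernel, chosen here as one possible roughness measure.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def convolve3x3(grid, kernel):
    """Apply a 3x3 kernel to interior cells; border cells are left at 0.0.

    The kernel is not flipped, which makes no difference for a symmetric
    kernel such as the Laplacian.
    """
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(kernel[i][j] * grid[r - 1 + i][c - 1 + j]
                            for i in range(3) for j in range(3))
    return out


def roughness(elevation):
    """Function 1: roughness as the absolute Laplacian of elevation."""
    return [[abs(v) for v in row] for row in convolve3x3(elevation, LAPLACIAN)]


def boolean_and(mask_a, mask_b):
    """Function 2: cell-by-cell boolean combination of two masks."""
    return [[a and b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


def area_acres(mask, cell_size_m):
    """Function 3: count the True cells and convert the total to acres."""
    cells = sum(1 for row in mask for v in row if v)
    return cells * cell_size_m ** 2 / SQ_METERS_PER_ACRE


def grassland_acres(elevation, drought_index, max_roughness, min_index, cell_size_m):
    """Chain the three steps; both thresholds are hypothetical inputs."""
    smooth = [[v <= max_roughness for v in row] for row in roughness(elevation)]
    tolerant = [[v >= min_index for v in row] for row in drought_index]
    return area_acres(boolean_and(smooth, tolerant), cell_size_m)
```

Each function takes a grid and returns a grid or a number, so any step could in principle sit behind its own service and be swapped for a different algorithm, which is the interchangeability the spec is aiming for.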
Now add a distributed compute model. The simplest is one process per instance. In this approach each analysis request gets its own EC2 instance, and all processes run sequentially in that single dedicated instance. This is of course a big help and far different from the typical model of one server handling many requests. But now we can move down this stream another step or two.
Next, why not one instance for each process step? In this case a queue connects each step to a downstream instance. Process one performs the convolution, and as chunks/cells/tiles become available they are pushed into an SQS queue. Process two, the boolean union, picks chunks from the other end of the queue to build the end result from a series of boolean tile operations. The queue decouples the two processes so that asynchronous operations are possible. If the first process proceeds at twice the speed of the second, simply add another process two instance at the consuming end of the queue. In this scenario we have one request, two WPS processes, and perhaps 3 AMI instances. This improves things a bit, actually quite a bit. The cost per request has at least tripled, but throughput has also been increased by close to the same factor.
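The decoupling described here, with one producer and a second consumer added at the far end of the queue, can be sketched on a single machine. This is an illustration under stated assumptions only: `queue.Queue` plays the part of SQS, threads play the part of instances, the doubling and threshold operations are placeholders for the real convolution and boolean steps, and every name is hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel telling a consumer that no more tiles are coming


def stage_one(tiles, out_q, consumer_count):
    """Process one: transform each tile and push it downstream as soon as it is ready."""
    for tile_id, tile in tiles:
        out_q.put((tile_id, [v * 2 for v in tile]))  # placeholder for the convolution
    for _ in range(consumer_count):
        out_q.put(DONE)  # one sentinel per consumer so each of them stops


def stage_two(in_q, results, lock):
    """Process two: pull tiles off the queue and apply a boolean operation."""
    while True:
        item = in_q.get()
        if item is DONE:
            return
        tile_id, tile = item
        with lock:
            results[tile_id] = [v > 4 for v in tile]  # placeholder boolean step


def run_pipeline(tiles, consumer_count=2):
    """One producer feeding consumer_count consumers through a single queue."""
    q = queue.Queue()
    results = {}
    lock = threading.Lock()
    consumers = [threading.Thread(target=stage_two, args=(q, results, lock))
                 for _ in range(consumer_count)]
    for t in consumers:
        t.start()
    producer = threading.Thread(target=stage_one, args=(tiles, q, consumer_count))
    producer.start()
    producer.join()
    for t in consumers:
        t.join()
    return results
```

With `consumer_count=2` this mirrors the three instance layout: one instance on process one and two on the slower process two. Neither stage knows how many workers sit on the other side of the queue, which is what makes rebalancing a matter of changing one number.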
Now comes a full blown distributed model. Like most array objects, geospatial data sets can be broken into smaller subsets and the same process replicated over an array of subsets in a parallel fashion. Now each step in the process chain can have an array of instances, each working on a small chunk. These chunks feed into multiple queues directed downstream to process two, which is also an array of instances. We now have supercomputing potential: a 32×32 array pool of instances running process one, feeding some set of queues that connect to a second 32×32 array pool of instances working on process two. At 1024 instances per process we can quickly see the current AWS is not going to be happy. The cost is now magnified by a factor of a thousand, but only if the instance pools are maintained continuously. If the pools are only in use for the duration of the request, the cost could potentially be of the same magnitude as in the one process per instance architecture, while throughput is increased by a factor of roughly 1000. Short burst supercomputing inside utility computing warehouses like AWS could be quite cost effective.
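Two pieces of this argument are easy to check with a few lines of code: cutting a raster into an array of tiles, and the arithmetic behind the claim that a short burst costs about the same as one long run. The sketch below assumes ideal linear speedup and billing exactly proportional to use, neither of which is guaranteed in practice. The function names and the hourly rate in the usage note are hypothetical.

```python
def split_into_tiles(grid, tile_rows, tile_cols):
    """Cut a 2D grid into tiles keyed by (tile_row, tile_col)."""
    tiles = {}
    for r0 in range(0, len(grid), tile_rows):
        for c0 in range(0, len(grid[0]), tile_cols):
            key = (r0 // tile_rows, c0 // tile_cols)
            tiles[key] = [row[c0:c0 + tile_cols]
                          for row in grid[r0:r0 + tile_rows]]
    return tiles


def burst_cost(single_instance_hours, instance_count, hourly_rate):
    """Wall clock time and total cost when the work is spread over a pool.

    Assumes perfect linear speedup and billing proportional to use.
    """
    wall_clock_hours = single_instance_hours / instance_count
    cost = wall_clock_hours * instance_count * hourly_rate
    return wall_clock_hours, cost
```

For a job that would take 1024 hours on one instance at a made-up rate of 0.10 per instance hour, `burst_cost(1024.0, 1, 0.10)` and `burst_cost(1024.0, 1024, 0.10)` give the same total cost, but the second finishes in one hour instead of 1024. If usage is billed in whole hour increments, bursts shorter than an hour lose some of that advantage. A convolution step would also need each tile padded with a border of neighboring cells, which `split_into_tiles` does not do.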
It is conceivable that some analysis chains will involve dozens of process steps over very large imagery sets. Harnessing the ephemeral instance creation of utility computing points toward solutions to complex WPS process chains in near real time all on the internet cloud. So SQS does have some interesting potential in the geospatial analysis arena.
