Amazon SQS, the poor man's supercomputer

EC2 and S3 are not the only AWS services of interest to the geospatial community. Amazon SQS (Simple Queue Service) is also quite interesting. I haven't looked into it too far, but unlimited message queues with locking, combined with large instance arrays, amount to a poor man's supercomputer. Parallel computing techniques have long been applied to problems that can be split recursively into multiple subsets, and numerous distributed computing projects come to mind (see the Active Distributed Computing Projects list).

Perhaps AWS can be configured for short-burst supercomputer problems in an economical fashion. Break a problem into enough small chunks, add them to a set of SQS queues pointed at a configurable array of AMI instances, and voila, we have an AWS supercomputer! The EC2 instance array would pull data chunks out of a queue, process them, and queue the results back to an aggregator instance. An interesting question is whether such a scenario would be queue constrained or processing-instance constrained. Amazon resources are not infinite: “If you wish to run more than 20 instances, please contact us at ” However, let’s imagine a utility computing environment of the future.
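The chunk/queue/aggregate pattern can be sketched locally, with Python's queue module standing in for SQS and threads standing in for the instance array (a real deployment would use the SQS SendMessage/ReceiveMessage calls instead):

```python
import threading
import queue

# Stand-ins for two SQS queues: work items flow in, results flow back
# to an aggregator instance.
work_q = queue.Queue()
result_q = queue.Queue()

def worker():
    """One EC2-instance stand-in: pull chunks, process, queue results."""
    while True:
        chunk = work_q.get()
        if chunk is None:          # poison pill shuts the worker down
            break
        result_q.put(sum(chunk))   # the "processing" step, here a sum

def run(data, n_workers=4, chunk_size=1000):
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    # Break the problem into small chunks and enqueue them
    for i in range(0, len(data), chunk_size):
        work_q.put(data[i:i + chunk_size])
    for _ in threads:              # one poison pill per worker
        work_q.put(None)
    for t in threads:
        t.join()
    # Aggregator instance: drain the result queue
    total = 0
    while not result_q.empty():
        total += result_q.get()
    return total

print(run(list(range(10000))))  # 49995000
```

Whether this is queue constrained or instance constrained depends on the ratio of chunk transfer time to chunk processing time.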

In the AWS of the future, an instance array can be more like Deep Blue. A modest 32×32 array provides 1024 discrete process instances, possibly within current limits, while a more ambitious 256×256 array of 65,536 distinct instances would not be out of the question on the five-year horizon.

In the geospatial arena there are numerous problems amenable to distributed processing. With the massive collection of geospatial imagery presently underway, collection and storage are already a large problem for NASA, NOAA, JPL, USGS, etc. Add to this the scientific exploration of these massive data sets, and distributed computing may have a large role to play within the same five-year horizon.

This week OGC announced the final release of the Web Processing Service, WPS (see the OGC WPS press release). The Web Processing Service spec provides a blueprint for services that ask higher-level questions like why?, how much?, and what if? The goal is to provide interchangeable service process algorithms that can be chained into answers to these types of higher-level questions. For example, a lidar scene can be processed into a roughness measure using a convolution kernel. When the result is combined with other bands from hyperspectral sensors in a boolean operation, the output could answer the question: “How many acres of drought-tolerant grassland lie within Kit Carson county?” There are at least two distinct functions, 1) roughness calculation and 2) boolean combination, and possibly a third that sums all pixels in the expected range for a final area measure.
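A rough sketch of that three-step chain in NumPy, with random arrays standing in for the lidar scene and hyperspectral band; the kernel, thresholds, and per-pixel acreage are illustrative, not from any real WPS service:

```python
import numpy as np

rng = np.random.default_rng(0)
elevation = rng.random((100, 100))   # stand-in lidar scene
band = rng.random((100, 100))        # stand-in hyperspectral band
PIXEL_ACRES = 0.1                    # assumed ground area per pixel

def roughness(z):
    """Step 1: 3x3 convolution kernel -- deviation of each cell from
    its neighborhood mean, a simple roughness measure."""
    kernel = np.full((3, 3), 1 / 9.0)
    neighborhood = np.zeros_like(z)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            neighborhood += kernel[dy + 1, dx + 1] * \
                np.roll(np.roll(z, dy, axis=0), dx, axis=1)
    return np.abs(z - neighborhood)

# Step 2: boolean combination of roughness with the other band
rough = roughness(elevation)
grassland = (rough < 0.2) & (band > 0.5)

# Step 3: sum the pixels in range for a final area measure
acres = grassland.sum() * PIXEL_ACRES
print(f"{acres:.1f} acres")
```

Each step is an independent function over a raster, which is exactly what makes the chain a candidate for separate WPS processes.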

Now add a distributed compute model. The simplest is one process per instance: each analysis request gets its own EC2 instance, and all processes run sequentially in that single dedicated instance. This is of course a big help and far different from the typical many-requests-one-server model. But now we can move down this stream another step or two.

Next, why not one instance per process step? In this case a queue connects to a downstream instance. Process one performs the convolution and, as chunks/cells/tiles become available, pushes them into SQS. Process two, the boolean union, picks chunks from the other end of the queue to build the end result from a series of boolean tile operations. The queue decouples the two processes so that asynchronous operation is possible: if the first process runs at twice the speed of the second, simply add another instance to the consuming end of the queue. In this scenario we have one request, two WPS processes, and perhaps three AMI instances. This improves things quite a bit. The cost per request has at least tripled, but throughput has also increased by close to the same factor.
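A minimal local sketch of the decoupling, with a Python queue between two thread "instances" and artificial sleep timings; it shows how attaching a second consumer to the slow end of the queue balances the pipeline:

```python
import threading
import queue
import time

# The queue decouples process one (fast) from process two (slow); when
# the consumer lags, attach another consumer instance to the same
# queue. Timings are illustrative stand-ins for real processing.
tiles = queue.Queue()
done = []

def process_one(n_tiles):
    for i in range(n_tiles):
        time.sleep(0.01)           # convolution on one tile
        tiles.put(i)

def process_two():
    while True:
        i = tiles.get()
        if i is None:              # poison pill
            break
        time.sleep(0.02)           # boolean union runs at half the speed
        done.append(i)

def run(n_tiles, n_consumers):
    consumers = [threading.Thread(target=process_two)
                 for _ in range(n_consumers)]
    for c in consumers:
        c.start()
    start = time.time()
    process_one(n_tiles)
    for _ in consumers:
        tiles.put(None)
    for c in consumers:
        c.join()
    return time.time() - start

t1 = run(20, 1)   # one consumer: bounded by the slow stage
t2 = run(20, 2)   # two consumers: the slow stage keeps pace
print(f"one consumer: {t1:.2f}s, two consumers: {t2:.2f}s")
```

With one consumer the slow stage dominates; with two, total time drops toward the producer's pace, which is the throughput gain the extra AMI instance buys.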

Now comes a full-blown distributed model. Like most array objects, geospatial processes can be broken into smaller subsets, with the same process replicated over an array of subsets in parallel. Now each step in the process chain can have an array of instances, each working on a small chunk. These chunks feed into multiple queues directed downstream to process two, which is also an array of instances. We now have supercomputing potential: a 32×32 array pool of instances for process one feeding a set of queues connecting to a second 32×32 array pool of instances working on process two. At 1024 instances per process, we can quickly see the current AWS is not going to be happy. The cost is now magnified by a factor of a thousand, but only if the instance pools are maintained continuously. If the pools are in use only for the duration of the request, the cost could potentially be of the same magnitude as the one-process-per-instance architecture, while throughput is increased by the same factor of a thousand. Short-burst supercomputing inside utility computing warehouses like AWS could be quite cost effective.
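The tile-parallel version can be sketched with two thread pools standing in for the two instance arrays (toy sizes, and the per-tile operations are placeholders for the convolution and boolean steps):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

TILE = 50  # tile edge length; the real chunk size is a tuning choice

def stage_one(tile):
    # per-tile convolution stand-in: deviation from the tile mean
    return np.abs(tile - tile.mean())

def stage_two(tile):
    # per-tile boolean operation stand-in
    return tile > 0.25

def split(img, size=TILE):
    """Break the image into small square subsets."""
    return [img[y:y + size, x:x + size]
            for y in range(0, img.shape[0], size)
            for x in range(0, img.shape[1], size)]

img = np.random.default_rng(1).random((200, 200))

# Two worker pools, one per process stage, each standing in for an
# array of instances; tiles flow from pool one into pool two.
with ThreadPoolExecutor(4) as pool_one, ThreadPoolExecutor(4) as pool_two:
    rough_tiles = list(pool_one.map(stage_one, split(img)))
    masks = list(pool_two.map(stage_two, rough_tiles))

count = sum(int(m.sum()) for m in masks)
print(count, "pixels in range")
```

Scaling the pools from 4 workers to 1024 instances changes nothing structural; the split/map/aggregate shape is the same.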

It is conceivable that some analysis chains will involve dozens of process steps over very large imagery sets. Harnessing the ephemeral instance creation of utility computing points toward near-real-time solutions to complex WPS process chains, all on the internet cloud. So SQS does have some interesting potential in the geospatial analysis arena.

Xaml on Amazon EC2 and S3

Time to experiment with Amazon EC2 and S3. This site is using an Amazon EC2 instance with a complete open source GIS stack running on Ubuntu Gutsy.

  • Ubuntu Gutsy
  • Java 1.6.0
  • Apache2
  • Tomcat 6.0.16
  • PHP5
  • MySQL 5.0.45
  • PostgreSQL 8.2.6
  • PostGIS 1.2.1
  • GEOS 2.2.3-CAPI-1.1.1
  • Proj 4.5.0
  • GeoServer 1.6

Running an Apache2 service with a jk_mod connector to Tomcat lets me run the examples of XAML xbap files with their associated Java servlet utilities for pulling up GetCapabilities trees on various OWS services. This is an interesting example of combining open source and WPF. In the NasaNeo example, Java is used to create the 3D terrain models from JPL SRTM data and drape them with BMNG imagery, all served as WPF XAML to take advantage of native client bindings.

I originally attempted to start with a public AMI based on Fedora Core 6. Loading the stack was difficult, with hard-to-find RPMs and awkward installation issues. I finally ran into a wall with the PostgreSQL/PostGIS install: to load it I needed a complete gcc/make toolchain to compile from sources, which did not seem worth the trouble. At that point I switched to an Ubuntu 7.10 Gutsy AMI.

Ubuntu, based on Debian, differs somewhat in its directory layout from the Fedora base. However, Ubuntu's apt-get was much better maintained than Fedora Core's yum installs. This may be due to using the older Fedora 6 rather than Fedora 8 or 9, but there did not appear to be any usable public AMI images on AWS EC2 for the newer Fedoras. In contrast to Fedora, installing a recent version of PostgreSQL/PostGIS on Ubuntu was a simple matter:
apt-get install postgresql-8.2-postgis postgis

In this case I was using the basic small 32-bit instance AMI with 1.7GB memory and 160GB storage at $0.10/hour. The performance was very comparable to some dedicated servers we are running, perhaps even a bit better, since the Ubuntu service is set up using an Apache2 jk_mod to Tomcat while the dedicated servers simply use Tomcat.

There are some issues to watch for on the small AMI instances. The storage is 160GB, but the partition allots just 10GB to root and the balance to a /mnt point. This means the default installations of MySQL and PostgreSQL will have data directories on the smaller 10GB partition. Amazon has done this to limit ec2-bundle-vol to a 10GB max. ec2-bundle-vol is used to store an image to S3, which is where the whole utility computing approach gets interesting.

Once an AMI stack has been installed, it is bundled and stored on S3, and that AMI is registered with AWS. Now you have the ability to replicate the image on as many instances as desired, which allows very fast scaling or failover with minimal effort. The only caveat, of course, is dynamic data. Unless provision is made to replicate MySQL and PostgreSQL data to multiple instances or S3, any changes can be lost with the loss of an instance. This does not appear to happen terribly often, but then again AWS is still beta. Also important to note: a DNS domain pointed at an existing instance will also be lost with the loss of that instance. Bringing up a new instance requires a change to the DNS entry as well (several hours), since each instance creates its own unique Amazon domain name. There appear to be some workarounds for this, requiring more extensive knowledge of DNS servers.

In my case the data sources are fairly static. I ended up changing the datadir pointers to /mnt locations. Since these are not bundled in the volume creation, I handled them separately: once the required data was loaded, I ran tar on the /mnt directory and copied the .tar files each to its own S3 bucket. The files are quite large, so this is not a nice way to treat backups of dynamic data resources.

Next week I have a chance to experiment with a more comprehensive solution from Elastra. Their beta version promises to solve these issues by wrapping PostgreSQL/PostGIS on the EC2 instance with a layer that uses S3 as the actual datadir. I am curious how this is done, but assume that for performance the indices remain local to an instance while the data resides on S3. I will be interested to see what performance is possible with this product.

Another interesting area to explore is Amazon’s recently introduced SimpleDB. This is not a standard SQL database but a type of hierarchical object stack over on S3 that can be queried from EC2 instances. It is geared toward untyped text storage, which is fairly common in website building. It will be interesting to adapt it to geospatial data to see what can be done. One idea is to store bounding box attributes in SimpleDB and create some type of JTS tool for indexing on the EC2 instance. The local spatial index would handle the lookup, which is then fed to the SimpleDB query tools for retrieving data. I imagine the biggest bottleneck in this scenario would be the cost of text conversion to double and its inverse.
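A sketch of the idea, with a plain dict emulating SimpleDB items (SimpleDB stores only strings, hence the text-to-double conversion on every lookup; the item names and attributes are hypothetical, and a brute-force scan stands in for a real JTS spatial index):

```python
# Emulated SimpleDB domain: every attribute value is a string, so
# bounding boxes go in as text and come back out as text.
items = {
    "parcel-1": {"minx": "-102.5", "miny": "38.1",
                 "maxx": "-102.1", "maxy": "38.4"},
    "parcel-2": {"minx": "-103.0", "miny": "39.0",
                 "maxx": "-102.8", "maxy": "39.3"},
}

def bbox(item):
    """Text -> double conversion, the likely bottleneck."""
    return tuple(float(item[k]) for k in ("minx", "miny", "maxx", "maxy"))

def intersects(a, b):
    """Standard bbox overlap test on (minx, miny, maxx, maxy) tuples."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def query(window):
    """Local spatial lookup standing in for a JTS index; the matching
    item names would then be fetched from SimpleDB itself."""
    return [name for name, item in items.items()
            if intersects(bbox(item), window)]

print(query((-102.6, 38.0, -102.0, 38.5)))  # ['parcel-1']
```

Every query pays the string-parsing cost for each candidate item, which is why keeping the index itself local to the instance matters.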

Utility computing has an exciting future in the geospatial realm – thank you Amazon and Zen.