weogeo and web2.0

A recent article on All Points Blog pointed me to a new web2.0 GIS data/community site called weogeo, still in Beta. Apart from the “whimsical” web2.0-type name, the site's developers are trying to leverage newer web2.0 capabilities for a data resource community.


Fig.1 weogeo slider workflow interface

The long-standing standard for data resources has been geocomm, with its large library of useful and free data downloads. Geocomm was built roughly a decade ago on a Unix FTP-type architecture. Over the years it has evolved into a full-featured news, data, and software community site by staying focused on low-cost immediate downloads, with an ad revenue business model enhanced by some paid premiere services. It has been successful but, apart from the revenue model, definitely archaic in today's online world.

Curious about the potential, I decided to look at the beta weogeo in detail. I was especially attracted to its use of Amazon's Web Services (AWS):

S3 - Simple Storage Service
EC2 - Elastic Compute Cloud
SQS - Simple Queue Service

Amazon's Web Services take a new approach to commodity web services, metering them at low time-based rates. This approach is called utility computing because its pricing maps to utility industry models. For example, S3 services on Amazon infrastructure are priced as follows:


Storage
  $0.15 per GB-month of storage used

Data Transfer
  $0.10 per GB - all data transfer in
  $0.18 per GB - first 10 TB / month data transfer out
  $0.16 per GB - next 40 TB / month data transfer out
  $0.13 per GB - data transfer out / month over 50 TB
  (Data transfer "in" and "out" refers to transfer into and out of Amazon S3;
  data transferred between Amazon S3 and Amazon EC2 is free of charge.)

Requests
  $0.01 per 1,000 PUT or LIST requests
  $0.01 per 10,000 GET and all other requests


This can translate into convenient offsite storage with costs linked to use. For example, a 250GB data store would incur a $25 upload transfer fee and thereafter a $37.50/month storage fee. Storage fees slide up or down as use varies, so you never pay for more than you need at any given time.
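For readers who want to plug in their own numbers, here is a minimal Java sketch of the rate card above. The figures are the published Beta-era prices, and the tiered outbound calculation simply follows the table:

```java
// Rough S3 cost estimator using the Beta-era rate card above.
public class S3CostEstimate {
    static final double STORAGE_PER_GB_MONTH = 0.15;
    static final double TRANSFER_IN_PER_GB = 0.10;

    // Tiered outbound transfer: first 10 TB at $0.18/GB, next 40 TB at
    // $0.16/GB, anything over 50 TB at $0.13/GB.
    static double transferOutCost(double gb) {
        double tier1 = Math.min(gb, 10000.0);
        double tier2 = Math.min(Math.max(gb - 10000.0, 0.0), 40000.0);
        double tier3 = Math.max(gb - 50000.0, 0.0);
        return tier1 * 0.18 + tier2 * 0.16 + tier3 * 0.13;
    }

    public static void main(String[] args) {
        double gbStored = 250.0; // the 250GB example above
        System.out.printf("Upload (one time): $%.2f%n", gbStored * TRANSFER_IN_PER_GB);
        System.out.printf("Storage per month: $%.2f%n", gbStored * STORAGE_PER_GB_MONTH);
        System.out.printf("Outbound, 60 TB:   $%.2f%n", transferOutCost(60000.0));
    }
}
```

Running it reproduces the $25 upload and $37.50/month figures from the example.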


S3 is not especially novel, but EC2 is, moving from storage to virtual computing resources. A virtual computer can now be set up on an as-needed basis and torn down when no longer needed. EC2 is competitively priced at $0.10 per instance-hour. This is especially attractive to small businesses and startups in the online world, whose computing resources may fluctuate wildly. Ramping up to the highest expected load can require costly investments in dedicated servers at a managed hosting firm, with little flexibility. Amazon's SQS applies a similar pricing model to queued message services on demand.


Weogeo provides a data service built on top of EC2 and S3. Their goal is to attract numerous data producers/sellers and build a community of data sources with a flexible pricing model sitting on top of Amazon's web services.

The weogeo founders come from a large-imagery background, with years of experience in scientific hyperspectral imagery, which produces very large images in the 40GB-per-image range. Distributing large image files places a serious strain not only on bandwidth but also on backend CPU. This is especially true of an environment that hopes to provide flexible export formats and projections. Community sellers are encouraged to build a data resource which can then be provided to the community on a cost basis, using Amazon S3 for storage and EC2 for CPU requirements.

To make this feasible the weogeo development team also came up with an EC2-specific management tool called weoceo, which automates the setup and teardown of virtual computers. For example, a request for a large customized data set can trigger a dedicated server instance specifically for a single download, then remove the instance once it completes. As a side benefit, weoceo adds a service level guarantee by ensuring that lost compute instances are automatically replaced, an effective failover technique. This failover still requires minutes rather than seconds, but the trend toward virtualization points to an eventual future of nearly instant failover instance creation.
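The weoceo internals are not public, but the launch-on-demand pattern it automates is easy to sketch. Here is a minimal Java outline, assuming a hypothetical Ec2Client wrapper over the EC2 API's RunInstances/DescribeInstances/TerminateInstances calls:

```java
// Ec2Client is a hypothetical wrapper over the EC2 API; the real weoceo
// internals are not public.
interface Ec2Client {
    String runInstance(String amiId);     // RunInstances: returns an instance id
    boolean isRunning(String instanceId); // DescribeInstances: poll instance state
    void terminate(String instanceId);    // TerminateInstances
}

class OrderFulfillment {
    private final Ec2Client ec2;
    OrderFulfillment(Ec2Client ec2) { this.ec2 = ec2; }

    // One dedicated instance per large custom order, torn down when done.
    void fulfill(String amiId, Runnable prepareAndServeDownload) throws InterruptedException {
        String id = ec2.runInstance(amiId);
        try {
            while (!ec2.isRunning(id)) {
                Thread.sleep(5000); // boot still takes minutes, not seconds
            }
            prepareAndServeDownload.run();
        } finally {
            ec2.terminate(id); // stop the $0.10/hour meter
        }
    }
}
```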

This service model was evidently used internally by FERI for the large imagery downloads generated by research scientists and academic communities. The use of open source GDAL on EC2 is an effective way to generalize this model and reduce delivery costs for large data resource objects. The developers at weogeo are also extending this distribution model to vector resources.
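I don't know exactly which GDAL invocations weogeo runs, but a format/projection customization step presumably reduces to something like the following; the file names and target projection here are placeholders:

```
gdalwarp -t_srs EPSG:26913 source.tif reprojected.tif   # reproject to UTM zone 13N
gdal_translate -of JPEG reprojected.tif delivery.jpg    # export in the requested format
```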

To try this out I created a few WPF terrain models over Rocky Mountain National Park for entry into the weogeo library. These are vector models which include the necessary model mesh as well as an overlay image and a set of binding controls to manipulate camera view, elevation scale, scene rotation, and tilt. Admittedly there is a small audience for such data sources, but they do require a relatively significant file size for vector data, on the order of 20MB per USGS quad, or 4.5MB compressed.


Fig.2 Fall River Pass quad WPF 3D terrain model


In addition to the actual resource I needed to create a JPG overview image with a JPEG world file and an HTML description page. Weogeo Beta is currently waiving setup costs on a trial basis. The setup is fairly straightforward and can be streamlined for large sets by also creating an XML configuration file for populating the weogeo catalog. Once entered, weogeo lists the resource in its catalog based on coverage. The seller is allowed considerable flexibility in description and pricing.
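For anyone who hasn't built one, a JPEG world file is just six numeric lines: x pixel size, two rotation terms, negative y pixel size, and the map coordinates of the center of the upper-left pixel. The values below are illustrative only, for a 30m-resolution quad in UTM meters:

```
30.0
0.0
0.0
-30.0
441000.0
4474000.0
```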

The user experience is quite beautiful. It appears to be a very nice Ruby on Rails design and seems to fit well in the newer web2.0 world. Those familiar with iPhone slide interfaces will be up and running immediately. The rest of us can appreciate the interface's beauty but will need some experimentation to get used to the workflow. Workflow proceeds in a set of slide views that query the catalog with a two-stage map interface, then provide lists of available resources with clickable sort functions.


Fig.3 weogeo slider maps selection interface

Once a resource is selected, the workflow slides to a thumbnail view with a brief description. Clickable access is provided to more detailed descriptions and a KML view of the thumbnail. The next step in the workflow is a customization slide. This is still only available for imagery, but it includes options like cropping, export format, and projection, which are cost selections priced by weogeo rather than the seller. Finally, an order slide collects payment details and an order is created.


Fig.4 weogeo slider customization interface

At this point the web2.0 model breaks down and weogeo resorts to old web architecture by sending an email link to the user. The URL either provides temporary download access to an EC2 resource set up specifically for the data or directs the user to the seller's fulfillment service. In either case the immediate-fulfillment expectation of web2.0 is not met. This is understandable for very large imagery object sizes but becomes a drawback for typical vector data downloads. This is especially true for users familiar with OWS services, which provide view and data simultaneously.

Since I believe the weogeo interface is effective (perhaps I'm a pushover for beauty), I decided to look deeper into the Amazon services on which it's based. Especially interesting to a small business owner like myself is the low-cost flexibility of EC2. In order to use these services it's necessary first to set up an account and receive public/private access keys for the security system. An X.509 certificate is also set up for signing API requests, while SSH access to particular EC2 instances uses a generated keypair. Each instance provides a virtual 1.7GHz x86 processor, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network bandwidth, with some Linux-variant OS.
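With the Beta-era command line API tools, the getting-started flow looks roughly like this; the AMI id and hostname are placeholders:

```
ec2-add-keypair gsg-keypair      # prints a private key; save it to a local file
ec2-run-instances ami-xxxxxxxx -k gsg-keypair
ec2-describe-instances           # poll until the instance reports "running"
ssh -i id_rsa-gsg-keypair root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```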

Amazon provides both REST and SOAP interface models. REST appears the better choice for S3 since it's easier to set up data streams for larger files, while SOAP is by default memory-bound. Access libraries are provided for Java, Perl, PHP, C#, Python, and Ruby, which seems to cover all the bases. I experimented with some of the sample Java code and was quickly up and running making simple REST requests for bucket creation, object uploading, deletion, etc. It is, however, much easier to populate S3 space using the open source library JetS3t, which includes a GUI tool called Cockpit.
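As a minimal sketch of that upload path, assuming the JetS3t REST API of this era; the credentials, bucket name, and file are placeholders:

```java
import java.io.File;

import org.jets3t.service.S3Service;
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

// Minimal JetS3t upload: create a bucket and push one file into it.
public class S3Upload {
    public static void main(String[] args) throws Exception {
        AWSCredentials creds = new AWSCredentials("ACCESS_KEY", "SECRET_KEY");
        S3Service s3 = new RestS3Service(creds);

        S3Bucket bucket = s3.createBucket("my-terrain-models"); // names are global
        S3Object object = new S3Object(bucket, new File("FallRiverPass.xaml.zip"));
        s3.putObject(bucket, object); // streams the file; MD5 is computed for integrity
    }
}
```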

EC2 depends on S3 for image storage. Amazon provides a few preconfigured Amazon Machine Image files (AMIs), but they are meant as starting points for building your own custom AMI. Once an AMI is built it is saved to an S3 account bucket, from which it is retrieved when building EC2 instances. I experimented with building an AMI and managed to get a basic instance going, customized, and saved without too much trouble by following the "getting started" instructions. I can get around a little in basic Linux, but setting up a usable AMI will require more than the afternoon I had allotted to the effort. I was able to use PuTTY from a Windows system to access the AMI instance, but I would want a VNC server and TightVNC client to get much further. Not having a Remote Desktop is a hindrance for the command-line impaired among us.
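From what I saw, saving a customized instance back to S3 boils down to the AMI tools' bundle/upload/register cycle; the bucket name and credentials below are placeholders:

```
ec2-bundle-vol -d /mnt -k pk.pem -c cert.pem -u <account-id>   # snapshot the root volume
ec2-upload-bundle -b my-ami-bucket -m /mnt/image.manifest.xml -a <access-key> -s <secret-key>
ec2-register my-ami-bucket/image.manifest.xml                  # returns a new AMI id
```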

The main drawback of AWS, and especially EC2, as I noted in comments, is the lack of a Service Level Agreement. EC2, which is still in Beta, will not guarantee the longevity of an instance; this can be mitigated somewhat by adding a management/monitor tool like weoceo. However, the biggest issue I can see for a GIS platform is persistent DB storage. The DB is part of an instance, which is fine until that instance disappears. In that case provision needs to be made not only for a new instance setup using an AMI, but also for a data transaction log capable of preserving any transactions since the last AMI backup. This problem does not seem to be solved very elegantly yet, although some people, such as ScaleDB, seem to be working on it.
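One partial mitigation, for a PostgreSQL-backed instance at least, would be continuous WAL archiving to S3, so a replacement instance booted from the AMI can replay transactions since the last snapshot. A sketch of the relevant config, where push-wal-to-s3.sh is a hypothetical script:

```
# postgresql.conf - ship each completed WAL segment off-instance.
# %p is the segment's path, %f its file name (standard PostgreSQL placeholders);
# push-wal-to-s3.sh is a hypothetical script that copies the segment to S3.
archive_command = '/usr/local/bin/push-wal-to-s3.sh %p %f'
```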

If weogeo extends very far into the vector world, a persistent DB will be necessary. Vector GIS applications, unlike imagery, rest directly on the backing geospatial DB. However, the larger issue in my mind is the difference between vector data models and large imagery objects. Vector DB resources are much more amenable to OWS-type interfaces with immediate-access expectations. The email URL delivery approach will only frustrate weogeo users familiar with OWS services and primed by VE and GE to get immediate results. Even web1.0 sites like geocomm provide immediate FTP download of their library data.

It remains to be seen whether the weogeo service will be successful in a larger sense. In spite of its beautiful Ruby on Rails interface, its community wiki and blog services, its seller incentives, and its scalable AWS foundation, it will have to attract a significant user community with free and immediately accessible data resources to make it as a viable business in a web2.0 world.