2020 The Last Census?

Fig 1 - SF1QP Quantile Population County P0010001 P1.TOTAL POPULATION Universe: Total population

Preparation for the US 2020 Census is underway at this mid-decennial point, and we’ll see activity ramping up over the next few years. Will 2020 be the last meaningful decennial demographic data dump? The US Census has been a data resource since 1790. It took a couple of centuries for Census data to migrate into the digital age, but by Census 2000 data had started trickling into the internet community. At first this was simply a primitive ftp data dump, ftp2.census.gov, still very useful for developers, and finally after 2011 it was exposed as OGC WMS, a TigerWeb UI, and ESRI REST.

However, static data in general, and decennial static data in particular, is fast becoming anachronistic in the modern era. Surely the NSA data tree looks something like phone number JOIN Facebook account JOIN Twitter account JOIN social security id JOIN bank records JOIN IRS records JOIN medical records JOIN DNA sequence….. Why should this data access be limited to a few black budget bureaus? Once the data tree is altered a bit to include mobile devices, static demographics are a thing of the past. Queries in 2030 may well ask “how many 34 year old male Hispanic heads of households with greater than 3 dependents with a genetic predisposition to diabetes are in downtown Denver Wed at 10:38AM, at 10:00PM?” For that matter let’s run the location animation at 10 minute intervals for Tuesday and then compare with Sat.

“Privacy? We don’t need no stinking privacy!”

I suppose Men in Black may find location aware DNA queries useful for weeding out hostile alien grays, but shouldn’t local cancer support groups also be able to ping potential members as they wander by Starbucks? Why not allow soda vending machines to check for your diabetic potential and credit before offering appropriate selections? BTW how’s that veggie smoothie?

Back to Old School

Fig 2 - SF1QD Quantile Density Census Block Group P0050008 P5.HISPANIC OR LATINO ORIGIN BY RACE Universe: Total population

By late 2011 Census OGC services began to appear, along with some front end data web UIs and ESRI REST interfaces. [The ESRI connection is a tightly coupled symbiotic relationship, as the Census Bureau, like many government bureaucracies, relies on ESRI products for both publishing and consuming data. From the outside ESRI could pass as an agency of the federal government. For better or worse, “Arc this and that” are deeply rooted in the .gov GIS community.]

For mapping purposes there are two pillars of Census data, spatial and demographic. The spatial data largely resides as TIGER data while the demographic data is scattered across a large range of products and data formats. In the basement, a primary demographic resource is the SF1, Summary File 1, population data.

“Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied).”

The intersection of SF1 and TIGER is the base level concern of census demographic mapping. There are a variety of rendering options, but the venerable color themed choropleth map is still the most widely recognized. This consists of assigning a value class to a color range and rendering polygons with their associated value color. This then is the root visualization of Census demographics, TIGER polygons colored by SF1 population classification ranges.

Unfortunately, access to this basic visualization is not part of the 2010 TigerWeb UI.

There are likely a few reasons for this, even aside from the glacially slow adoption of technology at the Bureau of the Census. A couple of obvious reasons are the sheer size of this data resource and the range of the statistics gathered. A PostGIS database with a five level primary spatial hierarchy, all 48 SF1 population value files, appropriate indices, plus a few helpful functions consumes a reasonable 302.445 GB of a generic Amazon EC2 SSD elastic block storage. But contained in those 48 SF1 tables are 8912 demographic values, which you are welcome to peruse here. A problem for any UI is how to make 8912 values over 5 spatial levels usable.
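For a rough sanity check of that value count, a catalog query like this works (a sketch only: it assumes the sf1_* table naming shown in Fig 3 and that the demographic value columns follow the P/H prefix pattern):

SELECT count(*)
FROM information_schema.columns
WHERE table_name LIKE 'sf1\_%' ESCAPE '\'
  AND column_name ~ '^[ph][0-9]';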

Fig 3 – 47 SF1 tables plus sf1geo geography join file

Filling a gap

Since the Census Bureau budget did not include public visualization of TIGER/Demographics, what does it take to fill the gap? Census 2010 contains a large number of geographic polygons. The core hierarchy for useful demographic visualization is state, county, tract, block group, and block.

Fig 4 – Census polygon hierarchy

Loading the data into PostGIS affords low cost access to data for SF1 Polygon value queries such as this:

-- block tabblock
SELECT poly.geom, geo.stusab, geo.sumlev, geo.geocomp, geo.state, geo.county,
       geo.tract, geo.blkgrp, geo.block, poly.geoid10, sf1.p0010001, geo.stlogrecno
FROM tabblock poly
JOIN sf1geo geo ON geo.geoid10 = poly.geoid10
JOIN sf1_00001 sf1 ON geo.stlogrecno = sf1.stlogrecno
WHERE geo.geocomp = '00'
  AND geo.sumlev = '101'
  AND ST_Intersects(poly.geom, ST_GeometryFromText('POLYGON ((-104.878035974004 38.9515291859429,
      -104.721023973742 38.9515291859429, -104.721023973742 39.063158980149,
      -104.878035974004 39.063158980149, -104.878035974004 38.9515291859429))', 4269))
ORDER BY geo.state, geo.county, geo.tract, geo.blkgrp, geo.block;

Returning 1571 polygons in 1466 ms. Not too bad, but surely there’s room for improvement. Where is Paul Ramsey when you need him?
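One improvement that costs nothing to try is making sure the bbox filter and joins can use indices. A minimal sketch, assuming the table and column names from the query above (the GiST index is the big win for ST_Intersects):

CREATE INDEX tabblock_geom_gist ON tabblock USING GIST (geom);
CREATE INDEX sf1geo_geoid10_idx ON sf1geo (geoid10);
CREATE INDEX sf1_00001_stlogrecno_idx ON sf1_00001 (stlogrecno);
VACUUM ANALYZE tabblock;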

Fig 5 - PostgreSQL PostGIS Explain Query

Really Old School – WMS

Some considerations on the data:

A. Queries become unwieldy for larger extents with large numbers of polygons

Polygon Counts
county 3,233
tract 74,133
blockgroup 220,740
tabblock 11,166,336

These polygon counts rule out visualizations of the entire USA, or even moderate regions, at tract+ levels of the hierarchy. Vector mapping is not optimal here.

B. The number of possible image tile pyramids for 8912 values over 5 polygon levels is 5 * 8912 = 44,560. This rules out tile pyramids of any substantial depth without some deep Google like pockets for storage. Tile pyramids are not optimal either.

C. Even though vector grid pyramids would help with these 44,560 demographic variations, they suffer from the same restrictions as A. above.

One possible performance/visualization compromise is to use an old fashioned OGC WMS GetMap request scheme that treats polygon types as layer parameters and demographic types as style parameters. With appropriate use of WMS <MinScaleDenominator> and <MaxScaleDenominator> the layers are only rendered at sufficient zoom to reasonably limit the number of polygons. This scheme puts rendering computation right next to the DB on the same EC2 instance, while network latency is reduced to a simple jpeg/png image download. Scaling access to public consumption is still problematic, but for in-house use it does work.

Fig 6 – Scale dependent layer rendering for SF1JP - SF1 Jenks P0010001 (not density)

Fig 7 - a few of 8912 demographic style choices

There are still issues with a scale rendering approach. Since population is not very homogeneous over the US coverage extent, the ideal scale thresholds vary from place to place as well. This is easily visible over population centers. Without some type of pre-calculated density grid, the query is already completed before the ideal scale dependency is known. Consequently, static rendering scales have to be tuned to high population urban regions. Since “fly over” US is generally less interesting to analysts, we can likely live with this compromise.
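As the density figure captions (Fig 2, Fig 8) suggest, the quantity being classed there is population over area rather than the raw count. A sketch of that calculation in PostGIS terms, assuming PostGIS 1.5+ geography support and a tract table following the layout above:

-- people per square kilometer for each tract polygon (sumlev 140 = census tract)
SELECT poly.geoid10,
       sf1.p0010001 / (ST_Area(poly.geom::geography) / 1000000.0) AS density
FROM tract poly
JOIN sf1geo geo ON geo.geoid10 = poly.geoid10
JOIN sf1_00001 sf1 ON geo.stlogrecno = sf1.stlogrecno
WHERE geo.geocomp = '00' AND geo.sumlev = '140';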

Fig 8 - SF1QD SF1 Quantile Density Census Tract P0010001/geographic Area

Classification schemes

Dividing a value curve to display adequately over a viewport range can be accomplished in a few different ways: equal intervals, equal quantiles, Jenks natural breaks optimization, K-means clustering, or “other.” Leaning toward the simpler, I chose a default quantile (guarantees some color) with a ten class single hue progression, which of course is not recommended by ColorBrewer. However, 10 seems an appropriate number for decennial data. I also included a Jenks classifier option, which is considered a better representation. The classifier is based only on visible polygons rather than the entire polygon population. This means region to region comparisons are deceptive, but after all this is visualization of statistics.
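On PostgreSQL 8.4+ the quantile classing can even be pushed into the query itself with a window function. A sketch for the ten class case, where viewport_selection is a hypothetical stand-in for whatever polygons the current GetMap request returned:

-- assign each visible polygon a 1-10 quantile class for the color ramp
SELECT geoid10, p0010001,
       ntile(10) OVER (ORDER BY p0010001) AS class
FROM viewport_selection;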

“There are three kinds of lies: lies, damned lies, and statistics.” Mark Twain

Fig 9 – SF1JP SF1 Jenks Census Tract P0010001 (not density)

Managing Census data on a personal budget involves these compromises:

1. Only expose SF1 demographic data for 2010 i.e. 8912 population value types
2. Only provide primary level polygon hierarchy – state, county, tract, blockgroup, block
3. Code a custom OGC WMS service – rendering GetMap image on the server
4. Resolution scale rendering to limit polygon counts down the polygon hierarchy
5. Provide only quantile and Jenks classifier options
6. Classifier applied only to viewport polygon selection

This is a workable map service for a small number of users. Exposing it as an OGC WMS service offers some advantages. First, there are already a ton of WMS clients available to actually see the results. Second, the query, geometry parsing, and image computation (including any required re-projection) are all server side on the same instance, reducing network traffic. Unfortunately the downside is that the computation cost is significant and discouraging for a public facing service.

Scaling could be accomplished in a few ways:

1. Vertical scaling to a high memory EC2 R3 instance(s) and a memory tuned PostGIS
2. Horizontal auto scaling to multiple instances with a load balancer
3. Storage scaling with pre-populated S3 tile pyramids for upper extents

Because this data happens to be read only for ten years, scaling is not too hard, as long as there is a budget. It would also be interesting to try some reconfiguration of the data into NoSQL type key/value documents, with perhaps each polygon document containing the 8912 values embedded along with the geometry. This would cost a bit in storage size but could decrease query times. NoSQL also offers some advantages for horizontal scaling.
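For illustration only, the document-per-polygon idea can even be mocked up inside PostgreSQL with a key/value column; the table shape here is hypothetical, with hstore standing in for a NoSQL document:

-- one row per polygon: geometry plus all 8912 values as key/value pairs
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE block_doc (
    geoid10 text PRIMARY KEY,
    geom    geometry,
    sf1     hstore    -- e.g. 'p0010001=>1395, p0050008=>128, ...'
);

SELECT geoid10, (sf1 -> 'p0010001')::int AS population
FROM block_doc
WHERE ST_Intersects(geom, ST_MakeEnvelope(-104.878, 38.951, -104.721, 39.063, 4269));

One query fetches geometry plus any subset of the 8912 values without the sf1geo join, at the cost of repeating attribute names in every row.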


The Census Bureau and its census are obviously not going away. The census is a bureaucracy with a curious inertial life stretching back to the founding of our country (United States Constitution, Article 1, Section 2). Although static aggregate data is not going to disappear, dynamic real time data has already arrived on stage in numerous and sundry ways, from big data portals like Google, to marketing juggernauts like Coca Cola and the Democratic Party, to even more sinister black budget control regimes like the NSA.

Census data won’t disappear. It will simply be superseded.

The real issue for 2020 and beyond is how to actually use the data intelligently. Data already overwhelms analytic capabilities. By 2030, will emerging AI manage floods of real time data, replacing human analysts? If Wall Street still exists, will HFT algos lock in dynamic data pipelines at unheard of scale with no human intervention? Even with the help of tools like the R Project, perhaps the human end of data analysis will pass into anachronism along with the decennial Census.

Fig 10 - SF1JP SF1 Jenks Census Blocks P0010001

Hauling Out the Big RAM

Amazon released a handful of new stuff.

“Make that a Quadruple Extra Large with room for a Planet OSM”

Fig 1 – Big Foot Memory

1. New Price for EC2 instances

                US region                    Europe region
                Linux    Windows   SQL      Linux    Windows   SQL
m1.small        $0.085   $0.12              $0.095   $0.13
m1.large        $0.34    $0.48    $1.08     $0.38    $0.52    $1.12
m1.xlarge       $0.68    $0.96    $1.56     $0.76    $1.04    $1.64
c1.medium       $0.17    $0.29              $0.19    $0.31
c1.xlarge       $0.68    $1.16    $2.36     $0.76    $1.24    $2.44

Notice the small instance, now $0.12/hr, matches Azure Pricing

Compute = $0.12 / hour

This is not really apples to apples since Amazon sells a virtual instance, while Azure is priced per deployed application. A virtual instance can have multiple service/web apps deployed.

2. Amazon announces a Relational Database Service RDS
Based on MySQL 5.1, this doesn’t appear to add a whole lot, since you always could start an instance with any database you wanted. MySQL isn’t exactly known for geospatial, even though it has some spatial capabilities. You can see a small comparison of PostGIS vs MySQL by Paul Ramsey. I don’t know if this comparison is still valid, but I haven’t seen much use of MySQL for spatial backends.

This is similar to Azure SQL Server, which is also a convenience deployment that lets you run SQL Server as an Azure service without all the headaches of administration and maintenance tasks. Neither of these options is cloud scaled, meaning they are still single instance versions, not cross partition capable. The SQL Azure Server CTP has an upper limit of 10GB, as in hard drive not RAM.

3. Amazon adds new high memory instances

  • High-Memory Double Extra Large Instance 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform $1.20-$1.44/hr
  • High-Memory Quadruple Extra Large Instance 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform $2.40-$2.88/hr

These are new virtual instance AMIs that scale up as opposed to scaling out. Scaled out options use clusters of instances in the Grid Computing/Hadoop type of architectures. There is nothing to prohibit using clusters of scaled up instances in a hybridized architecture, other than cost. However, the premise of Hadoop arrays is “divide and conquer,” so it makes less sense to have massive nodes in the array. Since scaling out involves moving the problem to a whole new parallel programming paradigm, with all of its consequent complexity, it also means owning the code. In contrast, scaling up is generally very simple: you don’t have to own the code or even recompile, just install on more capable hardware.

Returning to Amazon RDS, Amazon has presumably taken an optimized compiled route and offers prepackaged MySQL 5.1 instances ready to use:

  • db.m1.small (1.7 GB of RAM, $0.11 per hour)
  • db.m1.large (7.5 GB of RAM, $0.44 per hour)
  • db.m1.xlarge (15 GB of RAM, $0.88 per hour)
  • db.m2.2xlarge (34 GB of RAM, $1.55 per hour)
  • db.m2.4xlarge (68 GB of RAM, $3.10 per hour)

Of course the higher spatial functionality of PostgreSQL/PostGIS can be installed on any of these high memory instances as well; it is just not done by Amazon. The important thing to note is that memory approaches 100GB per instance! What does one do with all that memory?

Here is one use:

“Google query results are now served in under an astonishingly fast 200ms, down from 1000ms in the olden days. The vast majority of this great performance improvement is due to holding indexes completely in memory. Thousands of machines process each query in order to make search results appear nearly instantaneously.”
Google Fellow Jeff Dean keynote speech at WSDM 2009.

Having a very large memory footprint makes sense for increasing performance on a DB application. Even fairly large data tables can reside entirely in memory for optimum performance. Whether a database makes use of the best optimized compiler for Amazon’s 64bit instances would need to be explored. Open source options like PostgreSQL/PostGIS would let you play with compiling in your choice of compilers, but perhaps not successfully.

Todd Hoff has some insightful analysis in his post, “Are Cloud-Based Memory Architectures the Next Big Thing?”

Here is Todd Hoff’s point about having your DB run inside of RAM (remember that 68GB Quadruple Extra Large memory):

“Why are Memory Based Architectures so attractive? Compared to disk, RAM is a high bandwidth and low latency storage medium. Depending on who you ask the bandwidth of RAM is 5 GB/s. The bandwidth of disk is about 100 MB/s. RAM bandwidth is many hundreds of times faster. RAM wins. Modern hard drives have latencies under 13 milliseconds. When many applications are queued for disk reads latencies can easily be in the many second range. Memory latency is in the 5 nanosecond range. Memory latency is 2,000 times faster. RAM wins again.”

Wow! Can that be right? “Memory latency is 2,000 times faster.”

(Hmm… 13 milliseconds = 13,000,000 nanoseconds,
so 13,000,000 ns / 5 ns = 2,600,000x? And 5 GB/s / 100 MB/s = 50x? Am I doing the math right?)

The real question, of course, is what will actual benchmarks reveal? Presumably optimized memory caching narrows the gap between disk storage and RAM. Which brings up the problem of configuring a Database to use large RAM pools. PostgreSQL has a variety of configuration settings but to date RDBMS software doesn’t really have a configuration switch that simply caches the whole enchilada.

Here is some discussion of MySQL front-ending the database with In-Memory-Data-Grid (IMDG).

Here is an article on a PostgreSQL configuration to use a RAM disk.

Here is a walk through on configuring PostgreSQL caching and some PostgreSQL doc pages.

Tuning for large memory is not exactly straightforward. There is no “one size fits all.” You can quickly get into Managing Kernel Resources. The two most important parameters are:

  • shared_buffers
  • sort_mem
“As a start for tuning, use 25% of RAM for cache size, and 2-4% for sort size. Increase if no swapping, and decrease to prevent swapping. Of course, if the frequently accessed tables already fit in the cache, continuing to increase the cache size no longer dramatically improves performance.”

OK, given this rough guideline, on a Quadruple Extra Large Instance with 68GB:

  • shared_buffers = 17GB (25%)
  • sort_mem = 2.72GB (4%)

This still leaves plenty of room, 48.28GB, to avoid the dreaded swap pagein by the OS. Let’s assume a more normal 8GB of memory for the OS. We still have 40GB to play with. Looking at sort types in detail may make adding some more sort_mem helpful, so maybe bump it to 5GB. That still leaves an additional 38GB to drop into shared_buffers, for a grand total of 55GB. Of course you have to have a pretty hefty set of spatial tables to use up this kind of space.
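As a sketch, those numbers in postgresql.conf terms (illustrative starting values only; note that the sort_mem of the older docs quoted above is called work_mem from PostgreSQL 8.0 on, and it is allocated per sort operation, not as a global pool):

# postgresql.conf for a 68GB Quadruple Extra Large - starting values, not gospel
shared_buffers = 55GB    # the grand total arrived at above
work_mem = 5GB           # per sort, so multiply by expected concurrent sorts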

Here is a list of PostgreSQL limitations. As you can see, it is technically possible to run out of even 68GB.


Maximum Database Size Unlimited
Maximum Table Size 32 TB
Maximum Row Size 1.6 TB
Maximum Field Size 1 GB
Maximum Rows per Table Unlimited
Maximum Columns per Table 250 – 1600 depending on column types
Maximum Indexes per Table Unlimited

Naturally the Obe duo has a useful posting on determining PostGIS sizes: Determining size of database, schema, tables, and geometry

To get some perspective on size, an Open Street Map dump of the whole world fits into a 90GB EBS Amazon Public Data Set configured for PostGIS with pg_createcluster. It looks like this just happened a couple of weeks ago. Although 90GB is just a little out of reach for even a Quadruple Extra Large, I gather the current size of planet osm is still in the 60GB range, so you might just fit it into 55GB of RAM. It would be a tad tight. Well, maybe an Octuple Extra Large Instance with 136GB is not too far off. Of course who knows how big Planet OSM will ultimately end up being.
See planet.openstreetmap.org

Another point to notice is the 8 virtual cores in a Quadruple Extra Large Instance. Unfortunately

“PostgreSQL uses a multi-process model, meaning each database connection has its own Unix process. Because of this, all multi-cpu operating systems can spread multiple database connections among the available CPUs. However, if only a single database connection is active, it can only use one CPU. PostgreSQL does not use multi-threading to allow a single process to use multiple CPUs.”

Running a single connection query apparently won’t benefit from a multi cpu virtual system, even though the multi-process model definitely helps with multiple concurrent connections.

I look forward to someone actually running benchmarks since that would be the genuine reality check.


Scaling up is the least complex way to boost performance on a lagging application. The Cloud offers lots of choices suitable to a range of budgets and problems. If you want to optimize personnel and adopt a decoupled SOA architecture, you’ll want to look at Azure + SQL Azure. If you want the adventure of large scale research problems, you’ll want to look at instance arrays and Hadoop clusters available in Amazon AWS.

However, if you just want a quick fix, maybe not 2000x but at least some x, better take a look at Big RAM. If you do, please let us know the benchmarks!

Azure and GeoWebCache tile pyramids

Fig 1 – Azure Blob Storage tile pyramid for citylimits

Azure Overview

Shared resources continue to grow as essential building blocks of modern life, key to connecting communities and businesses of all types and sizes. As a result a product like SharePoint is a very hot item in the enterprise world. You can possibly view Azure as a very big, very public, SharePoint platform that is still being constructed. Microsoft and 3rd party services will eventually populate the service bus of this Cloud version with lots and lots of service hooks. In the meantime, even early stage Azure with Web Hosting, Blob storage, and Azure SQL Server makes for some interesting experimental R&D.

Azure is similar to Amazon’s AWS cloud services, and Azure’s pricing follows Amazon’s lead with the familiar “pay as you go, buy what you use” model. Azure offers web services, storage, and queues, but instead of giving access to an actual virtual instance, Azure provides services maintained in the Microsoft Cloud infrastructure. Blob storage, Azure SQL Server, and IIS allow developers to host web applications and data in the Azure Cloud, but only with the provided services. The virtual machine is entirely hidden inside Microsoft’s Cloud.

The folks at Microsoft are probably well aware that most development scenarios have some basic web application and storage component, but don’t really need all the capabilities, and headaches, offered by controlling their own server. In return for giving up some freedom you get the security of automatic replication, scalability, and maintenance along with the API tools to connect into the services. In essence this is a Microsoft only Cloud since no other services can be installed. Unfortunately, as a GIS developer this makes Azure a bit less useful. After all, Microsoft doesn’t yet offer GIS APIs, OGC compliant service platforms, or translation tools. On the other hand, high availability with automatic replication and scalability for little effort are nice features for lots of GIS scenarios.

The current Azure CTP lets developers experiment for free with these minor restrictions:

  • Total compute usage: 2000 VM hours
  • Cloud storage capacity: 50GB
  • Total storage bandwidth: 20GB/day

To keep things simple, since this is my first introduction to Azure, I looked at just using Blob Storage to host a tile pyramid. The Silverlight MapControl CTP makes it very easy to add tile sources as layers so my project is simply to create a tile pyramid and store this in Azure Blob storage where I can access it from a Silverlight MapControl.

In order to create a tile pyramid, I also decided to dig into the GeoWebCache standalone beta 1.2. This is beta and offers some new undocumented features. It is also my first attempt at using geowebcache standalone; generally I just use the version conveniently built into Geoserver. However, since I was only building a tile pyramid rather than serving it, the standalone version made more sense. Geowebcache also provides caching for public WMS services. In cases where a useful WMS is available but not very efficient, it would be nice to cache tiles for at least the subsets useful to my applications.

Azure Blob Storage

Azure CTP has three main components:

  1. Windows Azure – includes the storage services for blobs, queues, and cloud tables as well as hosting web applications
  2. SQL Azure – SQL Server in the Cloud
  3. .NET Services – Service Bus, Access Control Service, Work Flow …

There are lots of walkthroughs for getting started in Azure. It all boils down to getting the credentials to use the service.

Once a CTP project is available the next step is to create a “Storage Account” which will be used to store the tile pyramid directory. From your account page you can also create a “Hosted Service” within your Windows Azure project. This is where web applications are deployed. If you want to use “SQL Azure” you must request a second SQL Azure token and create a SQL Service. The .NET Service doesn’t require a token for a subscription as long as you have a Windows Live account.

After creating a Windows Azure storage account you will get three endpoints and a couple of keys.




Primary Access Key: ************************************
Secondary Access Key: *********************************

Now we can start using our brand new Azure storage account. But to make life much simpler first download the following:

The Azure SDK includes some sample code (HelloWorld, HelloFabric, etc.) to get started using the REST interface. I reviewed some of the samples and started down the path of creating the necessary REST calls for recursively loading a tile pyramid from my local system into an Azure blob storage nomenclature. I was just getting started when I happened to take a look at the CloudDrive sample. This saved me a lot of time and trouble.

CloudDrive lets you treat the Azure service as a drive inside PowerShell. The venerable MSDOS cd, dir, mkdir, copy, del, etc. commands are all ready to go. Wince, I know, I know, MSDOS? I’m sure, if not now, then soon there will be dozens of tools to do the same thing with nice drag and drop UIs. But this works, and I’m old enough to actually remember DOS commands.

First, using the elevated Windows Azure SDK command prompt you can compile and run the CloudDrive with a couple of commands:


Now open Windows PowerShell and execute the MountDrive.ps1 script. This allows you to treat the local Azure service as a drive mount and start copying files into storage blobs.

Fig 2 – Azure sample CloudDrive PowerShell

Creating a connection to the real production Azure service simply means making a copy of MountDrive.ps1 and changing credentials and endpoint to the ones obtained previously.

function MountDrive {
 Param (
  $Account = "sampleaccount",
  $Key = "***************************************",
  $ServiceUrl = "http://sampleaccount.blob.core.windows.net/",
  $DriveName = "Blob",
  $ProviderName = "BlobDrive")

 # PowerShell snap-in setup
 add-pssnapin CloudDriveSnapin -ErrorAction SilentlyContinue

 # Create the credentials
 $password = ConvertTo-SecureString -AsPlainText -Force $Key
 $cred = New-Object -TypeName Management.Automation.PSCredential -ArgumentList $Account, $password

 # Mount the storage service as a drive
 new-psdrive -psprovider $ProviderName -root $ServiceUrl -name $DriveName -cred $cred -scope global
}

MountDrive -ServiceUrl "http://sampleaccount.blob.core.windows.net/" -DriveName "Blob" -ProviderName "BlobDrive"

The new-item command lets you create a new container, with the -Public flag ensuring that files will be accessible publicly. Then the Blob: drive copy-cd command will copy files and subdirectories from the local file system to Azure Blob storage. For example:

PS Blob:\> new-item imagecontainer -Public
Parent: CloudDriveSnapin\BlobDrive::http:\\\devstoreaccount1

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Container 10/16/2009 9:02:22 PM imagecontainer

PS Blob:\> dir

Parent: CloudDriveSnapin\BlobDrive::http:\\\

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Container 10/16/2009 9:02:22 PM imagecontainer
Container 10/8/2009 9:22:22 PM northmetro
Container 10/8/2009 5:54:16 PM storagesamplecontainer
Container 10/8/2009 7:32:16 PM testcontainer

PS Blob:\> copy-cd c:\temp\image001.png imagecontainer\test.png
PS Blob:\> dir imagecontainer

Parent: CloudDriveSnapin\BlobDrive::http:\\\imagecontainer

Type Size LastWriteTimeUtc Name
---- ---- ---------------- ----
Blob 1674374 10/16/2009 9:02:57 PM test.png

Because imagecontainer is public, the test.png image can be accessed in the browser from the local development storage at http://127.0.0.1:10000/devstoreaccount1/imagecontainer/test.png, or, if the image was similarly loaded in a production Azure storage account, at http://sampleaccount.blob.core.windows.net/imagecontainer/test.png.

It is worth noting that Azure storage consists of endpoints, containers, and blobs. There are some further subtleties for large blobs, such as blocks and blocklists as well as metadata, but there is not really anything like a subdirectory. Subdirectories are emulated using slashes in the blob name.
i.e. northmetro/citylimits/BingMercator_12/006_019/000851_002543.png is a container, “northmetro“, followed by a blob name, “citylimits/BingMercator_12/006_019/000851_002543.png”.
The browser can show this image using the local development storage: http://127.0.0.1:10000/devstoreaccount1/northmetro/citylimits/BingMercator_12/006_019/000851_002543.png

Changing to production Azure means substituting the production endpoint for the local “127.0.0.1:10000/devstoreaccount1” like this: http://sampleaccount.blob.core.windows.net/northmetro/citylimits/BingMercator_12/006_019/000851_002543.png

With CloudDrive getting my tile pyramid into the cloud is straightforward and it saved writing custom code.

The tile pyramid – Geowebcache 1.2 beta

Geowebcache is written in Java and synchronizes very well with the GeoServer OGC service engine. The new 1.2 beta version is available as a .war that is loaded into the webapp directory of Tomcat. It is a fairly simple matter to configure geowebcache to create a tile pyramid of a particular Geoserver WMS layer. (Unfortunately it took me almost 2 days to work out a conflict with an existing Geoserver gwc) The two main files for configuration are:

C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\
C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\

geowebcache-servlet.xml customizes the service bean parameters and geowebcache.xml provides setup parameters for tile pyramids of layers. Leaving the geowebcache-servlet.xml at default will work fine when no other Geoserver or geowebcache is around. It can get more complicated if you have several that need to be kept separate. More configuration info.

Here is an example geowebcache.xml that uses some of the newer gridSet definition capabilities. It took me a long while to find the schema for geowebcache.xml; the documentation is still thin for this beta release project.

<?xml version="1.0" encoding="utf-8"?>
<gwcConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

After editing the configuration files, building the pyramid is a matter of pointing your browser at the local webapp and seeding the tiles down to the level you choose with the gridSet you want. The GoogleMapsCompatible gridSet is built into geowebcache and the BingMercator is a custom gridSet that I’ve added with extent limits defined.

This can take a few hours/days depending on the extent and zoom level you need. Once completed I use the CloudDrive PowerShell to copy all of the tiles into Azure blob storage:

PS Blob:\> copy-cd C:\Program Files\Apache Software Foundation\Tomcat 6.0\temp\geowebcache\citylimits

This also takes some time for the resulting 243,648 files totaling about 1GB.

Silverlight MapControl

The final piece in the project is adding the MapControl viewer layer. First I add a new tile source layer in the Map Control of MainPage.xaml:

       <!-- Azure tile source -->
       <m:MapTileLayer x:Name="citylimitsAzureLayer" Opacity="0.5" Visibility="Collapsed">

The tile naming scheme is described in the geowebcache source (see the FilePathGenerator link in the code below). The important point is:

“Most filesystems use btree’s to store the files in directories, so layername/projection_z/[x/(2^(z/2))]_[y/(2^(z/2))]/x_y.extension seems reasonable, since it works sort of like a quadtree. The idea is that half the precision is in the directory name, the full precision in the filename to make it easy to locate problematic tiles. This will also make cache purges a lot faster for specific regions, since fewer directories have to be traversed and unlinked. “

An ordinary tile source class looks just like this:

  public class CityLimitsTileSource : Microsoft.VirtualEarth.MapControl.TileSource
  {
      // the remainder of the tile URI format string was elided in the original post
      public CityLimitsTileSource()
          : base(App.Current.Host.InitParams["src"] + "…")
      {
      }

      public override Uri GetUri(int x, int y, int zoomLevel)
      {
          return new Uri(String.Format(this.UriFormat, x, y, zoomLevel));
      }
  }

However, now I need to reproduce the tile name as it is in the Azure storage container, rather than letting gwc/service/gmaps mediate the nomenclature for me. This took a little digging. The two files I needed to look at turned out to be FilePathGenerator.java and GMapsConverter.java (linked in the code below). GMapsConverter works because Bing Maps follows the same upper left origin convention and spherical mercator projection as Google Maps. Here is the final approach, using the naming system in Geowebcache 1.2.

public class CityLimitsAzureTileSource : Microsoft.VirtualEarth.MapControl.TileSource
{
  public CityLimitsAzureTileSource()
    : base(App.Current.Host.InitParams["azure"] + "citylimits/GoogleMapsCompatible_{0}/{1}/{2}.png")
  {
  }

  public override Uri GetUri(int x, int y, int zoomLevel)
  {
    /*
     * From geowebcache
     * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/storage/blobstore/file/FilePathGenerator.java
     * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/service/gmaps/GMapsConverter.java
     * must convert zoom, x, y into the tile pyramid subdirectory structure used by geowebcache
     */
    int extent = (int)Math.Pow(2, zoomLevel);
    if (x < 0 || x > extent - 1)
    {
      MessageBox.Show("The X coordinate is not sane: " + x);
    }
    if (y < 0 || y > extent - 1)
    {
      MessageBox.Show("The Y coordinate is not sane: " + y);
    }

    // xPos and yPos correspond to the top left hand corner
    y = extent - y - 1;
    long shift = zoomLevel / 2;
    long half = 2 << (int)shift;
    int digits = 1;
    if (half > 10)
    {
      digits = (int)(Math.Log10(half)) + 1;
    }
    long halfx = x / half;
    long halfy = y / half;
    string halfsubdir = zeroPadder(halfx, digits) + "_" + zeroPadder(halfy, digits);
    string img = zeroPadder(x, 2 * digits) + "_" + zeroPadder(y, 2 * digits);
    string zoom = zeroPadder(zoomLevel, 2);

    return new Uri(String.Format(this.UriFormat, zoom, halfsubdir, img));
  }

  /*
   * From geowebcache
   * http://geowebcache.org/trac/browser/trunk/geowebcache/src/main/java/org/geowebcache/storage/blobstore/file/FilePathGenerator.java
   * a way to pad numbers with leading zeros, since I don't know a fast
   * way of doing this in Java.
   * @param number the number to pad
   * @param order the desired total number of digits
   */
  public static string zeroPadder(long number, int order)
  {
    int numberOrder = 1;
    if (number > 9)
    {
      if (number > 11)
      {
        numberOrder = (int)Math.Ceiling(Math.Log10(number) - 0.001);
      }
      else
      {
        numberOrder = 2;
      }
    }

    int diffOrder = order - numberOrder;
    if (diffOrder > 0)
    {
      // prepend diffOrder zeros to reach the requested width
      StringBuilder padding = new StringBuilder(diffOrder);
      while (diffOrder > 0)
      {
        padding.Append("0");
        diffOrder--;
      }
      return padding.ToString() + string.Format("{0}", number);
    }
    else
    {
      return string.Format("{0}", number);
    }
  }
}

I didn’t attempt to change the zeroPadder. Doubtless there is a simple C# String.Format that would replace the zeroPadder from Geowebcache; something like number.ToString("D" + order) ought to produce the same left zero padding.

This works and provides access to tile png images stored in Azure blob storage, as you can see from the sample demo.


Tile pyramids enhance user experience, matching the performance users have come to expect in Bing, Google, Yahoo, and OSM. It is resource intensive to make tile pyramids of large world wide extent and deep zoom levels. In fact it is not something most services can or need provide except for limited areas. Tile pyramids in the Cloud require relatively static layers with infrequent updates.

Although using Azure this way is possible and provides performance, scalability, and reliability, I’m not sure it always makes sense. The costs are difficult to predict for a high volume site as they are based on bandwidth usage as well as storage. Also you may be paying storage fees for many tiles seldom or never needed. Tile pyramid performance is a wonderful thing, but it chews up a ton of storage, much of which is seldom if ever used.

For a stable low to medium volume application it makes more sense to host a tile pyramid on your own server. For high volume sites where reliability is the deciding factor, moving to Cloud storage services may be the right thing. This is especially true where traffic patterns swing wildly or grow rapidly and robust scaling is an ongoing battle.

Azure CTP is of course not as mature as AWS, but obviously it has the edge in the developer community and like many Microsoft technologies it has staying power to spare. Leveraging its developer community makes sense for Microsoft and with easy to use tools built into Visual Studio I can see Azure growing quickly. In time it will just be part of the development fabric with most Visual Studio deployment choices seamlessly migrating out to the Azure Cloud.

Azure release is slated for Nov 2009.

Open up that data, Cloud Data

James Fee looks at AWS data, and here is the Tiger .shp snapshot James mentions: Amazon TIGER snapshot
More details here: Tom MacWright

Too bad it is only Linux/Unix, since I’d prefer to attach to a Windows EC2. TIGER is there as raw data files ready to attach to your choice of Linux EC2, as is Census data galore.

But why not look further?  It’s interesting to think about other spatial data out in the Cloud.

Jeffrey Johnson adds a comment to Spatially Adjusted about OSM, with the question: what form, a pg_dump or a pg database? This moves a little beyond raw Amazon public data sets.

Would it be possible to provide an EBS volume with data already preloaded to PostGIS? A user could then attach the EBS ready to use. Adding a middle tier WMS/WFS like GeoServer or MapServer can tie together multiple PG sources, assuming you want to add other pg databases.

Jeffrey mentions one caveat about the 5GB S3 limit. Does this mark the high end of a snapshot, requiring modularized splitting of OSM data? It doesn’t sound like S3 will be much help in the long run if OSM continues its expansion.

What about OpenAerial? Got to have more room for OpenAerial and someday OpenTerrain(LiDAR)!
EBS offers volumes from 1 GB to 1 TB. Do you need the snapshot (only 5GB) to start a new EBS? Can this accommodate OpenAerial tiles, or OpenLiDAR X3D GeoElevationGrid LOD? Of course we want mix and match deployment in the Cloud.

Would it be possible for Amazon to just host the whole shebang? What do you think, Werner?

Put it out there as an example of an Auto Scaling, Elastic Load Balancing OSM, OpenAerial tile pyramids as CloudFront Cache, OpenTerrain X3D GeoElevationGrid LOD stacks. OSM servers are small potatoes in comparison. I don’t think Amazon wants to be the Open Source Google, but with Google and Microsoft pushing into the Cloud game maybe Amazon could push back a little in the map end.

I can see GeoServer sitting in the middle of all this data delight handing out OSM to a tile client where it is stacked on OpenAerial, and draped onto OpenTerrain. Go Cloud, Go!

SimpleDB and locations

Fig 1 – SimpleDB GeoRSS locations.

GeoRSS from SimpleDB

Amazon’s SimpleDB service is intriguing because it hints at the future of Cloud databases. Cloud databases need to be at least “tolerant of network partitions,” which leads inevitably to Werner Vogels’ “eventually consistent” cloud data. See the previous blog post on Cloud Data. Cloud data is moving toward the scalability horizon discovered by Google. Last week’s announcement on AWS, Elastic MapReduce, is another indicator of movement down the road toward infinite scalability.

SimpleDB is an early adopter of data in the Cloud and is somewhat unlike the traditional RDBMS. My interest is how the SimpleDB data approach might be used in a GIS setting. Here is my experiment in a nutshell:

  1. Add GeoNames records to a SimpleDB domain
  2. See what might be done with Bounding Box queries
  3. Export queries as GeoRSS
  4. Try multiple attributes for geographic alternate names
  5. Show query results in a viewer

GeoNames.org is a creative commons attribution licensed collection of GNS, GNIS, and other named point resources with over 8 million names. Since the SimpleDB beta allows a single domain to grow up to 10 GB, the experiment should fit comfortably, even if I later want to extend it to all countries. A rough size estimate for a name item uses this formula:
raw byte size of all item IDs + 45 bytes per item + raw byte size of all attribute names + 45 bytes per attribute name + raw byte size of all attribute-value pairs + 45 bytes per attribute-value pair.

I chose a subset of 7 attributes from the GeoNames source, <name, alternatenames, latitude, longitude, feature class, feature code, country code>, leading to this rough estimate of storage space:

  • itemid 7 + 45 = 52
  • attribute names 73 + 7*45 = 388
  • attribute values average 85 + 7*45 = 400
  • total = 840 bytes per item x 8,000,000 = 6.72 GB

For experimental purposes I used just the Colombia tab delimited names file. There are 57,714 records in the Colombia names file, CO.txt, which should be less than 50MB. I chose a Spanish language country to check that the utf-8 encoding worked properly. A sample record looks like this:

2593108||Loma El Águila||Loma El Aguila||||5.8011111||7.2833333||T||HLL||CO||||36||||||||0||||151||America/Bogota||2002-02-25

Here are some useful links I used to get started with SimpleDB:
  Developer guide

I ran across this very “simple” SimpleDB code: ‘Simple’ SimpleDB code in a single Java file/class (240 lines). This Java code was enhanced by Alan Williamson to add Map collections for the Put and Get Attribute commands. I had to make some minor changes to allow for multiple duplicate key entries in the HashMap collections. I wanted the capability of using multiple “name” attributes for accommodating alternate names, and eventually alternate translations of names, so Map<String, ArrayList> replaces Map<String, String>.

However, once I got into my experiment a bit, I realized the limitations of urlencoded Get calls prevented loading the utf-8 char set found in Colombia’s Spanish language names. I ended up reverting to the Java version of Amazon’s SimpleDB sample library. I ran into some problems there too, since Amazon’s SimpleDB sample library referenced jaxb-api.jar 2.1 and my local version of Tomcat used an older 2.0 version. I tried some of the suggestions for adding jaxb-api.jar to the /lib/endorsed subdirectory, but in the end just upgrading to the latest version of Tomcat, 6.0.18, fixed my version problems.

One of the more severe limitations of SimpleDB is the single type, “String.” To be of any use in a GIS application I need to do Bounding Box queries on latitude,longitude. The “String” type limitation carries across to queries by limiting them to lexicographical ordering. See: SimpleDB numeric encoding for lexicographic ordering. In order to do a Bounding Box query with a lexicographic ordering, we have to do some work on the latitude and longitude. AmazonSimpleDBUtil includes some useful utilities for dealing with float numbers:
  String encodeRealNumberRange(float number, int maxDigitsLeft, int maxDigitsRight, int offsetValue)
  float decodeRealNumberRangeFloat(String value, int maxDigitsRight, int offsetValue)

Using maxDigitsLeft 3 and maxDigitsRight 7, along with offset 90 for latitude and offset 180 for longitude, encodes the lat,lon pair (1.53952, -72.313633) as ("0915395200", "1076863670"). Basically this moves a float into positive integer space and zero fills left and right to make the results fit lexicographic ordering.
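The same offset-and-zero-fill trick is easy to reproduce elsewhere. For instance, in SQL terms (a sketch of the arithmetic, not how AmazonSimpleDBUtil implements it):

-- latitude 1.53952, offset 90, 3 digits left and 7 right of the decimal
SELECT lpad(floor((1.53952 + 90) * 10000000)::text, 10, '0');     -- '0915395200'
-- longitude -72.313633, offset 180
SELECT lpad(floor((-72.313633 + 180) * 10000000)::text, 10, '0'); -- '1076863670'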

Now we can use a query that will select by bounding box even with the limitation of a lexicographic ordering. For example Bbox(-76.310031, 3.889343, -76.285419, 3.914497) translates to this query:

Select * From GeoNames Where longitude > "1036899690" and longitude < "1037145810" and latitude > "0938893430" and latitude < "0939144970"

Once we can select by an area of interest, what is the best way to make our selection available? GeoRSS is a pretty simple XML feed that is consumed by a number of map viewers, including VE and OpenLayers. Simple format point entries look like this: <georss:point>45.256 -71.92</georss:point>. So we just need an endpoint that will query our GeoNames domain for a bbox and then use the result to create a GeoRSS feed:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
<title>GeoNames from SimpleDB</title>
<subtitle>Experiment with GeoNames in Amazon SimpleDB</subtitle>
<link href="http://www.cadmaps.com/"/>
<author><name>Randy George</name></author>
<entry>
<title>Resguardo Indígena Barranquillita</title>
<description><![CDATA[<a href="http://www.geonames.org/export/codes.html" target="_blank">feature class</a>:L <a href="http://www.geonames.org/export/codes.html" target="_blank">feature code</a>:RESV <a href="http://ftp.ics.uci.edu/pub/websoft/wwwstat/country-codes.txt" target="_blank">country code</a>:CO ]]></description>
<georss:point>1.53952 -72.313633</georss:point>
</entry>
</feed>

There seems to be some confusion about the GeoRSS mime type: application/xml, text/xml, application/rss+xml, and even application/georss+xml all show up in a brief Google search. In the end I used a Virtual Earth api viewer to consume the GeoRSS results, which isn’t exactly known for caring about header content anyway. I worked for awhile trying to get the GeoRSS acceptable to OpenLayers.Layer.GeoRSS but never succeeded. It easily accepted static .xml endpoints, but I was never able to get a dynamic servlet endpoint to work. I probably didn’t find the correct mime type.

The Amazon SimpleDB Java library makes this fairly easy. Here is a sample of a servlet using Amazon’s SelectSample.java approach.

Listing 1 – Example Servlet to query SimpleDB and return results as GeoRSS

This example servlet makes use of the nextToken to extend the query results past the 5s limit. There is also a limit to the number of markers that can be added in the VE sdk. From the Amazon website:
“Since Amazon SimpleDB is designed for real-time applications and is optimized for those use cases, query execution time is limited to 5 seconds. However, when using the Select API, SimpleDB will return the partial result set accumulated at the 5 second mark together with a NextToken to restart precisely from the point previously reached, until the full result set has been returned. “

I wonder if the “5 seconds” indicated in the Amazon quote is correct, as none of my queries seemed to take that long, even with multiple nextTokens.

You can try the results here: Sample SimpleDB query in VE


SimpleDB can be used for bounding box queries. The response times are reasonable, even with the restriction of the String only type and multiple nextToken SelectRequest calls. Of course this is only a 57,000 item domain; I’d be curious to see a plot of domain size vs query response. Obviously at this stage SimpleDB will not be a replacement for a geospatial database like PostGIS, but this experiment does illustrate the ability to use SimpleDB for some elementary spatial queries. This approach could be extended to arbitrary geometry by storing a bounding box for lines or polygons stored as SimpleDB items. By adding additional attributes for llx,lly,urx,ury in lexicographically encoded format, arbitrary bbox selections could return all types of geometry intersecting the selection bbox:

Select * From GeoNames Where (llx > "1036899690" and llx < "1037145810" and lly > "0938893430" and lly < "0939144970")
or (urx > "1036899690" and urx < "1037145810" and ury > "0938893430" and ury < "0939144970")

Unfortunately, Amazon restricts attributes to 1024 bytes, which complicates storing vertex arrays. Practically speaking, this limits geometries to point data.

The only advantage offered by SimpleDB is extending the scalability horizon, which isn’t likely to be a problem with vector data.

New things in Amazon's Cloud

Amazon AWS made a big announcement yesterday regarding Windows on EC2:

There are now a number of Windows 2003 server ami options: Amazon Machine Images

Why does any of this matter to GIS markets? GIS distribution has been revolutionized by a battle of the titans, Google Maps vs Virtual Earth. The popularity of mashups and the continuing spread of location into enterprise business workflows has moved GIS into a browser interface model. However, the backend GIS is still there on servers. Utility cloud computing makes that back end service more affordable to businesses of all sizes, small to large. Even Fortune 500 enterprises can make use of auto-scaling load balancing features for ad hoc distribution of location, either internally or public facing.

Here are the Amazon Windows AMI offerings:

Amazon Public Images – Windows SQL Server Express + IIS + ASP.NET on Windows Server 2003 R2 (64bit)

Amazon Public Images – Windows SQL Server Express + IIS + ASP.NET on Windows Server 2003 R2 Enterprise Authenticated (64bit)

Amazon Public Images – Windows SQL Server 2005 Standard on Windows Server 2003 R2 Enterprise Authenticated (64bit)

Amazon Public Images – Windows SQL Server Express + IIS + ASP.NET on Windows Server 2003 R2 (32bit)

Amazon Public Images – Windows Server 2003 R2 (32bit)

Amazon Public Images – Windows Server Enterprise 2003 R2 (32bit)

Amazon Public Images – Windows Server 2003 R2 (64bit)

Amazon Public Images – Windows Server 2003 R2 Enterprise (64bit)

Amazon Public Images – Windows SQL Server 2005 Standard on Windows Server 2003 R2 (64bit)


Standard Instances Linux/UNIX Windows
Small (Default) $0.10 per hr $0.125 per hr
Large $0.40 per hr $0.50 per hr
Extra Large $0.80 per hr $1.00 per hr
High CPU Instances Linux/UNIX Windows
Medium $0.20 per hr $0.30 per hr
Extra Large $0.80 per hr $1.20 per hr

Windows prices are only slightly higher than the Linux counterparts and cheaper than GoGrid’s. The small windows instance at Amazon EC2 is $0.125/hr ($3 per day) and includes:

Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform

A similar GoGrid instance with 2GB RAM + 160GB storage will run 2 x $0.19 = $0.38/hr in the "Pay as You Go" pricing, considerably more than the Amazon instance. GoGrid does offer the newer Windows Server 2008, and prepaid plans are less expensive at $0.16 to $0.24 per hr for a similar configuration. Also, the slick user interface at GoGrid shows the utility of a visual monitor.

Speaking of user interface, in addition to all of the Windows AMIs there are announcements of future features for EC2:

"Management Console – The management console will simplify the process of configuring and operating your applications in the AWS cloud. You’ll be able to get a global picture of your cloud computing environment using a point-and-click web interface."

"Load Balancing – The load balancing service will allow you to balance incoming requests and traffic across multiple EC2 instances. "

"Automatic Scaling – The auto-scaling service will allow you to grow and shrink your usage of EC2 capacity on demand based on application requirements."

"Cloud Monitoring – The cloud monitoring service will provide real time, multi-dimensional monitoring of host resources across any number of EC2 instances, with the ability to aggregate operational metrics across instances, Availability Zones, and time slots."

These will make EC2 easier to use. Load Balancing and a Management Console have been part of GoGrid’s cloud service for awhile now; they do make life easier. Auto-Scaling will be a great help too. Prior to this, scaling has been a more or less manual process at EC2. The Windows market is not as used to command line Bash shell scripting, so the introduction of a visual UI for monitoring and control makes sense for this new cloud market.

Here is the lowest cost Windows AMI that will be popular with developers:
Amazon Public Images – Windows SQL Server Express + IIS + ASP.NET on Windows Server 2003 R2 (32bit)

It includes the basics for ASP .NET 2.0 web apps on IIS 6.0 with SQL Server Express 2005. Of course once on the system it is easy to upgrade to the newer .NET 3.5 and install all of the GIS stack items required.

Here is the procedure I followed for getting my first Windows ami started:

First sign up and pick up a private/public key along with X509 Certificate at EC2.

Download the latest version of ec2-api-tools

Installation includes setting additional environment variables (EC2_HOME, EC2_PRIVATE_KEY, EC2_CERT) as described in the ec2 Getting Started Guide.


Add to path variable %EC2_HOME%\bin;

After installation verify that the correct ec2-api-tools are installed:
    ec2ver 1.3-26369 2008-08-08

Now we can use the ec2-api-tools:
   ec2-run-instances ami-3934d050 -k gsg-keypair
   ec2-describe-instances <resulting instance id>

Once your instance is running make sure remote desktop service port is open, at least for the default group:
   ec2-authorize default -p 3389

Also you will need to get the randomly assigned administrator password from the new instance using the instance id returned from ec2-describe-instances and the keypair generated earlier:
   ec2-get-password <your instance> -k <full pathname of the gsg_keypair file>

Now it is possible to Remote Desktop to the url furnished by ec2-describe-instances:
   User: administrator
   Pass: *********

Fig 1 – Amazon EC2 Windows 2003 basic instance

Amazon continues to expand its utility computing cloud. Virtual Windows OS has been a big hole out there, and Linux has grabbed a big lead in that market between Google web compute engines and Amazon EC2. Windows on EC2 opens Amazon utility computing to a much broader segment of the market and pushes deeper into the small business community. The economic turmoil of the times and consequent cost savings imperatives should make utility computing even more attractive to businesses large and small. It remains to be seen if Microsoft’s RedDog announcement at PDC will open a new competitive front in the utility computing world.

AWS to offer Windows + SQL Server


The Amazon AWS team just sent out an announcement of a Windows OS offering for later this fall. This confirms rumors floating around on the AWS roadmap and will be a significant boost to EC2 cloud computing. More here … http://aws.typepad.com/aws/2008/10/coming-soon-ama.html

GoGrid has been offering Windows + SQL Server virtual systems for awhile now. It will be interesting to see price comparisons. I imagine that, like GoGrid, AWS Windows will cost more because of the MS license issue. The advantage of GoGrid has been ease of use, hardware load balancing, and nice monitoring tools. On the Amazon side are persistent storage S3 and EBS, along with SQS. I’m looking forward to trying it out.

Cloud computing is growing. It is important as a platform for GIS. OGC WPS, WMS, WCS, and WFS are making a mark on mapping SAAS, and cloud platforms fit this model very well.


Spatial Analytics in the Cloud

Peter Batty has a couple of interesting blogs on Netezza and their recent spatial analytics release. Basically Netezza has developed a parallel system, hardware/software, for processing large spatial queries with scaling improvements of 1 to 2 orders of magnitude over Oracle Spatial. This is apparently accomplished by pushing some of the basic filtering and projection out to the hardware disk reader, as well as the more commonly used parallel Map Reduce techniques. Ref Google’s famous white paper: http://labs.google.com/papers/mapreduce-osdi04.pdf

One comment struck me: Rich Zimmerman mentioned that use of their system eliminates indexing and tuning, essentially substituting the efficiency of brute force parallel processing. There is no doubt that their process is highly effective and successful, given the number of client buy ins as well as Larry Ellison’s attention. I suppose, though, that an Oracle buy out is generally considered the gold standard of competitive pain when Oracle is on the field.

In Peter’s interview with Rich Zimmerman they discuss a simple scenario in which a large number of point records (in the billion range) are joined with a polygon set and processed with a spatial ‘point in polygon’ query. This is the type of analytics that would be common in real estate insurance risk analysis and is typically quite time consuming. Evidently Netezza is able to perform these types of analytics in near real time, which is quite useful in terms of evolving risk situations such as wildfire, hurricane, earthquake, flooding, etc. In these scenarios the domain components are dynamically changing polygons of risk, such as projected wind speed, against a relatively static point set.

Netezza performance improvement factors over Oracle Spatial were in the 100 to 500 range, with Netezza SPU arrays running anywhere from 50 to 1000 units. My guess would be that the performance curve is roughly linear. The interview suggested an amazing 500x improvement over Oracle Spatial with an unknown number of SPUs. It would be interesting to see a performance versus SPU array size curve.

I of course have no details on the Netezza hardware enhancements, but I have been fascinated with the large scale clustering potential of cloud computing, the poor man's supercomputer. In the Amazon AWS model, node arrays are full power virtual systems with consequent generality, as opposed to the more specialized SPUs of the Netezza systems. However, cloud communication has much higher latency than an engineered multi-SPU array. On the other hand, would the $0.10/hr instance cost compare favorably with a custom hardware array? I don't know, but a cloud based solution would be more flexible, scaling up or down as needed. For certain, the cost would be well below even a single-CPU Oracle Spatial license.

Looking at the cited example problem, we are faced with a very large static point set and a smaller, dynamically changing polygon set. The problem is that assigning polygons of risk to each point requires an enormous number of 'point in polygon' calculations.

In thinking about the type of analytics discussed in Peter’s blog the question arises, how could similar spatial analytics be addressed in the AWS space? The following speculative discussion looks at the idea of architecting an AWS solution to the class of spatial analysis problems mentioned in the Netezza interview.

The obvious place to look is AWS Hadoop

Since Hadoop was originally written by the Apache Lucene developers as a way to improve text based search, it does not directly address spatial analytics. Hadoop handles the overhead of scheduling, automatic parallelization, and job/status tracking. The Map Reduce algorithm is provided by the developer as essentially two Java classes:

  Map – public static class MapClass extends MapReduceBase implements Mapper { … }
  Reduce – public static class ReduceClass extends MapReduceBase implements Reducer { … }
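
For concreteness, here is what such a skeleton might look like using the classic org.apache.hadoop.mapred API quoted above. This is a speculative sketch, not code from this post: map() bins point records into tiles and reduce() simply counts points per tile, where a real spatial job would replace the count with a JTS point-in-polygon test. The "pointId,x,y" input format and the fixed 500m grid (standing in for the quadkey signature developed below) are my assumptions.

 // Speculative sketch: bin points into tiles, count points per tile.
 import java.io.IOException;
 import java.util.Iterator;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.*;

 public class TileBinJob {

   public static class MapClass extends MapReduceBase
       implements Mapper<LongWritable, Text, Text, Text> {
     public void map(LongWritable key, Text value,
                     OutputCollector<Text, Text> output, Reporter reporter)
         throws IOException {
       String[] rec = value.toString().split(",");   // assumed "pointId,x,y" records
       double x = Double.parseDouble(rec[1]);
       double y = Double.parseDouble(rec[2]);
       // simple fixed 500m grid cell id; a quadkey signature would go here
       String tile = (int) (x / 500) + "_" + (int) (y / 500);
       output.collect(new Text(tile), value);        // group points by tile
     }
   }

   public static class ReduceClass extends MapReduceBase
       implements Reducer<Text, Text, Text, Text> {
     public void reduce(Text tile, Iterator<Text> points,
                        OutputCollector<Text, Text> output, Reporter reporter)
         throws IOException {
       int n = 0;                                    // points in this tile
       while (points.hasNext()) { points.next(); n++; }
       // a spatial job would run each point against the tile's polygons here
       output.collect(tile, new Text(Integer.toString(n)));
     }
   }

   public static void main(String[] args) throws IOException {
     JobConf conf = new JobConf(TileBinJob.class);
     conf.setJobName("tile-bin");
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(Text.class);
     conf.setMapperClass(MapClass.class);
     conf.setReducerClass(ReduceClass.class);
     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf);
   }
 }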

In theory, with some effort, the appropriate Java Map and Reduce classes could be developed specific to this problem domain, but is there another approach, possibly simpler?

My first thought, like Netezza’s, was to leverage the computational efficiency of PostGIS over an array of EC2 instances. This means dividing the full point set into smaller subsets, feeding these subset computations to their own EC2 instance and then aggregating the results. In my mind this involves at minimum:
 1. a feeder EC2 instance to send sub-tiles
 2. an array of EC2 computational instances
 3. a final aggregator EC2 instance to provide a result.

One approach to this example problem is to treat the very large point set as an array tiled in the tiff image manner with a regular rectangular grid pattern. Grid tiling only needs to be done once or as part of the insert/update operation. The assumptions here are:
 a. the point set is very large
 b. the point set is relatively static
 c. the point distribution is roughly homogeneous

If c is not the case, grid tiling would still work, but with a quad tree tiling pattern that subdivides dense tiles into smaller spatial extents. Applying the familiar string addressing made popular by Google Map and then Virtual Earth with its 0-3 quadrature is a simple approach to tiling the point table.

Fig 2 – tile subdivision

Recursively appending a char from 0 to 3 for each level provides a cell identifier string that can be applied to each tile. For example, '002301' identifies a tile cell NW/NW/SW/SE/NW/NE. So the first step, analogous to spatial indexing, is a pass through the point table calculating a tile signature for each point. This is a time consuming preprocess, basically iterating over the entire table and assigning each point to a tile. An initial density guess can be made to some tile depth. Then, if the point tiles are not homogeneous (very likely), tiles with much higher counts are subdivided recursively until a target density is reached.
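
To make the signature concrete, here is a minimal sketch of the recursive 0-3 addressing described above. This is a hypothetical helper, not code from the post; the extent arguments are the full bounds of the point set.

 // Hypothetical tile signature helper matching the NW=0, NE=1, SW=2, SE=3
 // quadrature described above.
 public static String tileSignature(double x, double y,
     double minX, double minY, double maxX, double maxY, int depth) {
   StringBuilder sig = new StringBuilder();
   for (int i = 0; i < depth; i++) {
     double midX = (minX + maxX) / 2;
     double midY = (minY + maxY) / 2;
     boolean west = x < midX;
     boolean north = y >= midY;
     sig.append(north ? (west ? '0' : '1') : (west ? '2' : '3'));
     if (west) maxX = midX; else minX = midX;   // shrink to the chosen quadrant
     if (north) minY = midY; else maxY = midY;
   }
   return sig.toString();
 }

A point run through this helper at depth 6 gets a six character signature such as '002301', ready to store in the point.tile column.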

Creating an additional tile geometry table during the tile signature calculations is a convenience for processing polygons later on. Fortunately the assumption that the point table is relatively static means that this process occurs rarely.

The tile identifier is just a string signature that can be indexed to pull predetermined tile subsets. Once completed there is a point tile set available for parallel processing with a simple query.
 SELECT point.wkb_geom, point.id
  FROM point
  WHERE point.tile = tilesignature;

Note that tile size can be manipulated easily by changing the WHERE clause slightly to reduce the length of the tile signature. In effect this combines 4 tiles into a single parent tile ('00230%' matches '002300' + '002301' + '002302' + '002303'):
 SELECT point.wkb_geom, point.id
  FROM point
  WHERE point.tile LIKE substring(tilesignature from 1 for length(tilesignature)-1) || '%';

Assuming the polygon geometry set is small enough, the process is simply feeding sub-tile point sets into replicated 'point in polygon' queries such as this PostGIS query:
 SELECT point.id
  FROM point, polygon
  WHERE point.wkb_geom && polygon.wkb_geom
   AND intersects(polygon.wkb_geom, point.wkb_geom);

This is where the AWS cloud computing could become useful. Identical CPU systems can be spawned using a preconfigured EC2 image with Java and PostGIS installed. A single feeder instance contains the complete point table with tile signatures as an indexed PostGIS table. A Java feeder class then iterates through the set of all tiles resulting from this query:
   SELECT DISTINCT point.tile FROM point ORDER BY point.tile;

Using a DISTINCT query eliminates empty tiles, as opposed to simply iterating over the entire tile universe. Again, a relatively static point set implies a static tile set, so this query only occurs in the initial setup. Alternatively, a select on the tile table where the wkb_geom is not null would produce the same result, probably more efficiently.

Each point set resulting from the query below is then sent to its own AWS EC2 computation instance.
 foreach tilesignature in DISTINCT point.tile
  SELECT point.wkb_geom, point.id
  FROM point
  WHERE point.tile = tilesignature;
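
As a sketch, that feeder loop might look like the following JDBC fragment against the feeder instance's PostGIS. The connection details are assumptions, and the dispatch step is elided here; the S3/SQS handoff is sketched further below.

 // Hypothetical feeder loop: iterate distinct tile signatures, then pull
 // each tile's point subset for dispatch to a computation instance.
 import java.sql.Connection;
 import java.sql.DriverManager;
 import java.sql.PreparedStatement;
 import java.sql.ResultSet;
 import java.sql.Statement;

 public class Feeder {
   public static void main(String[] args) throws Exception {
     Class.forName("org.postgresql.Driver");
     Connection con = DriverManager.getConnection(
         "jdbc:postgresql://localhost/gis", "user", "pass");  // placeholder DB
     Statement st = con.createStatement();
     ResultSet tiles = st.executeQuery(
         "SELECT DISTINCT point.tile FROM point ORDER BY point.tile");
     PreparedStatement ps = con.prepareStatement(
         "SELECT point.wkb_geom, point.id FROM point WHERE point.tile = ?");
     while (tiles.next()) {
       String tilesignature = tiles.getString(1);
       ps.setString(1, tilesignature);
       ResultSet points = ps.executeQuery();
       while (points.next()) {
         long id = points.getLong("id");
         Object geom = points.getObject("wkb_geom");
         // ... serialize (id, geom) into this tile's work unit for dispatch
       }
       points.close();
     }
     con.close();
   }
 }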


The polygon set also has assumptions:
 a. the polygon set is dynamically changing
 b. the polygon set is relatively small

Selecting the subset of polygons that match a sub-tile of points is pretty efficient using the tile table created earlier:
 SELECT polygon.wkb_geom
   FROM tile INNER JOIN polygon
    ON tile.wkb_geom && polygon.wkb_geom
  WHERE tile.id = tilesignature;

Now the feeder instance can send a subset of points along with a matching subset of polygons to a computation EC2 instance.

Connecting EC2 instances

However, at this point I run into a problem! I was simply glossing over the “send” part of this exercise. The problem in most parallel algorithms is the communication latency between processors. In an ideal world shared memory would make this extremely efficient, but EC2 arrays are not connected this way. The cloud is not so efficient.

AWS does include additional tools: Simple Storage Service (S3), Simple Queue Service (SQS), and SimpleDB. S3 is a type of shared storage, SQS is a type of asynchronous message passing, and SimpleDB provides a cloud based DB capability for structured data.

S3 is appealing because writing collections of polygon and point sets should be fairly efficient, one S3 object per tile unit. At the other end, computation instances would read from the S3 input bucket and write results back to a result output bucket. The aggregator instance can then read from the output result bucket.

However, implicit in this S3 arrangement is a great deal of scheduling traffic, and SQS is an asynchronous messaging system provided for just this type of problem. Since messages are being sent anyway, why even use S3? Because SQS messages are limited to 8k of text, they are not sufficient for large object communications. Besides, point sets may not even change from one cycle to the next. The best approach is to copy each tile point set to an S3 object, with separate S3 objects for the polygon tile sets, and then add an SQS message to the queue. The computation instances read from the SQS message queue and load the identified S3 objects for processing. Note that point tile sets will only need to be written to S3 once, at the initial pass; subsequent cycles will only be updating the polygon tile sets. Hadoop would handle all of this in a robust manner, taking into account failed systems and lost messages, so it may be worth a serious examination.
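
Here is a sketch of that handoff using the AWS SDK for Java, which postdates the tooling of this post; the bucket name, queue URL, and file paths are placeholders.

 // Speculative feeder-to-worker handoff: write tile work units to S3,
 // then alert a computation instance with a small SQS message naming the tile.
 import java.io.File;
 import com.amazonaws.services.s3.AmazonS3Client;
 import com.amazonaws.services.sqs.AmazonSQSClient;
 import com.amazonaws.services.sqs.model.SendMessageRequest;

 public class TileDispatch {
   public static void main(String[] args) {
     AmazonS3Client s3 = new AmazonS3Client();    // credentials from the environment
     AmazonSQSClient sqs = new AmazonSQSClient();
     String tile = "002301";                      // tile signature for this work unit

     // one S3 object per tile for points, and a separate one for polygons
     s3.putObject("tile-bucket", "points/" + tile, new File("/tmp/points-" + tile + ".wkb"));
     s3.putObject("tile-bucket", "polygons/" + tile, new File("/tmp/polygons-" + tile + ".wkb"));

     // the SQS message itself carries only the tile id, well under the 8k limit
     sqs.sendMessage(new SendMessageRequest(
         "https://sqs.us-east-1.amazonaws.com/123456789012/tile-queue", tile));
   }
 }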

SimpleDB is not especially useful in this scenario, because the feeder instance’s PostGIS is much more efficient at organizing tile objects. As long as the point and polygon tables will fit in a single instance it is better to rely on that instance to chunk the tiles and write them to S3, then alerting computational instances via SQS.

Once an SQS message is read by the target computation instance, how exactly should we arrange the computation? Although tempting, using PostGIS again brings up some problems. The point and polygon object sets would need to be added to two tables, indexed, and then queried with 'point in polygon.' This does not sound efficient at all! A better approach might be to read the S3 objects with their point and polygon geometry sets through a custom Java class based on the JTS Topology Suite.

Our preprocess has already optimized the two sets using a bounds intersect based on a tile structure so plain iteration of all points over all polygons in a single tile should be fairly efficient. If the supplied chunk is too large for the brute force approach, a more sophisticated JTS extension class could index by polygon bbox first and then process with the Intersect function. This would only help if the granularity of the message sets was large. Caching tile point sets on the computational instances could also save some S3 reads reducing the computation setup to a smaller polygon tile set read.
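
Here is a minimal JTS sketch of that per-tile iteration, with geometries inlined as WKT for illustration; a real computation instance would parse them from the S3 objects.

 // Sketch: bbox filter first (mirroring the && operator), full intersects second.
 import java.util.ArrayList;
 import java.util.List;
 import com.vividsolutions.jts.geom.Geometry;
 import com.vividsolutions.jts.io.WKTReader;

 public class TilePip {
   public static void main(String[] args) throws Exception {
     WKTReader reader = new WKTReader();
     List<Geometry> points = new ArrayList<Geometry>();
     points.add(reader.read("POINT(500100 4399200)"));
     List<Geometry> polygons = new ArrayList<Geometry>();
     polygons.add(reader.read(
         "POLYGON((500000 4399000, 500500 4399000, 500500 4399500,"
         + " 500000 4399500, 500000 4399000))"));

     for (Geometry point : points) {
       for (Geometry polygon : polygons) {
         // cheap envelope check, then the full intersects test
         if (polygon.getEnvelopeInternal().intersects(point.getCoordinate())
             && polygon.intersects(point)) {
           // collect (point.id, polygon.id) for the result S3 object
           System.out.println("hit: " + point + " in " + polygon);
         }
       }
     }
   }
 }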

This means that there is a bit of experimental tuning involved. A too fine grained tile chews up time in the messaging and S3 reads, while a coarse grained tile takes more time in the Intersect computation.

Finally, each computation instance stores its result set to an S3 result object consisting of a collection of point.ids and the associated polygon.ids that intersect each point. Sending an SQS message to the aggregator alerts it to the availability of result updates. At the other end, the aggregator takes the S3 result objects and pushes them into an association table of point.id, polygon.id (a 'pip' table). The aggregator instance can be a duplicate of the original feeder instance, with its complete PostGIS DB already populated with the static point table and the required relation table (initially empty).

If this AWS system can be built and can process in reasonable time, an additional enhancement suggests itself. Assuming that risk polygons are being generated by other sources, such as the National Hurricane Center, it would be nice to update the polygon table on an ongoing basis. Adding a polling class to check for new polygons and update our PostGIS table would allow the polygons to be updated in near real time. Each time a pass through the point set is complete it could be repeated automatically, reflecting any polygon changes. Continuous cycling through the complete tile set incrementally updates the full set of points.

At the other end, our aggregator instance would be continuously updating the point.id, polygon.id relation table one sub-tile at a time as the SQS result messages arrive. The decoupling afforded by SQS is a powerful incentive to use this asynchronous message communication. The effect is like a slippy map interface with subtiles continuously updating in the background, automatically registering risk polygon changes. Since risk polygons are time dependent, it would also be interesting to keep timestamped histories of the polygons, providing for historical progressions by adding a time filter to our tile polygon select. The number of EC2 computation instances determines the speed of these update cycles, up to the latency limit of SQS and S3 read/writes.

Visualization of the results might be an interesting exercise in its own right. Continuous visualization could be attained by making use of the aggregator relation table to assign some value to each tile. For example in pseudo query code:
 foreach tilesignature in tile table {
   SELECT AVG(polygon.attribute)
    FROM point, pip, polygon
    WHERE pip.pointid = point.id
     AND pip.polygonid = polygon.id
     AND point.tile = tilesignature;
 }

Treating each tile as a pixel lets the aggregator create polygon.value heat maps assigning a color and/or alpha transparency to each png image pixel. Unfortunately this would generally be a coarse image but it could be a useful kml GroundOverlay at wide zooms in a Google Map Control. These images can be readily changed by substituting different polygon.attribute values.
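
As a sketch of the tile-as-pixel idea: assuming the aggregator has already filled a per-tile array of AVG(polygon.attribute) values (the tileValues array below is a hypothetical stand-in for those query results), the png heat map is just one packed ARGB pixel per tile.

 // Sketch: render a per-tile value grid as a semi-transparent png heat map.
 import java.awt.image.BufferedImage;
 import java.io.File;
 import javax.imageio.ImageIO;

 public class TileHeatMap {
   public static void main(String[] args) throws Exception {
     int w = 64, h = 64;                           // tile grid dimensions (assumed)
     double[][] tileValues = new double[w][h];     // per-tile AVG(polygon.attribute)
     BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
     for (int i = 0; i < w; i++) {
       for (int j = 0; j < h; j++) {
         int red = (int) Math.min(255.0, tileValues[i][j]); // scale value into 0-255
         int alpha = 128;                                   // semi-transparent overlay
         img.setRGB(i, j, (alpha << 24) | (red << 16));     // packed ARGB pixel
       }
     }
     ImageIO.write(img, "png", new File("heatmap.png"));    // one pixel per tile
   }
 }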

If Google Earth is the target visualization client, using a Geoserver on the aggregator instance would allow a kml reflector to kick in at lower zoom levels, showing point level detail as <NetworkLink> overlays based on the polygon.attributes associated with each point. GE is a nice client since it will handle refreshing the point collection after each zoom or pan, as long as the view is within the assigned Level of Detail. The Geoserver kml reflector essentially provides all this almost for free, once the point featureType layer is added. Multiple risk polygon layers can also be added through Geoserver for client views with minimal effort.


This is pure speculation on my part, since I have not had the time or money to really play with message driven AWS clusters. As an architecture, though, it has merit. Adjustments in the tile granularity essentially adjust the performance, up to the limit of SQS latency. Cheap standard CPU instances would work for the computational array, but there will be additional compute loads on the feeder and aggregator, especially if the aggregator does double duty as a web service. Fortunately AWS provides scaling classes of virtual hardware as well. Making use of a feeder instance based on a medium CPU adds little to system cost:
$0.20 – High-CPU Medium Instance
1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform
(note: a High-CPU Extra Large instance could provide enough memory for an in-memory point table – PostgreSQL memory tuning)

The aggregator end might benefit from a high cpu instance:
$0.80 – High-CPU Extra Large Instance
7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

A minimal system might be a $0.20 feeder => five $0.10 computation instances => $0.80 aggregator, or $1.50/hr plus whatever data transfer costs accrue. Keeping the system in a single zone to reduce SQS latency would be a good idea, and in-zone data transfer is free.

Note that 5 computation instances are unlikely to provide sufficient performance. However, a nice feature of AWS cloud space is the adjustability of the configuration. If 5 is insufficient, add more. If the point set is reduced, drop off some units. If the polygon set increases substantially, divide off the polygon tiling to its own high CPU instance. If your service suddenly gets slashdotted, perhaps a load balanced webservice farm could be arranged? The commitment is just what you need, and it can be adjusted within a few minutes or hours, not days or weeks.


Again, this is a speculative discussion to keep my notes available for future reference. I believe that this type of parallelism would work for the class of spatial analytics problems discussed, and it is particularly appealing for web visualization with continuous updating. The cost is not especially high, but unknown pitfalls may await. Estimating four weeks of development and $1.50/hr EC2 costs leads to $7000 – $8000 for proof of concept development, with an ongoing operational cost of about $1500/mo for a small array of 10 computational units. The class of problems involving very large point sets against polygons should be fairly common in insurance risk analysis, emergency management, and Telco/Utility customer base systems. Cloud arrays can never match the 500x performance improvement of Netezza, but cost should place them at the low end of the cost/performance spectrum. Maybe 5min cycles rather than 5sec are good enough. Look out Larry!

A quick look at GoGrid

Fig 1 – a sample ASP .NET 3.5 website running on a GoGrid server instance

GoGrid is a cloud service similar to AWS (http://www.gogrid.com). Just like Amazon's AWS EC2, the user starts a virtual server instance from a template and then uses the instance like a dedicated server. The cost is similar to AWS, starting at about $0.10 per hour for a minimal server. The main difference from a user perspective is the addition of Windows servers and an easy to use control panel. The GoGrid control panel provides point and click setup of server clusters, even including a hardware load balancer.

The main attraction for me is the availability of virtual Windows Servers. There are several Windows 2003 configuration templates as well as sets of RedHat or CentOS Linux templates:
· Windows 2003 Server (32 bit)/ IIS
· Windows 2003 Server (32 bit)/ IIS/ASP.NET/SQL Server 2005 Express Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Express Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Workgroup Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Standard Edition

The number of templates is more limited than EC2, and I did not see a way to create custom templates. However, this limitation is offset by ease of management. For my experiment I chose the Windows 2003 Server (32 bit)/ IIS/ASP.NET/SQL Server 2005 Express Edition template. This offered the basics I needed to serve a temporary ASP web application.

After signing up, I entered my GoGrid control panel. Here I can add a service by selecting from the option list.

Fig 2 – GoGrid Control Panel

Filling out a form with the basic RAM, OS, and Image lets me add a WebApp server to my panel. I could additionally add several WebApp servers and configure a LoadBalancer, along with a backend database server, by similarly filling out Control Panel forms. This appears to take the AWS EC2 service a step further by letting typical scaling workflows be part of the front end GUI. Although scaling in this manner can be done in AWS, it requires installation of a software load balancer on one of the EC2 instances and a manual setup process.

Fig 3 – example of a GoGrid WebApp configuration form

Once my experimental server came on line I was able to RemoteDesktop into the server and begin configuring my WebApp. I first installed the Microsoft .NET 3.5 framework so I could make use of some of its new features. I then copied up a sample web application showing the use of a GoogleMap Earth mode control in a simple ASP interface. This display interface connects to a different database server for displaying GMTI results out of a PostGIS table.

Since I did not want to point a domain at this experimental server, I simply assigned the GoGrid IP to my IIS website. I ran into a slight problem here because the sample webapp was created using the .NET 3.5 System.Web.Extensions. The webapp was not able to recognize the extension configurations in my WebConfig file. I tried copying the System.Web.Extensions dlls into my webapp bin folder, but was still getting errors. I then downloaded the ASP Ajax control and installed it on the GoGrid server, but still was unable to get the website to display. Finally I went back to Visual Studio and remade the webapp using the ASP.NET Web App template without the extensions. I was then able to upload to my GoGrid server and configure IIS to see my website as the default http service.

There was still one more problem. I could see the website from the local GoGrid system but not from outside. After contacting GoGrid support, I was quickly in operation with a pointer to the Windows Firewall, which GoGrid support kindly fixed for me. The problem was that the Windows 2003 template I chose does not open port 80 by default. I needed to use the Firewall manager to open port 80 for the http service. For those wanting to use ftp, the same would be required for port 21.

I now had my experimental system up and running. I had chosen a 1Gb memory server, so my actual cost on the server is $0.19/hour, which is a little less for the money than the comparable AWS EC2 small instance:

$0.10 – Small Instance (Default)
1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform

But again, running ASP .NET 3.5 is much more complex on EC2, requiring a Mono installation on a Linux base. I have not yet tried that combination and somehow doubt that it would work with a complex ASP .NET 3.5 website, especially with Ajax controls.

The GoogleMap Control with the Earth mode was also interesting. I had not yet embedded this into an ASP website, and it proved to be fairly simple. I just needed to add a <asp:ScriptManager ID="mapscriptmanager" runat="server"/> to my site Master page, and then the internal page javascript used to create the GoogleMap Control worked as normal.

I had some doubts about accessing the GMTI points from the webapp, since there are often restrictions on cross domain xmlhttpRequests. There was no problem. My GMTI to KML servlet produces the kml mime type "application/vnd.google-earth.kml+xml", which is picked up in the client javascript using the Google API:
geoXml = new GGeoXml(url);

Evidently cross domain restrictions did not apply in this case, which made me happy, since I didn’t have to write a proxy servlet just to access the gmti points on a different server.

In summary, GoGrid is a great cloud service that finally opens the cloud to Microsoft shops where Linux is not an option. The GUI control panel is easy to use, and a fully scalable, load balanced cluster can be configured right from the control panel. GoGrid fills a big hole in the cloud computing world.

Deep Zoom a TerraServer UrbanArea on EC2

Fig 1 – Silverlight MultiScaleImage of a high resolution Denver image – 200.6Mb .png

Just to show that I can serve a compiled Deep Zoom Silverlight app from various Apache servers, I loaded this Denver example on a Windows 2003 Apache Tomcat here: http://www.web-demographics.com/Denver, and then a duplicate on a Linux Ubuntu 7.10 running as an instance in the Amazon EC2, this time using Apache httpd, not Tomcat: http://www.gis-ows.com/Denver. Remember these are using beta technology and will require updating to Silverlight 2.0. The Silverlight install is only about 4.5Mb, so the install is relatively painless on a normal bandwidth connection.

Continuing the exploration of Deep Zoom, I've had a crash course in Silverlight. Silverlight is theoretically cross browser compatible (at least for IE, Safari, and FireFox), and it's also cross server. The trick for compiled Silverlight is to use Visual Studio 2008 with the .NET 3.5 updates. Under the list of new project templates is a template called 'Silverlight application'. Using this template sets up a project that can be published directly to the webapp folder of my Apache server. I have not tried a DeepZoom MultiScaleImage on Linux FireFox or Mac Safari clients. However, I can view this on a Windows XP FireFox updated to Silverlight 2.0 Beta, as well as on Silverlight-updated IE7 and IE8 beta.

Creating a project called Denver and borrowing liberally from a few published examples, I was able to add a ClientBin folder under my Denver_Web project folder. Into this folder goes the pyramid I generate using Deep Zoom Composer. Once the pyramid is copied into place I can reference this source from my MultiScaleImage element source. Now the pyramid is viewable.

To make the MultiScaleImage element useful, I added a couple of additional .cs touches for mousewheel and drag events. Thanks to the published work of Lutz Gerhard, Peter Blois, and Scott Hanselman, this was just a matter of including a MouseWheelHelper.cs in the project namespace and adding a few delegate functions to the main Page initialization code-behind file (Pan and Zoom .cs).

Now I need to backtrack a bit. How do I get some reasonable Denver imagery for testing this Deep Zoom technology? Well, I don't belong to DRCOG, which I understand is planning on collecting 6″ aerials. There are other imagery sets floating around Denver as well, I believe down to 3″ pixel resolution. However, the cost of aerial capture precludes any free and open source type of use. Fortunately, there is some nice aerial data available from the USGS. The USGS Urban Area imagery is available for a number of metropolitan areas, including Denver.

Fig 2 – Same high resolution Denver image zoomed in to show detail

USGS Urban Area imagery is a color orthorectified image set captured at approximately 1ft pixel resolution. The data is made available to the public through the TerraServer WMS. Looking over the TerraServer GetCapabilities, I see that I can 'GetMap' the UrbanArea layer in EPSG:26913 (UTM83-13m). The best possible pixel resolution through the TerraServer WMS is 0.25m per pixel. To achieve this level of resolution I can use the max pixel Height and Width of 2000 over a metric bounding box of 500m x 500m. http://gisdata.usgs.net/IADD/factsheets/fact.html

For example:
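
The original example request is not preserved here, but a representative GetMap request would look something like the following, with a placeholder endpoint and an illustrative Denver-area UTM bounding box (500m x 500m at 2000 x 2000 pixels for the 0.25m resolution):

 http://<terraserver-wms-endpoint>?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&LAYERS=UrbanArea&SRS=EPSG:26913&BBOX=499500,4399000,500000,4399500&WIDTH=2000&HEIGHT=2000&FORMAT=image/tiff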

This is nice data but I want to get the max resolution for a larger area and mosaic the imagery into a single large image that I will then feed into the Deep Zoom Composer tool for building the MultiScaleImage pyramid. Java is the best tool I have to make a simple program to connect to the WMS and pull down my images one at a time into the tiff format.
 try {
     File outFile = new File(dir + imageFileName);
     URL u = new URL(url);                                  // WMS GetMap request url
     HttpURLConnection geocon = (HttpURLConnection) u.openConnection();
     BufferedImage image = ImageIO.read(geocon.getInputStream());
     ImageIO.write(image, "tiff", outFile);                 // save the tile to disk
     geocon.disconnect();
     System.out.println("download completed to " + dir + imageFileName + " " + bbox);
 } catch (IOException e) {
     e.printStackTrace();
 }

Looping this over my desired area creates a directory of 11.7Mb tif images. In my present experiment I grabbed a set of 6×6 tiles, or 36 tiff files at a total of 412Mb. The next step is to collect all of these tif tiles into a single mosaic. The Java JAI package contains a nice tool for this called mosaic:
mosaic = JAI.create("mosaic", pbMosaic, new RenderingHints(JAI.KEY_IMAGE_LAYOUT, imageLayout));
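
Fleshed out, the mosaic step might look something like this sketch. The tile file names, the 2000px grid spacing, and the use of ParameterBlockJAI are my assumptions; my actual program also applied an ImageLayout rendering hint, omitted here.

 // Speculative reconstruction of the mosaic step: load each tile, translate it
 // to its grid position, add it as a mosaic source, then store the result.
 import javax.media.jai.JAI;
 import javax.media.jai.ParameterBlockJAI;
 import javax.media.jai.RenderedOp;
 import javax.media.jai.operator.MosaicDescriptor;
 import javax.media.jai.operator.TranslateDescriptor;

 public class TileMosaic {
   public static void main(String[] args) {
     ParameterBlockJAI pbMosaic = new ParameterBlockJAI("mosaic");
     pbMosaic.setParameter("mosaicType", MosaicDescriptor.MOSAIC_TYPE_OVERLAY);
     for (int row = 0; row < 6; row++) {
       for (int col = 0; col < 6; col++) {
         RenderedOp tile = JAI.create("fileload",
             "tile_" + row + "_" + col + ".tif");           // hypothetical tile names
         RenderedOp translated = TranslateDescriptor.create(tile,
             (float) (col * 2000), (float) (row * 2000), null, null);
         pbMosaic.addSource(translated);                    // position tile in the grid
       }
     }
     RenderedOp mosaic = JAI.create("mosaic", pbMosaic);
     JAI.create("filestore", mosaic, "denver_mosaic.png", "PNG");
   }
 }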

Iterating pbMosaic.addSource(translated); over my set of TerraServer tif files and then using PNGImageEncoder, I am able to create a single png file of about 200Mb. Now I have a sufficiently large image to drop into the Deep Zoom Composer for testing. The resulting pyramid of jpg files is then copied into my ClientBin subdirectory of the Denver VS2008 project. From there it is published to the Apache webapp. Now I can open my Denver webapp for viewing the image pyramid. On this client system, with a good GPU and dual core cpu, image zoom and pan are quite smooth, replicating a nice local viewing application with real time zoom and pan transitions. On an older Windows XP system running FireFox the pan and zoom is very similar. That system has no GPU, so I am impressed.

Peeking into the pyramid I see that the bottom level, 14, contains 2304 images for the 200Mb png source. Each image stays at 256×256, and the compression ranges from 10kb to 20kb per tile. Processing into the jpg pyramid compresses from the original 412Mb tif set => 200.5Mb png mosaic => 45.7Mb jpg pyramid of 3084 files. Evidently there is a bit of lossy compression, but the end effect is that the individual tiles are small enough to stream into the browser at a decent speed. Connected with high bandwidth, the result is very smooth pan and zoom. This is basically a Google Earth or Virtual Earth user experience, all under my control!

Now that I have a workflow and a set of tools, I wanted to see what limits I would run into. The next step was to increment my tile set to 8×8, or 64 tifs, to see whether my mosaic tool and the Deep Zoom Composer would endure the larger size. My JAI mosaic will be the sticking point on maximum image size, since the source images are built in memory, which on this machine is 3Gb. Taking into account Vista's footprint, I can actually only get about 1.5Gb. One possible workaround to that bottleneck is to create several mosaics and then attempt to splice them in the Deep Zoom Composer by manually positioning them before exporting to a pyramid.

First I modified my mosaic program to write jpeg output with jpgParams.setQuality(1.0f); this results in a faster mosaic and a smaller export, since the JAI PNG encoder is much slower than JPEG. With this modification I was able to export a couple of 3000m x 3000m mosaics as jpg files. I then used Deep Zoom Composer to position the two images horizontally and exported them as a single collection. In the end the image pyramid is 6000m x 3000m and 152Mb of jpg tiles. It looks like I might be able to scale this up to cover a large part of the Denver metro UrbanArea imagery.

The largest mosaic I was able to get Deep Zoom Composer to accept was 8×8, or 16000px x 16000px, which is just 4000m x 4000m on the ground. Feeding this 143Mb mosaic through Composer resulted in a pyramid consisting of 5344 jpg files at 82.3Mb. However, scaling to a 5000m x 5000m set of 100 tifs, a 221Mb mosaic, failed on import to Deep Zoom Composer. I say failed, but in this prerelease version the import finishes with a blank image shown on the right. Export works in the usual quirky fashion, in that the export progress bar generally never stops, but in this case the pyramid also remains empty. Another quirky item to note is that each use of Deep Zoom Composer starts a SparseImageTool.exe process which continues consuming about 25% of cpu even after the Deep Zoom Composer is closed. After working awhile you will need to go into task manager and close down these processes manually. Apparently this is "pre-release."

Fig 3 – Same high resolution Denver image zoomed in to show detail of Coors Field; players are visible

Deep Zoom is an exciting technology. It allows map hackers access to real time zoom and pan of large images. In spite of some current size limitations in the Composer tool, the actual pyramid serving appears to have no real limit. I verified on a few clients and was impressed that this magic works in IE and FireFox, although I don't have a Linux or Mac client to test. The compiled code serves easily from Apache and Tomcat with no additional tweaking required. My next project will be adapting these Deep Zoom pyramids into a tile system. I plan to use either an OWS front end or Live Maps with a grid overlay. The deep zoom tiles can then be accessed by clicking on a tile to open a Silverlight MultiScaleImage. This approach seems like a simple method for expanding coverage over a larger metropolitan area while still using the somewhat limiting Deep Zoom Composer pre-release.