SimpleDB Alternate Language GeoNames

SimpleDB GeoRSS
Fig 1 – SimpleDB GeoRSS alternate language names.

SimpleDB alternate language names

One of the areas of interest using SimpleDB is the ability to add multiple attribute values. Here is the overview from Amazon’s Service Highlights.

“Flexible – With Amazon SimpleDB, it is not necessary to pre-define all of the data formats you will need to store; simply add new attributes to your Amazon SimpleDB data set when needed, and the system will automatically index your data accordingly. The ability to store structured data without first defining a schema provides developers with greater flexibility when building applications, and eliminates the need to re-factor an entire database as those applications evolve.”

As an extension to my previous blog post I decided to try adding alternative language names:

  French – Albuquerque, Nouveau-Mexique
  Portuguese – Albuquerque Novo México
  Japanese – ニューメキシコ州アルバカーキ
  Chinese traditional – 美國新墨西哥州阿爾伯克基
  Arabic – البوكيرك (نيو مكسيكو)
  Russian – Альбукерке, Нью-Мексико

Amazon’s sdb library again makes it possible to add additional attributes to an individual item:

AmazonSimpleDB service = new AmazonSimpleDBClient(accessKeyId, secretAccessKey);
try {
  // One attribute per call; replace=false adds the value alongside any existing values
  List<ReplaceableAttribute> attributeListGeoName = new ArrayList<ReplaceableAttribute>(1);
  attributeListGeoName.add(new ReplaceableAttribute(attrName, attrValue, false));
  PutAttributesRequest request = new PutAttributesRequest(domainName, itemID, attributeListGeoName);
  invokePutAttributes(service, request);
} catch (Exception e) {
  e.printStackTrace();
}

After adding several languages for ‘Albuquerque, New Mexico,’ I am able to display them as GeoRSS and then as tooltip text in the Virtual Earth API viewer.

I had to add an explicit character encoding to my HTTP response. Once that was done I could reliably get UTF-8 character strings in the GeoRSS XML returned to the viewer.
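The snippet from the original post is not reproduced here. As a minimal sketch of why the explicit encoding matters, the following standard-library example writes the same name through a writer with two different charsets. The class and method names are hypothetical. In a servlet the equivalent step would presumably be a call such as response.setCharacterEncoding("UTF-8") before obtaining the writer, but that is an assumption, not the author's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: the charset chosen for the writer decides
// whether non-Latin names survive the trip to the client.
class ExplicitEncodingSketch {

    // Writes text to an in-memory stream with the given charset and returns the raw bytes.
    static byte[] write(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Russian spelling of "Albuquerque", written with Unicode escapes
        String name = "\u0410\u043b\u044c\u0431\u0443\u043a\u0435\u0440\u043a\u0435";

        byte[] utf8 = write(name, StandardCharsets.UTF_8);
        byte[] latin1 = write(name, StandardCharsets.ISO_8859_1);

        // UTF-8 round-trips; ISO-8859-1 cannot represent Cyrillic and substitutes '?'
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(name));
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

The first line printed is true and the second is false, which is the same kind of loss the viewer showed before the encoding was set explicitly.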

I am not multilingual, not even bilingual really, so where to go for alternate language translations? I had read about an interesting project over at Google: Language Tools

Here I could simply run a translation of my geoname into whatever languages Google offers. I cannot vouch for their accuracy, but I understand that Google has developed a statistically based language translation algorithm that can beat many if not all rule-based algorithms. It was developed by applying statistical pattern processing to very large sets of “Rosetta stone” type documents that had been previously translated. Because it is not rule-based, it avoids some of the early auto-translation pitfalls such as translating “hydraulic ram” as a “male water sheep.”

SimpleDB, with its free unstructured approach to adding attributes, lets me add any number of additional alternateNames attributes in whatever language the UTF-8 character set supports.

Although this works nicely for point features, more complex spatial features are unsuited to SimpleDB. The limits of 256 attributes per item and 1024 bytes per attribute value preclude arbitrary-length polyline or polygon geometry. Perhaps Amazon SimpleDB 2.0 will let attributes be arbitrary length, which would mean polyline and polygon geometries could be added along with a bbox for intersect queries.

Still it is an interesting approach for storing and viewing point data.

SimpleDB and locations

SimpleDB GeoRSS
Fig 1 – SimpleDB GeoRSS locations.

GeoRSS from SimpleDB

Amazon’s SimpleDB service is intriguing because it hints at the future of Cloud databases. Cloud databases need to be at least “tolerant of network partitions,” which leads inevitably to Werner Vogels’ “eventually consistent” cloud data. See my previous blog post on Cloud Data. Cloud data is moving toward the scalability horizon discovered by Google. Last week’s announcement on AWS, Elastic MapReduce, is another indicator of moving down the road toward infinite scalability.

SimpleDB is an early adopter of data in the Cloud and is somewhat unlike the traditional RDBMS. My interest is how the SimpleDB data approach might be used in a GIS setting. Here is my experiment in a nutshell:

  1. Add GeoNames records to a SimpleDB domain
  2. See what might be done with Bounding Box queries
  3. Export queries as GeoRSS
  4. Try multiple attributes for geographic alternate names
  5. Show query results in a viewer

GeoNames is a Creative Commons attribution license collection of GNS, GNIS, and other named point resources with over 8 million names. Since SimpleDB beta allows a single domain to grow up to 10 GB, the experiment should fit comfortably even if I later want to extend it to all countries. A rough estimate of the storage for a name item uses this formula:
Raw byte size (GB) of all item IDs + 45 bytes per item + Raw byte size (GB) of all attribute names + 45 bytes per attribute name + Raw byte size (GB) of all attribute-value pairs + 45 bytes per attribute-value pair.

I chose a subset of 7 attributes from the GeoNames source <name, alternatenames, latitude, longitude, feature class, feature code, country code>
leading to this rough estimate of storage space:

  • itemid 7 + 45 = 52
  • attribute names 73 + 7*45 = 388
  • attribute values average 85 + 7*45 = 400
  • total = 840 bytes per item x 8,000,000 = 6.72 GB
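The arithmetic above can be sketched in a few lines of Java. This simply mirrors the estimate as written in this post (45 bytes of overhead for the item ID and for each of the 7 attribute names and values). The class and method names are hypothetical, and the byte counts are the rough averages quoted above.

```java
// Hypothetical helper that mirrors the rough per-item storage estimate above.
class StorageEstimate {

    // Overhead charged per item, per attribute name, and per attribute-value pair
    static final int OVERHEAD_BYTES = 45;

    static long bytesPerItem(int itemIdBytes, int attrNameBytes, int attrValueBytes, int attrCount) {
        long itemId = itemIdBytes + OVERHEAD_BYTES;
        long names = attrNameBytes + (long) attrCount * OVERHEAD_BYTES;
        long values = attrValueBytes + (long) attrCount * OVERHEAD_BYTES;
        return itemId + names + values;
    }

    public static void main(String[] args) {
        long perItem = bytesPerItem(7, 73, 85, 7);
        System.out.println("bytes per item: " + perItem);
        System.out.println("8,000,000 items, GB: " + perItem * 8_000_000L / 1e9);
        System.out.println("57,714 items, MB: " + perItem * 57_714L / 1e6);
    }
}
```

With the figures above this gives 840 bytes per item, about 6.72 GB for 8 million names, and roughly 48.5 MB for a 57,714-record country file.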

For experimental purposes I used just the Colombia tab-delimited names file. There are 57,714 records in the Colombia names file, CO.txt, which should come to less than 50 MB. I chose a Spanish language country to check that the UTF-8 encoding worked properly.
2593108||Loma El Águila||Loma El Aguila||||5.8011111||7.2833333||T||HLL||CO||||36||||||||0||||151||America/Bogota||2002-02-25

Here are some useful links I used to get started with SimpleDB:
  Developer guide

I ran across this very “simple” SimpleDB code: ‘Simple’ SimpleDB code in single Java file/class (240 lines). This Java code was enhanced to add Map collections for Put and Get Attribute commands by Alan Williamson. I had to make some minor changes to allow for multiple duplicate key entries in the HashMap collections. I wanted to have the capability of using multiple “name” attributes for accommodating alternate names and then eventually alternate translations of names, so Map<String, ArrayList> replaces Map<String, String>.
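As a minimal sketch of that change, here is a map that keeps a list of values per attribute name, so a second "name" is appended instead of overwriting the first. This is an illustration only, not the modified code itself; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a multi-valued attribute map.
class MultiValueAttributes {

    // Appends a value under the attribute name, creating the list on first use.
    static void put(Map<String, List<String>> attrs, String name, String value) {
        attrs.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> attrs = new LinkedHashMap<>();
        put(attrs, "name", "Loma El \u00c1guila"); // accented form
        put(attrs, "name", "Loma El Aguila");      // plain ASCII alternate
        put(attrs, "country code", "CO");

        // Both names survive under the same key
        System.out.println(attrs.get("name").size()); // prints 2
    }
}
```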

However, once I got into my experiment a bit I realized the limitations of URL-encoded GET calls prevented loading the UTF-8 character set found in Colombia’s Spanish language names. I ended up reverting to the Java version of Amazon’s SimpleDB sample library. I ran into some problems since Amazon’s SimpleDB sample library referenced jaxb-api.jar 2.1 and my local version of Tomcat used an older 2.0 version. I tried some of the suggestions for adding jaxb-api.jar to the /lib/endorsed subdirectory, but in the end just upgrading to the latest version of Tomcat, 6.0.18, fixed my version problems.

One of the more severe limitations of SimpleDB is its single data type, “String.” To be of any use in a GIS application I need to do Bounding Box queries on latitude and longitude. The “String” type limitation carries across to queries by limiting comparisons to lexicographical ordering. See: SimpleDB numeric encoding for lexicographic ordering. In order to do a Bounding Box query with lexicographic ordering we have to do some work on the latitude and longitude. AmazonSimpleDBUtil includes some useful utilities for dealing with float numbers:
  String encodeRealNumberRange(float number, int maxDigitsLeft, int maxDigitsRight, int offsetValue)
  float decodeRealNumberRangeFloat(String value, int maxDigitsRight, int offsetValue)

Using maxDigitsLeft 3 and maxDigitsRight 7, along with offset 90 for latitude and offset 180 for longitude, encodes the lat,lon pair (1.53952, -72.313633) as (“0915395200”, “1076863670”). Basically, the offset moves each float into positive integer space, and zero filling left and right makes the resulting strings sort lexicographically in the same order as the numbers.
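The encoding can be reproduced with a short sketch. This is not the AmazonSimpleDBUtil implementation. It is a hypothetical stand-in that takes the same kind of parameters and uses BigDecimal so that the decimal digits are not disturbed by floating-point rounding.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical re-implementation of offset plus zero-padding encoding.
class LexicographicEncoding {

    // Shifts the value by the offset, scales it to an integer, and left-pads with zeros.
    static String encode(double value, int maxDigitsLeft, int maxDigitsRight, int offset) {
        BigDecimal shifted = BigDecimal.valueOf(value).add(BigDecimal.valueOf(offset));
        if (shifted.signum() < 0) {
            throw new IllegalArgumentException("offset too small for " + value);
        }
        String digits = shifted.movePointRight(maxDigitsRight)
                .setScale(0, RoundingMode.HALF_UP)
                .toBigIntegerExact()
                .toString();
        int width = maxDigitsLeft + maxDigitsRight;
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    // Reverses encode: restores the decimal point, then removes the offset.
    static double decode(String encoded, int maxDigitsRight, int offset) {
        return new BigDecimal(encoded)
                .movePointLeft(maxDigitsRight)
                .subtract(BigDecimal.valueOf(offset))
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(encode(1.53952, 3, 7, 90));      // prints 0915395200
        System.out.println(encode(-72.313633, 3, 7, 180));  // prints 1076863670
    }
}
```

Because every encoded string has the same width, comparing two of them character by character gives the same result as comparing the original numbers.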

Now we can use a query that will select by bounding box even with the limitation of a lexicographic ordering. For example Bbox(-76.310031, 3.889343, -76.285419, 3.914497) translates to this query:
Select * From GeoNames Where longitude > "1036899690" and longitude < "1037145810" and latitude > "0938893430" and latitude < "0939144970"
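Building that query string from a bounding box can be sketched as follows. The helper names are hypothetical, and the encoding is a compact stand-in for the AmazonSimpleDBUtil call, fixed at 3 digits left and 7 digits right as described above.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Locale;

// Hypothetical query builder for a lexicographically encoded bounding box.
class BboxQuerySketch {

    // Offset, scale by 10^7, and zero-pad to 10 digits.
    static String enc(double value, int offset) {
        long scaled = BigDecimal.valueOf(value)
                .add(BigDecimal.valueOf(offset))
                .movePointRight(7)
                .setScale(0, RoundingMode.HALF_UP)
                .longValueExact();
        return String.format(Locale.ROOT, "%010d", scaled);
    }

    // Bbox order follows the text: (minLon, minLat, maxLon, maxLat).
    static String bboxQuery(String domain, double minLon, double minLat,
                            double maxLon, double maxLat) {
        return "Select * From " + domain
                + " Where longitude > \"" + enc(minLon, 180) + "\""
                + " and longitude < \"" + enc(maxLon, 180) + "\""
                + " and latitude > \"" + enc(minLat, 90) + "\""
                + " and latitude < \"" + enc(maxLat, 90) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(bboxQuery("GeoNames", -76.310031, 3.889343, -76.285419, 3.914497));
    }
}
```

Run with the bounding box from the example, this prints the same Select statement shown above.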

Once we can select by an area of interest, what is the best way to make our selection available? GeoRSS is a pretty simple XML feed that is consumed by a number of map viewers including VE and OpenLayers. Simple format point entries look like this: <georss:point>45.256 -71.92</georss:point>. So we just need an endpoint that will query our GeoNames domain for a bbox and then use the result to create a GeoRSS feed.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="">
<title>GeoNames from SimpleDB</title>
<subtitle>Experiment with GeoNames in Amazon SimpleDB</subtitle>
<link href=""/>
<name>Randy George</name>
<title>Resguardo Indígena Barranquillita</title>
<description><![CDATA[<a href="" target="_blank">feature class</a>:L <a
href="" target="_blank">feature code</a>
:RESV <a
href="" target="_blank">country code</a>:CO ]]></description>
<georss:point>1.53952 -72.313633</georss:point>
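Turning one query result into a feed entry is mostly string assembly plus XML escaping. The sketch below is a hypothetical helper, not the servlet code from this experiment. It assumes an Atom-style entry element and that the georss prefix is declared on the enclosing feed element.

```java
import java.math.BigDecimal;

// Hypothetical builder for a single GeoRSS point entry.
class GeoRssEntrySketch {

    // Escapes the characters that would otherwise break XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // georss:point is written as "latitude longitude", separated by a space.
    static String entry(String title, double lat, double lon) {
        // toPlainString avoids scientific notation for very small coordinates
        String point = BigDecimal.valueOf(lat).toPlainString() + " "
                + BigDecimal.valueOf(lon).toPlainString();
        return "<entry>\n"
                + "  <title>" + escape(title) + "</title>\n"
                + "  <georss:point>" + point + "</georss:point>\n"
                + "</entry>\n";
    }

    public static void main(String[] args) {
        System.out.print(entry("Resguardo Ind\u00edgena Barranquillita", 1.53952, -72.313633));
    }
}
```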

There seems to be some confusion about the GeoRSS MIME type: application/xml, text/xml, application/rss+xml, and even application/georss+xml all show up in a brief Google search. In the end I used a Virtual Earth API viewer to consume the GeoRSS results, which isn’t exactly known for caring about header content anyway. I worked for a while trying to get the GeoRSS accepted by OpenLayers.Layer.GeoRSS but never succeeded. It easily accepted static .xml endpoints, but I never was able to get a dynamic servlet endpoint to work. I probably didn’t find the correct MIME type.

The Amazon SimpleDB Java library makes this fairly easy. Here is a sample of a servlet using Amazon’s approach.

Listing 1 – Example Servlet to query SimpleDB and return results as GeoRSS

This example servlet makes use of the nextToken to extend the query results past the 5s limit. There is also a limit to the number of markers that can be added in the VE sdk. From the Amazon website:
“Since Amazon SimpleDB is designed for real-time applications and is optimized for those use cases, query execution time is limited to 5 seconds. However, when using the Select API, SimpleDB will return the partial result set accumulated at the 5 second mark together with a NextToken to restart precisely from the point previously reached, until the full result set has been returned.”

I wonder if the “5 seconds” indicated in the Amazon quote is correct, as none of my queries seemed to take that long even with multiple nextTokens.
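The NextToken pattern described in the quote can be sketched independently of the Amazon library. Everything here is hypothetical: SelectService and Page are stand-in types, and the in-memory fake hands out results a page at a time in place of a real SimpleDB domain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of looping on a continuation token until results are exhausted.
class NextTokenLoopSketch {

    // One partial result set plus the token for the next call (null when done).
    static final class Page {
        final List<String> items;
        final String nextToken;

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    // Stand-in for a service call that accepts a query and an optional token.
    interface SelectService {
        Page select(String query, String nextToken);
    }

    // Repeats the same query, passing back each token, until none is returned.
    static List<String> selectAll(SelectService service, String query) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = service.select(query, token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    // In-memory fake: the token is simply the index of the next unread item.
    static SelectService fake(List<String> data, int pageSize) {
        return (query, token) -> {
            int start = token == null ? 0 : Integer.parseInt(token);
            int end = Math.min(start + pageSize, data.size());
            String next = end < data.size() ? String.valueOf(end) : null;
            return new Page(new ArrayList<>(data.subList(start, end)), next);
        };
    }

    public static void main(String[] args) {
        SelectService service = fake(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(selectAll(service, "select * from GeoNames")); // prints [a, b, c, d, e]
    }
}
```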

You can try the results here: Sample SimpleDB query in VE


SimpleDB can be used for bounding box queries. The response times are reasonable even with the restriction of a String-only type and multiple nextToken SelectRequest calls. Of course this is only a 57,000-item domain. I’d be curious to see a plot of domain size vs query response. Obviously at this stage SimpleDB will not be a replacement for a geospatial database like PostGIS, but this experiment does illustrate the ability to use SimpleDB for some elementary spatial queries. This approach could be extended to arbitrary geometry by storing a bounding box for lines or polygons stored as SimpleDB Items. By adding additional attributes for llx, lly, urx, ury in lexicographically encoded format, arbitrary bbox selections could return all types of geometry intersecting the selection bbox.

Select * From GeoNames Where (llx > "1036899690" and llx < "1037145810" and lly > "0938893430" and lly < "0939144970")
or (urx > "1036899690" and urx < "1037145810" and ury > "0938893430" and ury < "0939144970")
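One caveat: the corner test above matches features whose lower-left or upper-right corner falls inside the selection, so a feature whose box spans the whole selection would be missed. A common alternative is the interval overlap test, sketched here on the encoded strings. This is an illustration with hypothetical names, not something tested against SimpleDB in this experiment.

```java
// Hypothetical sketch of a bounding box overlap test on encoded strings.
class BboxOverlapSketch {

    // Two boxes overlap when each one starts before the other ends, on both axes.
    static boolean overlaps(String llx, String lly, String urx, String ury,
                            String minX, String minY, String maxX, String maxY) {
        return llx.compareTo(maxX) < 0 && urx.compareTo(minX) > 0
                && lly.compareTo(maxY) < 0 && ury.compareTo(minY) > 0;
    }

    // The same test expressed as a Select statement over llx, lly, urx, ury.
    static String overlapQuery(String domain, String minX, String minY,
                               String maxX, String maxY) {
        return "Select * From " + domain
                + " Where llx < \"" + maxX + "\" and urx > \"" + minX + "\""
                + " and lly < \"" + maxY + "\" and ury > \"" + minY + "\"";
    }

    public static void main(String[] args) {
        // A feature box that completely surrounds the selection still matches
        System.out.println(overlaps("1030000000", "0930000000", "1040000000", "0940000000",
                "1036899690", "0938893430", "1037145810", "0939144970")); // prints true
        System.out.println(overlapQuery("GeoNames",
                "1036899690", "0938893430", "1037145810", "0939144970"));
    }
}
```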

Unfortunately, Amazon restricts attribute values to 1024 bytes, which complicates storing vertex arrays. Practically speaking, this limits geometries to point data.

The only advantage offered by SimpleDB is extending the scalability horizon, which isn’t likely to be a problem with vector data.