One of the more interesting features of SimpleDB is the ability to store multiple values for an attribute. Here is the overview from Amazon’s Service Highlights:
“Flexible – With Amazon SimpleDB, it is not necessary to pre-define all of the data formats you will need to store; simply add new attributes to your Amazon SimpleDB data set when needed, and the system will automatically index your data accordingly. The ability to store structured data without first defining a schema provides developers with greater flexibility when building applications, and eliminates the need to re-factor an entire database as those applications evolve.”
As an extension to my previous blog post I decided to try adding alternative language names:
French – Albuquerque, Nouveau-Mexique
Portuguese – Albuquerque Novo México
Japanese – ニューメキシコ州アルバカーキ
Chinese traditional – 美國新墨西哥州阿爾伯克基
Arabic – البوكيرك (نيو مكسيكو)
Russian – Альбукерке, Нью-Мексико
Using Amazon’s SimpleDB library again, I can add further attributes to an individual item:
AmazonSimpleDB service = new AmazonSimpleDBClient(accessKeyId, secretAccessKey);
try {
HashMap hm = new HashMap();
List<ReplaceableAttribute> attributeListGeoName = new ArrayList<ReplaceableAttribute>(1);
attributeListGeoName.add(new ReplaceableAttribute(attrName, attrValue, false));
PutAttributesRequest request = new PutAttributesRequest(domainName, itemID, attributeListGeoName);
invokePutAttributes(service, request);
} catch (Exception e) {e.printStackTrace();}
After adding several languages for ‘Albuquerque, New Mexico,’ I am able to display them as GeoRSS and then as tooltip text in the Virtual Earth API viewer.
I had to add an explicit character encoding to my HTTP response like this: response.setCharacterEncoding("UTF-8");
Once that was done I could reliably get UTF-8 character strings in the GeoRSS xml returned to the viewer.
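To see why the explicit encoding matters, here is a minimal, self-contained sketch. It is not part of the servlet, and the class name Utf8Sketch is made up for illustration. It round-trips the first part of the Japanese name above through UTF-8 and through ISO-8859-1, which the Servlet specification uses as the default response encoding when none is set. Only the UTF-8 round trip preserves the text; ISO-8859-1 replaces every unmappable character with a question mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Sketch {
    // First seven characters of the Japanese name, written with Unicode
    // escapes so the source file itself does not depend on an encoding.
    static final String NAME = "\u30CB\u30E5\u30FC\u30E1\u30AD\u30B7\u30B3";

    // Encode to bytes and decode again with the same charset.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 preserved: "
                + roundTrip(NAME, StandardCharsets.UTF_8).equals(NAME));
        System.out.println("ISO-8859-1 result: "
                + roundTrip(NAME, StandardCharsets.ISO_8859_1));
    }
}
```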
I am not multilingual, not even bilingual really, so where to go for alternate language translations? I had read about an interesting project over at Google: Language Tools
Here I could simply run a translation of my geoname into whatever languages Google offers. I cannot vouch for their accuracy, but I understand that Google has developed a statistically based translation algorithm that can beat many if not all rule-based algorithms. It was developed by applying statistical pattern processing to very large sets of “Rosetta stone” type documents that had previously been translated. Because it is not rule based, it avoids some of the early auto-translation pitfalls such as translating “hydraulic ram” as a “male water sheep.”
SimpleDB, with its free, unstructured approach to adding attributes, lets me add any number of additional alternateNames attributes in whatever language the UTF-8 character set can represent.
Although this works nicely for point features, more complex spatial features are unsuited to SimpleDB. The limits of 256 attributes per item and 1024 bytes per attribute preclude arbitrary-length polyline or polygon geometry. Perhaps Amazon SimpleDB 2.0 will let attributes be arbitrary length, which would mean polyline and polygon geometries could be added along with a bbox for intersect queries.
Still it is an interesting approach for storing and viewing point data.
Amazon’s SimpleDB service is intriguing because it hints at the future of Cloud databases. Cloud databases need to be at least “tolerant of network partitions,” which leads inevitably to Werner Vogels’s “eventually consistent” cloud data. See previous blog post on Cloud Data. Cloud data is moving toward the scalability horizon discovered by Google. Last week’s AWS announcement, Elastic MapReduce, is another indicator of moving down the road toward infinite scalability.
SimpleDB is an early adopter of data in the Cloud and is somewhat unlike the traditional RDBMS. My interest is how the SimpleDB data approach might be used in a GIS setting. Here is my experiment in a nutshell:
Try multiple attributes for geographic alternate names
Show query results in a viewer
GeoNames.org is a collection of GNS, GNIS, and other named point resources, with over 8 million names, published under a Creative Commons Attribution license. Since the SimpleDB beta allows a single domain to grow up to 10 GB, the experiment should fit comfortably even if I later want to extend it to all countries. A rough storage estimate uses this formula: raw byte size of all item IDs + 45 bytes per item + raw byte size of all attribute names + 45 bytes per attribute name + raw byte size of all attribute-value pairs + 45 bytes per attribute-value pair.
I chose a subset of 7 attributes from the GeoNames source <name, alternatenames, latitude, longitude, feature class, feature code, country code> leading to this rough estimate of storage space:
itemid: 7 + 45 = 52 bytes
attribute names: 73 + 7*45 = 388 bytes
attribute values: average 85 + 7*45 = 400 bytes
total = 840 bytes per item x 8,000,000 items = 6.72 GB
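The arithmetic above can be checked with a few lines of Java. This is only a sketch of the estimate as I applied it; the class and method names are invented for illustration, and the byte counts (7, 73, 85) are the averages quoted above rather than measured values.

```java
class StorageEstimateSketch {
    // Overhead in bytes charged per item, per attribute name,
    // and per attribute-value pair, as given in the formula above.
    static final int OVERHEAD = 45;

    // Estimated bytes for one item with attrCount attributes.
    static long bytesPerItem(int itemIdBytes, int attrNameBytes,
                             int attrValueBytes, int attrCount) {
        long item = itemIdBytes + OVERHEAD;
        long names = attrNameBytes + (long) attrCount * OVERHEAD;
        long values = attrValueBytes + (long) attrCount * OVERHEAD;
        return item + names + values;
    }

    public static void main(String[] args) {
        long perItem = bytesPerItem(7, 73, 85, 7);
        long total = perItem * 8000000L; // all GeoNames records
        System.out.println(perItem + " bytes per item");
        System.out.println(total + " bytes total");
    }
}
```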
For experimental purposes I used just the Colombia tab-delimited names file. There are 57,714 records in the Colombia names file, CO.txt, which should come to less than 50 MB. I chose a Spanish-language country to check that the UTF-8 encoding worked properly. A sample record:
2593108||Loma El Águila||Loma El Aguila||||5.8011111||7.2833333||T||HLL||CO||||36||||||||0||||151||America/Bogota||2002-02-25
I ran across this very “simple” SimpleDB code: ‘Simple’ SimpleDB code in single Java file/class (240 lines). This Java code was enhanced to add Map collections for Put and Get Attribute commands by Alan Williamson. I had to make some minor changes to allow for multiple duplicate key entries in the HashMap collections. I wanted the capability of using multiple “name” attributes to accommodate alternate names and then eventually alternate translations of names, so Map<String, ArrayList> replaces Map<String, String>.
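The idea behind that change is simply a map from attribute name to a list of values. Here is a minimal sketch of such a structure; it is not the code from the single-file library, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MultiValueAttributesSketch {
    // Attribute name -> every value stored under that name, in insertion order.
    private final Map<String, List<String>> attributes =
            new LinkedHashMap<String, List<String>>();

    // Adds a value without overwriting earlier values for the same name.
    void add(String name, String value) {
        List<String> values = attributes.get(name);
        if (values == null) {
            values = new ArrayList<String>();
            attributes.put(name, values);
        }
        values.add(value);
    }

    // Returns all values for a name, or an empty list if there are none.
    List<String> get(String name) {
        List<String> values = attributes.get(name);
        return values == null ? Collections.<String>emptyList() : values;
    }

    public static void main(String[] args) {
        MultiValueAttributesSketch item = new MultiValueAttributesSketch();
        item.add("name", "Loma El \u00C1guila");
        item.add("name", "Loma El Aguila");
        System.out.println(item.get("name").size() + " name values");
    }
}
```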
However, once I got into my experiment a bit I realized that the limitations of URL-encoded GET calls prevented loading the UTF-8 characters found in Colombia’s Spanish-language names. I ended up reverting to the Java version of Amazon’s SimpleDB sample library. I ran into some problems because Amazon’s SimpleDB sample library referenced jaxb-api.jar 2.1 and my local version of Tomcat used an older 2.0 version. I tried some of the suggestions for adding jaxb-api.jar to the /lib/endorsed subdirectory, but in the end just upgrading to the latest version of Tomcat, 6.0.18, fixed my version problems.
One of the more severe limitations of SimpleDB is its single type, “String.” To be of any use in a GIS application I need to do bounding box queries on latitude and longitude. The “String” type limitation carries across to queries by limiting them to lexicographical ordering. See: SimpleDB numeric encoding for lexicographic ordering
In order to do a bounding box query with a lexicographic ordering we have to do some work on the latitude and longitude. AmazonSimpleDBUtil includes some useful utilities for dealing with float numbers:
String encodeRealNumberRange(float number, int maxDigitsLeft, int maxDigitsRight, int offsetValue)
float decodeRealNumberRangeFloat(String value, int maxDigitsRight, int offsetValue)
Using maxDigitsLeft 3 and maxDigitsRight 7, along with offset 90 for latitude and offset 180 for longitude, encodes this lat,lon pair (1.53952, -72.313633) as ("0915395200", "1076863670"). Basically these shift a float into positive integer space and zero-fill left and right so that the results fit lexicographic ordering.
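To make the encoding concrete, here is a stand-alone sketch of the same idea. It is not the AmazonSimpleDBUtil source; the class name is invented, and unlike the library methods it takes the number as a decimal string and uses BigDecimal, an assumption I made to keep float rounding out of the illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class LexEncodeSketch {

    // Add the offset, shift the decimal point right, and left-pad with
    // zeros so every encoded value has the same width.
    static String encode(String number, int maxDigitsLeft,
                         int maxDigitsRight, int offset) {
        BigDecimal shifted = new BigDecimal(number).add(BigDecimal.valueOf(offset));
        if (shifted.signum() < 0) {
            throw new IllegalArgumentException("offset too small for " + number);
        }
        String digits = shifted.movePointRight(maxDigitsRight)
                .setScale(0, RoundingMode.HALF_UP)
                .toBigInteger().toString();
        int width = maxDigitsLeft + maxDigitsRight;
        if (digits.length() > width) {
            throw new IllegalArgumentException("too many digits in " + number);
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    // Reverse the steps: restore the decimal point, then subtract the offset.
    static BigDecimal decode(String value, int maxDigitsRight, int offset) {
        return new BigDecimal(value).movePointLeft(maxDigitsRight)
                .subtract(BigDecimal.valueOf(offset));
    }

    public static void main(String[] args) {
        System.out.println(encode("1.53952", 3, 7, 90));     // latitude
        System.out.println(encode("-72.313633", 3, 7, 180)); // longitude
        System.out.println(decode("0915395200", 7, 90));
    }
}
```

Because every encoded value has the same width and no sign, comparing two encoded strings character by character gives the same answer as comparing the original numbers.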
Now we can use a query that will select by bounding box even with the limitation of a lexicographic ordering. For example, Bbox(-76.310031, 3.889343, -76.285419, 3.914497) translates to this query:
Select * From GeoNames Where longitude > "1036899690" and longitude < "1037145810" and latitude > "0938893430" and latitude < "0939144970"
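Here is a sketch of how the bbox string might be turned into that select expression. The class, the helper names, and the fixed 3.7 digit layout are my own assumptions for illustration. The domain-name check is deliberately conservative (letters, digits, and underscores only), so a request parameter cannot inject extra query text.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BboxQuerySketch {

    // Encode with 3 digits left and 7 digits right of the decimal point.
    static String encode(String number, int offset) {
        BigDecimal shifted = new BigDecimal(number.trim()).add(BigDecimal.valueOf(offset));
        if (shifted.signum() < 0) {
            throw new IllegalArgumentException("out of range: " + number);
        }
        String digits = shifted.movePointRight(7)
                .setScale(0, RoundingMode.HALF_UP).toBigInteger().toString();
        if (digits.length() > 10) {
            throw new IllegalArgumentException("out of range: " + number);
        }
        StringBuilder sb = new StringBuilder(10);
        for (int i = digits.length(); i < 10; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    // bbox is "llx,lly,urx,ury" in decimal degrees.
    static String buildSelect(String domain, String bbox) {
        if (!domain.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected domain name: " + domain);
        }
        String[] f = bbox.split(",");
        if (f.length != 4) {
            throw new IllegalArgumentException("bbox needs four values: " + bbox);
        }
        String llx = encode(f[0], 180);
        String lly = encode(f[1], 90);
        String urx = encode(f[2], 180);
        String ury = encode(f[3], 90);
        return "Select * from " + domain
                + " where longitude > '" + llx + "' and longitude < '" + urx
                + "' and latitude > '" + lly + "' and latitude < '" + ury + "'";
    }

    public static void main(String[] args) {
        System.out.println(buildSelect("GeoNames",
                "-76.310031,3.889343,-76.285419,3.914497"));
    }
}
```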
Once we can select by an area of interest, what is the best way to make our selection available? GeoRSS is a pretty simple XML feed that is consumed by a number of map viewers including VE and OpenLayers. Simple format point entries look like this:
<georss:point>45.256 -71.92</georss:point>
So we just need an endpoint that will query our GeoNames domain for a bbox and then use the result to create a GeoRSS feed.
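As a rough illustration, a single feed entry could be assembled like this. The class and method names are hypothetical, and the element layout is a simplified Atom entry rather than the servlet's exact output. Note that GeoRSS Simple writes the point as latitude first, then longitude, and that names need XML escaping before they go into the feed.

```java
import java.util.Locale;

class GeoRssEntrySketch {

    // Escape the characters that are special in XML text and attributes.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // One entry with a title and a GeoRSS Simple point ("lat lon").
    static String entry(String title, double lat, double lon) {
        // Locale.ROOT keeps '.' as the decimal separator on every machine.
        String point = String.format(Locale.ROOT, "%.7f %.7f", lat, lon);
        return "<entry>\n"
                + "  <title>" + escapeXml(title) + "</title>\n"
                + "  <georss:point>" + point + "</georss:point>\n"
                + "</entry>";
    }

    public static void main(String[] args) {
        System.out.println(entry("Loma El \u00C1guila", 5.8011111, -73.2833333));
    }
}
```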
There seems to be some confusion about the GeoRSS MIME type: application/xml, text/xml, application/rss+xml, and even application/georss+xml all show up in a brief Google search. In the end I used a Virtual Earth API viewer to consume the GeoRSS results, which isn’t exactly known for caring about header content anyway. I worked for a while trying to make the GeoRSS acceptable to OpenLayers.Layer.GeoRSS but never succeeded. It easily accepted static .xml endpoints, but I was never able to get a dynamic servlet endpoint to work. I probably didn’t find the correct MIME type.
The Amazon SimpleDB Java library makes this fairly easy. Here is a sample of a servlet using Amazon’s SelectSample.java approach.
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import com.amazonaws.sdb.*;
import com.amazonaws.sdb.model.*;
import com.amazonaws.sdb.util.*;
public class GeoRSS extends HttpServlet {
/** Logger for this class and subclasses */
protected static final Log logger = LogFactory.getLog(GeoRSS.class);
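private static final String accessKeyId = "YOUR_ACCESS_KEY_ID"; // placeholder: supply your own AWS access key ID
private static final String secretAccessKey = "YOUR_SECRET_ACCESS_KEY"; // placeholder: supply your own AWS secret key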
public GeoRSS() {
super();
}
public void destroy() {
super.destroy();
}
/*
* (non-Javadoc)
* @see javax.servlet.http.HttpServlet#doGet(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse)
*parameters:
* bbox: llx,lly,urx,ury
* domainname:
*/
public void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
String path = request.getContextPath();
String basePath = request.getScheme()+"://"+request.getServerName()+":"+request.getServerPort()+path;
logger.info("basePath="+basePath);
String llx = null;
String lly = null;
String urx = null;
String ury = null;
String nextToken = null;
String name = null;
String domainName = "GeoNames";//Default SimpleDB domain name
String bbox = "-76.310031,3.889343,-76.285419,3.914497";//Default bbox
String [] values;
Enumeration enumer = request.getParameterNames();
while (enumer.hasMoreElements()) {
name = (String)enumer.nextElement();
if (name.equals("bbox")) {
values = request.getParameterValues(name);
bbox = values[0];
}
else if (name.equals("domainname")) {
values = request.getParameterValues(name);
domainName = values[0];
}
}
logger.info("bbox: "+bbox);
//response.setContentType("application/rss+xml");
//response.setContentType("application/georss+xml");
response.setContentType("application/xml");
PrintWriter out = response.getWriter();
out.println("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
out.println("<feed xmlns=\"http://www.w3.org/2005/Atom\"");
out.println("xmlns:georss=\"http://www.georss.org/georss\">");
out.println("<title>Experiment with GeoNames in Amazon SimpleDB</title>");
out.println("<link href=\"" + basePath + "\"/>");
out.println("<updated>2005-12-13T18:30:02Z</updated>");
out.println("<author>");
out.println("<name>Randy George</name>");
out.println("<email>rkgeorge@cadmaps.com</email>");
out.println("</author>");
String[] fields = bbox.split(",");
llx = AmazonSimpleDBUtil.encodeRealNumberRange(Float.parseFloat(fields[0]),3,7,180);
lly = AmazonSimpleDBUtil.encodeRealNumberRange(Float.parseFloat(fields[1]),3,7,90);
urx = AmazonSimpleDBUtil.encodeRealNumberRange(Float.parseFloat(fields[2]),3,7,180);
ury = AmazonSimpleDBUtil.encodeRealNumberRange(Float.parseFloat(fields[3]),3,7,90);
AmazonSimpleDB service = new AmazonSimpleDBClient(accessKeyId, secretAccessKey);
SelectRequest SimpleDBrequest = new SelectRequest();
String selectExpression = "Select * from " + domainName + " where longitude > '"+llx+"' and longitude < '"+urx+"' and latitude > '"+lly+"' and latitude < '"+ury+"'";
SimpleDBrequest.withSelectExpression(selectExpression);
do {
nextToken = invokeSelect(service, SimpleDBrequest, out);
SimpleDBrequest = new SelectRequest(selectExpression, nextToken);
} while(nextToken != null);
out.println("</feed>");
out.flush();
out.close();
}
public static String invokeSelect(AmazonSimpleDB service, SelectRequest request, PrintWriter out) {
String nextToken = null;
try {
float lat = 0.0f;
float lon = 0.0f;
String nameid = null;
String featureclass = null;
String featurecode = null;
String countrycode = null;
String name = null;
String alternatename = null;
SelectResponse response = service.select(request);
if (response.isSetSelectResult()) {
SelectResult selectResult = response.getSelectResult();
List<Item> itemList = selectResult.getItem();
for (Item item : itemList) {
if (item.isSetName()) {
nameid = item.getName();
}
List<Attribute> attributeList = item.getAttribute();
for (Attribute attribute : attributeList) {
if (attribute.isSetName()) {
if (attribute.getName().equals("longitude")){
if (attribute.isSetValue()) lon = AmazonSimpleDBUtil.decodeRealNumberRangeFloat(attribute.getValue(), 7, 180);
}
else if (attribute.getName().equals("latitude")){
if (attribute.isSetValue()) lat = AmazonSimpleDBUtil.decodeRealNumberRangeFloat(attribute.getValue(), 7, 90);
}
else if (attribute.getName().equals("name")){
if (attribute.isSetValue()) name = attribute.getValue();
}
else if (attribute.getName().equals("alternatenames")){
if (attribute.isSetValue()) alternatename = attribute.getValue();
}
else if (attribute.getName().equals("feature class")){
if (attribute.isSetValue()) featureclass = attribute.getValue();
}
else if (attribute.getName().equals("feature code")){
if (attribute.isSetValue()) featurecode = attribute.getValue();
}
else if (attribute.getName().equals("country code")){
if (attribute.isSetValue()) countrycode = attribute.getValue();
}
}
}
out.println("<entry>");
out.println("<title>"+name+"</title>");
out.println("<content type=\"html\"><![CDATA[");
if (alternatename != null && alternatename.length() > 0) out.println("alternate: "+alternatename+"<br/>");
out.println("feature class:"+featureclass+"<br/>");
out.println("feature code:"+featurecode+"<br/>");
out.println("country code:"+countrycode+"<br/>");
out.println("lat,lon:"+lat+","+lon+"<br/>");
out.println("]]>");
out.println("</content>");
out.println("<georss:point>"+lat+" " +lon+"</georss:point>");
out.println("</entry>");
}
if (selectResult.isSetNextToken()) {
nextToken = selectResult.getNextToken();
}
}
if (response.isSetResponseMetadata()) {
ResponseMetadata responseMetadata = response.getResponseMetadata();
if (responseMetadata.isSetRequestId()) {
//logger.info(responseMetadata.getRequestId());
}
if (responseMetadata.isSetBoxUsage()) {
//logger.info(responseMetadata.getBoxUsage());
}
}
} catch (AmazonSimpleDBException ex) {
logger.info("Caught Exception: " + ex.getMessage());
logger.info("Response Status Code: " + ex.getStatusCode());
logger.info("Error Code: " + ex.getErrorCode());
logger.info("Error Type: " + ex.getErrorType());
logger.info("Request ID: " + ex.getRequestId());
System.out.print("XML: " + ex.getXML());
}
return nextToken;
}
public String getServletInfo() {
return "GeoRSS from SimpleDB servlet";
}
public void init() throws ServletException {
}
}
Listing 1 – Example Servlet to query SimpleDB and return results as GeoRSS
This example servlet makes use of the nextToken to extend the query results past the 5-second limit. There is also a limit to the number of markers that can be added in the VE SDK. From the Amazon website: “Since Amazon SimpleDB is designed for real-time applications and is optimized for those use cases, query execution time is limited to 5 seconds. However, when using the Select API, SimpleDB will return the partial result set accumulated at the 5 second mark together with a NextToken to restart precisely from the point previously reached, until the full result set has been returned.”
I wonder if the “5 seconds” indicated in the Amazon quote is correct, as none of my queries seemed to take that long even with multiple nextTokens.
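The nextToken pattern itself is independent of SimpleDB: keep requesting pages, passing back the token from the previous page, until no token is returned. Here is a minimal sketch with a made-up PageSource interface standing in for the real select call; none of these names come from the Amazon library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NextTokenPagingSketch {

    // One page of results plus the token for the next page (null when done).
    static class Page {
        final List<String> items;
        final String nextToken;

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    // Hypothetical stand-in for a service call that accepts a token.
    interface PageSource {
        Page fetch(String nextToken);
    }

    // Collect every page, starting with a null token for the first request.
    static List<String> fetchAll(PageSource source) {
        List<String> all = new ArrayList<String>();
        String token = null;
        do {
            Page page = source.fetch(token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    public static void main(String[] args) {
        // Fake two-page source used only to exercise the loop.
        PageSource fake = new PageSource() {
            public Page fetch(String nextToken) {
                if (nextToken == null) {
                    return new Page(Arrays.asList("a", "b"), "page2");
                }
                return new Page(Arrays.asList("c"), null);
            }
        };
        System.out.println(fetchAll(fake));
    }
}
```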
SimpleDB can be used for bounding box queries. The response times are reasonable even with the restriction of a String-only type and multiple nextToken SelectRequest calls. Of course this is only a 57,000 item domain. I’d be curious to see a plot of domain size vs. query response. Obviously at this stage SimpleDB will not be a replacement for a geospatial database like PostGIS, but this experiment does illustrate the ability to use SimpleDB for some elementary spatial queries.
This approach could be extended to arbitrary geometry by storing a bounding box for lines or polygons stored as SimpleDB Items. By adding additional attributes for llx,lly,urx,ury in lexicographically encoded format, arbitrary bbox selections could return all types of geometry whose bounding box intersects the selection bbox. Two boxes intersect when they overlap on both axes, so the query compares each stored edge against the opposite edge of the selection:
Select * From GeoNames Where llx < "1037145810" and urx > "1036899690" and lly < "0939144970" and ury > "0938893430"
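One general way to express bbox intersection is an interval-overlap test on each axis. A test that only checks whether a stored corner falls inside the selection would miss features that span the whole selection. Here is a sketch of the overlap predicate working directly on the encoded strings; the class and method names are hypothetical, and it assumes all values use the same fixed-width encoding so that string comparison matches numeric comparison.

```java
class BboxOverlapSketch {

    // True when the feature box and the selection box overlap on both axes.
    // All arguments are fixed-width, zero-padded encoded strings.
    static boolean overlaps(String llx, String lly, String urx, String ury,
                            String selLlx, String selLly,
                            String selUrx, String selUry) {
        return llx.compareTo(selUrx) < 0 && urx.compareTo(selLlx) > 0
                && lly.compareTo(selUry) < 0 && ury.compareTo(selLly) > 0;
    }

    // The same condition written as a where clause over llx,lly,urx,ury attributes.
    static String whereClause(String selLlx, String selLly,
                              String selUrx, String selUry) {
        return "llx < '" + selUrx + "' and urx > '" + selLlx
                + "' and lly < '" + selUry + "' and ury > '" + selLly + "'";
    }

    public static void main(String[] args) {
        // A feature box that spans the whole selection still matches.
        System.out.println(overlaps(
                "1030000000", "0930000000", "1040000000", "0940000000",
                "1036899690", "0938893430", "1037145810", "0939144970"));
        System.out.println(whereClause(
                "1036899690", "0938893430", "1037145810", "0939144970"));
    }
}
```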
Unfortunately, Amazon restricts attributes to 1024 bytes, which complicates storing vertex arrays. Practically speaking, this limits geometries to point data.
The only advantage offered by SimpleDB is extending the scalability horizon, which isn’t likely to be a problem with vector data.