SimpleDB and locations


Fig 1 – SimpleDB GeoRSS locations.

GeoRSS from SimpleDB

Amazon’s SimpleDB service is intriguing because it hints at the future of Cloud databases. Cloud databases need to be at least “tolerant of network partitions,” which leads inevitably to Werner Vogels’s “eventually consistent” cloud data. See previous blog post on Cloud Data. Cloud data is moving toward the scalability horizon discovered by Google. Last week’s AWS announcement, Elastic MapReduce, is another indicator of movement down the road toward infinite scalability.

SimpleDB is an early adopter of data in the Cloud and is somewhat unlike the traditional RDBMS. My interest is how the SimpleDB data approach might be used in a GIS setting. Here is my experiment in a nutshell:

  1. Add GeoNames records to a SimpleDB domain
  2. See what might be done with Bounding Box queries
  3. Export queries as GeoRSS
  4. Try multiple attributes for geographic alternate names
  5. Show query results in a viewer

GeoNames.org is a Creative Commons Attribution licensed collection of GNS, GNIS, and other named point resources with over 8 million names. Since SimpleDB beta allows a single domain to grow up to 10 GB, the experiment should fit comfortably even if I later want to extend it to all countries. A rough storage estimate for a name item uses this formula:
Raw byte size (GB) of all item IDs + 45 bytes per item + Raw byte size (GB) of all attribute names + 45 bytes per attribute name + Raw byte size (GB) of all attribute-value pairs + 45 bytes per attribute-value pair.

I chose a subset of 7 attributes from the GeoNames source <name, alternatenames, latitude, longitude, feature class, feature code, country code>
leading to this rough estimate of storage space:

  • itemid 7+45 = 52
  • attribute names 73+7*45 = 388
  • attribute values average 85 + 7*45 = 400
  • total = 840 bytes per item x 8,000,000 = 6.72 GB
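The arithmetic above can be checked with a few lines of Java. This is only a sketch of the estimate as written in this post; the class and method names are mine, and the raw byte counts (7, 73, 85) are the rough figures from the list, not measured values.

```java
// Rough SimpleDB storage estimate, following the arithmetic in the list above.
class StorageEstimate {
    // Overhead charged per item, per attribute name, and per attribute-value pair.
    static final int OVERHEAD_BYTES = 45;

    /** Estimated bytes per item from raw sizes and the number of attributes. */
    static long bytesPerItem(int itemIdBytes, int attrNameBytes, int attrValueBytes, int attrCount) {
        long itemId = itemIdBytes + OVERHEAD_BYTES;
        long names = attrNameBytes + (long) attrCount * OVERHEAD_BYTES;
        long values = attrValueBytes + (long) attrCount * OVERHEAD_BYTES;
        return itemId + names + values;
    }

    public static void main(String[] args) {
        long perItem = bytesPerItem(7, 73, 85, 7);
        System.out.println("bytes per item: " + perItem);
        System.out.println("8,000,000 items, GB: " + perItem * 8_000_000L / 1e9);
        System.out.println("57,714 items (Colombia), MB: " + perItem * 57_714L / 1e6);
    }
}
```

With these inputs the per-item figure is 840 bytes, which gives about 6.72 GB for 8 million names and a little under 50 MB for the 57,714 Colombia records.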

For experimental purposes I used just the Colombia tab-delimited names file. There are 57,714 records in the Colombia names file, CO.txt, which should come to less than 50 MB. I chose a Spanish-language country to check that the UTF-8 encoding worked properly.
2593108||Loma El Águila||Loma El Aguila||||5.8011111||7.2833333||T||HLL||CO||||36||||||||0||||151||America/Bogota||2002-02-25

Here are some useful links I used to get started with SimpleDB:
  GettingStartedGuide
  Developer Guide

I ran across this very “simple” SimpleDB code: ‘Simple’ SimpleDB code in single Java file/class (240 lines). This Java code was enhanced to add Map collections for Put and Get Attribute commands by Alan Williamson. I had to make some minor changes to allow multiple values under the same key in the HashMap collections. I wanted to be able to use multiple “name” attributes to accommodate alternate names, and eventually alternate translations of names, so Map<String, ArrayList> replaces Map<String, String>.
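As an illustration of the multi-valued map idea, here is a minimal sketch. It is not the code from the ‘Simple’ SimpleDB class; the class name MultiValueAttributes and its methods are hypothetical, and it only shows how several values can live under one attribute name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for SimpleDB-style attributes where one name can carry several values.
class MultiValueAttributes {
    private final Map<String, List<String>> attributes = new LinkedHashMap<>();

    /** Adds a value under the attribute name, keeping any values already stored there. */
    void add(String name, String value) {
        attributes.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values stored under the name, or an empty list if there are none. */
    List<String> get(String name) {
        return attributes.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        MultiValueAttributes item = new MultiValueAttributes();
        item.add("name", "Loma El Águila");
        item.add("name", "Loma El Aguila"); // alternate spelling stored under the same key
        item.add("country code", "CO");
        System.out.println(item.get("name"));
    }
}
```

A plain Map<String, String> would overwrite the first name with the second; the list value is what keeps both.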

However, once I got into my experiment a bit I realized that the limitations of urlencoded Get calls prevented loading the UTF-8 characters found in Colombia’s Spanish-language names. I ended up reverting to the Java version of Amazon’s SimpleDB sample library. I ran into some problems because Amazon’s SimpleDB sample library referenced jaxb-api.jar 2.1 and my local version of Tomcat used an older 2.0 version. I tried some of the suggestions for adding jaxb-api.jar to the /lib/endorsed subdirectory, but in the end just upgrading to the latest version of Tomcat, 6.0.18, fixed my version problems.

One of the more severe limitations of SimpleDB is the single type “String.” To be of any use in a GIS application I need to do Bounding Box queries on latitude and longitude. The “String” type limitation carries across to queries by limiting them to lexicographical ordering. See: SimpleDB numeric encoding for lexicographic ordering. In order to do a Bounding Box query with a lexicographic ordering we have to do some work on the latitude and longitude. AmazonSimpleDBUtil includes some useful utilities for dealing with float numbers.
  String encodeRealNumberRange(float number, int maxDigitsLeft, int maxDigitsRight, int offsetValue)
  float decodeRealNumberRangeFloat(String value, int maxDigitsRight, int offsetValue)

Using maxDigitsLeft 3, maxDigitsRight 7, along with offset 90 for latitude and offset 180 for longitude, encodes this lat,lon pair (1.53952, -72.313633) as (“0915395200”, “1076863670”). Basically these calls move a float into positive integer space and zero fill left and right so that the results sort correctly under lexicographic ordering.
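To make the encoding concrete, here is a small re-implementation of the idea. It is a sketch and not the AmazonSimpleDBUtil source: the class name LexicographicCoordinate is hypothetical, and I use double and BigDecimal instead of float to keep the seven decimal places exact.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Sketch of offset-and-zero-pad encoding; not the AmazonSimpleDBUtil implementation.
class LexicographicCoordinate {

    /** Adds the offset, shifts the decimal point right, and left-pads with zeros. */
    static String encode(double value, int maxDigitsLeft, int maxDigitsRight, int offset) {
        BigDecimal shifted = BigDecimal.valueOf(value)
                .add(BigDecimal.valueOf(offset))
                .movePointRight(maxDigitsRight)
                .setScale(0, RoundingMode.HALF_UP);
        if (shifted.signum() < 0) {
            throw new IllegalArgumentException("offset too small for " + value);
        }
        String digits = shifted.toBigIntegerExact().toString();
        int width = maxDigitsLeft + maxDigitsRight;
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    /** Reverses encode: restores the decimal point and subtracts the offset. */
    static double decode(String encoded, int maxDigitsRight, int offset) {
        return new BigDecimal(new BigInteger(encoded), maxDigitsRight)
                .subtract(BigDecimal.valueOf(offset))
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(encode(1.53952, 3, 7, 90));      // latitude
        System.out.println(encode(-72.313633, 3, 7, 180));  // longitude
        System.out.println(decode("0915395200", 7, 90));
    }
}
```

Because every encoded value has the same width and is non-negative, comparing the strings gives the same order as comparing the original numbers.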

Now we can use a query that will select by bounding box even with the limitation of a lexicographic ordering. For example Bbox(-76.310031, 3.889343, -76.285419, 3.914497) translates to this query:
Select * From GeoNames Where longitude > "1036899690" and longitude < "1037145810" and latitude > "0938893430" and latitude < "0939144970"
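The translation from a bounding box to a Select expression can be sketched as follows. The class BboxQuery and its helper are hypothetical names of mine, the encoding is fixed at three digits left and seven right as described earlier, and the values are wrapped in single quotes, which SimpleDB's Select syntax also accepts.

```java
import java.util.Locale;

// Sketch: build a SimpleDB Select expression for a lon/lat bounding box.
class BboxQuery {

    /** Offsets the value, scales it by 10^7, and zero-pads to 10 digits. */
    static String encode(double value, int offset) {
        long shifted = Math.round((value + offset) * 1e7);
        return String.format(Locale.ROOT, "%010d", shifted);
    }

    /** Bbox order follows the post: minLon, minLat, maxLon, maxLat. */
    static String bboxQuery(String domain, double minLon, double minLat,
                            double maxLon, double maxLat) {
        return "Select * From " + domain
                + " Where longitude > '" + encode(minLon, 180) + "'"
                + " and longitude < '" + encode(maxLon, 180) + "'"
                + " and latitude > '" + encode(minLat, 90) + "'"
                + " and latitude < '" + encode(maxLat, 90) + "'";
    }

    public static void main(String[] args) {
        System.out.println(bboxQuery("GeoNames", -76.310031, 3.889343, -76.285419, 3.914497));
    }
}
```

Since the encoded values contain only digits, plain string concatenation is safe here; anything built from user-supplied text would need quote escaping.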

Once we can select by an area of interest, what is the best way to make our selection available? GeoRSS is a pretty simple XML feed that is consumed by a number of map viewers including VE and OpenLayers. Simple format point entries look like this: <georss:point>45.256 -71.92</georss:point> So we just need an endpoint that will query our GeoNames domain for a bbox and then use the result to create a GeoRSS feed.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
xmlns:georss="http://www.georss.org/georss">
<title>GeoNames from SimpleDB</title>
<subtitle>Experiment with GeoNames in Amazon SimpleDB</subtitle>
<link href="http://www.cadmaps.com/"/>
<updated>2005-12-13T18:30:02Z</updated>
<author>
<name>Randy George</name>
<email>rkgeorge@cadmaps.com</email>
</author>
<entry>
<title>Resguardo Indígena Barranquillita</title>
<description><![CDATA[<a href="http://www.geonames.org/export/codes.html" target="_blank">feature class</a>:L <a
href="http://www.geonames.org/export/codes.html" target="_blank">feature code</a>
:RESV <a
href="http://ftp.ics.uci.edu/pub/websoft/wwwstat/country-codes.txt" target="_blank">country code</a>:CO ]]></description>
<georss:point>1.53952 -72.313633</georss:point>
</entry>
</feed>
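Turning one query result into a feed entry is mostly string assembly. The following is a minimal sketch, not the servlet code from this post; the class GeoRssEntry and its methods are hypothetical, and it emits only the title and the point.

```java
// Sketch: format one named point as an Atom entry with a GeoRSS Simple point.
class GeoRssEntry {

    /** Escapes the characters that would break XML element content. */
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** georss:point is written as latitude first, then longitude, separated by a space. */
    static String entry(String title, double lat, double lon) {
        return "<entry>\n"
                + "<title>" + escapeXml(title) + "</title>\n"
                + "<georss:point>" + lat + " " + lon + "</georss:point>\n"
                + "</entry>\n";
    }

    public static void main(String[] args) {
        System.out.print(entry("Resguardo Indígena Barranquillita", 1.53952, -72.313633));
    }
}
```

The entries would then be concatenated between the feed header and the closing </feed> tag, and written out as UTF-8 so that accented names survive.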

There seems to be some confusion about the GeoRSS mime type: application/xml, text/xml, application/rss+xml, and even application/georss+xml all show up in a brief Google search. In the end I used a Virtual Earth API viewer to consume the GeoRSS results, which isn’t exactly known for caring about header content anyway. I worked for a while trying to get the GeoRSS accepted by OpenLayers.Layer.GeoRSS but never succeeded. It easily accepted static .xml endpoints, but I was never able to get a dynamic servlet endpoint to work. I probably didn’t find the correct mime type.

The Amazon SimpleDB Java library makes this fairly easy. Here is a sample of a servlet using Amazon’s SelectSample.java approach.

Listing 1 – Example Servlet to query SimpleDB and return results as GeoRSS

This example servlet makes use of the nextToken to extend the query results past the 5-second limit. There is also a limit to the number of markers that can be added in the VE SDK. From the Amazon website:
“Since Amazon SimpleDB is designed for real-time applications and is optimized for those use cases, query execution time is limited to 5 seconds. However, when using the Select API, SimpleDB will return the partial result set accumulated at the 5 second mark together with a NextToken to restart precisely from the point previously reached, until the full result set has been returned.”

I wonder if the “5 seconds” indicated in the Amazon quote is correct, as none of my queries seemed to take that long even with multiple nextTokens.
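The NextToken loop itself is simple. Here is a sketch that is independent of the Amazon library: Page and SelectCall are hypothetical stand-ins for the real request and response classes, and maxItems is an assumed cap to stay under a viewer's marker limit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of NextToken paging; Page and SelectCall are hypothetical stand-ins, not Amazon classes.
class NextTokenPaging {

    /** One partial result set plus the token for the next one (null when finished). */
    static final class Page {
        final List<String> items;
        final String nextToken;

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /** Stand-in for a Select call that accepts the token returned by the previous call. */
    interface SelectCall {
        Page select(String query, String nextToken);
    }

    /** Repeats the call until no token comes back or maxItems results have been collected. */
    static List<String> selectAll(SelectCall call, String query, int maxItems) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = call.select(query, token);
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null && all.size() < maxItems);
        // The last page may overshoot the cap, so trim to maxItems.
        return all.size() > maxItems ? new ArrayList<>(all.subList(0, maxItems)) : all;
    }
}
```

In a real servlet the SelectCall would wrap the library's select request, passing the previous response's token along with the same query.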

You can try the results here: Sample SimpleDB query in VE

Summary

SimpleDB can be used for bounding box queries. The response times are reasonable even with the restriction of a String-only type and multiple nextToken SelectRequest calls. Of course this is only a 57,000-item domain. I’d be curious to see a plot of domain size vs query response. Obviously at this stage SimpleDB will not be a replacement for a geospatial database like PostGIS, but this experiment does illustrate the ability to use SimpleDB for some elementary spatial queries. This approach could be extended to arbitrary geometry by storing a bounding box for lines or polygons stored as SimpleDB Items. By adding additional attributes for llx,lly,urx,ury in lexicographically encoded format, arbitrary bbox selections could return all types of geometry intersecting the selection bbox.

Select * From GeoNames Where (llx > "1036899690" and llx < "1037145810" and lly > "0938893430" and lly < "0939144970")
or (urx > "1036899690" and urx < "1037145810" and ury > "0938893430" and ury < "0939144970")
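One caveat: the query above matches items whose lower-left or upper-right corner falls inside the selection, so a line or polygon whose bounding box straddles the selection without either of those corners inside it would be missed. A common alternative is the interval overlap test, sketched below. This is my own illustration rather than something I ran against SimpleDB; the class BboxOverlap is hypothetical and all arguments are assumed to be the equal-width encoded strings described earlier.

```java
// Sketch: interval-overlap test on lexicographically encoded bounding boxes.
class BboxOverlap {

    /** True when the stored box and the selection box share any area or edge. */
    static boolean overlaps(String llx, String lly, String urx, String ury,
                            String selMinX, String selMinY, String selMaxX, String selMaxY) {
        return llx.compareTo(selMaxX) <= 0 && urx.compareTo(selMinX) >= 0
                && lly.compareTo(selMaxY) <= 0 && ury.compareTo(selMinY) >= 0;
    }

    /** The same test written as a Select expression over llx, lly, urx, ury attributes. */
    static String overlapQuery(String domain, String selMinX, String selMinY,
                               String selMaxX, String selMaxY) {
        return "Select * From " + domain
                + " Where llx <= '" + selMaxX + "' and urx >= '" + selMinX + "'"
                + " and lly <= '" + selMaxY + "' and ury >= '" + selMinY + "'";
    }

    public static void main(String[] args) {
        System.out.println(overlapQuery("GeoNames",
                "1036899690", "0938893430", "1037145810", "0939144970"));
    }
}
```

The overlap form needs only four comparisons joined by "and", and it also covers the case where the stored box completely contains the selection.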

Unfortunately, Amazon restricts attribute values to 1024 bytes, which complicates storing vertex arrays. In practice, this limits geometries to point data.

The only advantage offered by SimpleDB is extending the scalability horizon, which isn’t likely to be a problem with vector data.
