Amazon has released a handful of new offerings.
Fig 1 – Big Foot Memory
1. New Price for EC2 instances
Notice that the small instance, now $0.12/hr, matches Azure pricing.
This is not really an apples-to-apples comparison, since an Amazon instance is a virtual machine while Azure charges per deployed application. A single virtual instance can have multiple service/web apps deployed.
2. Amazon announces a Relational Database Service (RDS)
Based on MySQL 5.1, this doesn’t appear to add a whole lot, since you could always start an instance with any database you wanted. MySQL isn’t exactly known for geospatial work, even though it has some spatial capabilities. You can see a small comparison of PostGIS vs MySQL by Paul Ramsey. I don’t know if this comparison is still valid, but I haven’t seen much use of MySQL for spatial backends.
This is similar to SQL Azure, which is also a convenience deployment that lets you run SQL Server as an Azure service without all the headaches of administration and maintenance tasks. Neither of these options is cloud scaled, meaning they are still single-instance versions, not cross-partition capable. The SQL Azure CTP has an upper limit of 10 GB, as in disk, not RAM.
3. Amazon adds new high-memory instances
- High-Memory Double Extra Large Instance: 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform, $1.20-$1.44/hr
- High-Memory Quadruple Extra Large Instance: 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform, $2.40-$2.88/hr
These are new virtual instance AMIs that scale up as opposed to scaling out. Scaled-out options use clusters of instances in the Grid Computing/Hadoop type of architecture. There is nothing to prohibit using clusters of scaled-up instances in a hybridized architecture, other than cost. However, the premise of Hadoop arrays is “divide and conquer,” so it makes less sense to have massive nodes in the array. Since scaling out involves moving the problem to a whole new parallel programming paradigm with all of its consequent complexity, it also means owning the code. In contrast, scaling up is generally very simple: you don’t have to own the code or even recompile; you just install on more capable hardware.
Returning to Amazon RDS: Amazon has presumably taken an optimized, compiled route and offers prepackaged MySQL 5.1 instances ready to use:
- db.m1.small (1.7 GB of RAM, $0.11 per hour)
- db.m1.large (7.5 GB of RAM, $0.44 per hour)
- db.m1.xlarge (15 GB of RAM, $0.88 per hour)
- db.m2.2xlarge (34 GB of RAM, $1.55 per hour)
- db.m2.4xlarge (68 GB of RAM, $3.10 per hour)
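As a quick aside, here is my own back-of-the-envelope comparison of RAM per dollar across these classes, using only the prices quoted above:

```python
# Cost per GB of RAM per hour for the RDS classes listed above.
# RAM figures and prices are the ones quoted in this post.
rds = {
    "db.m1.small":   (1.7, 0.11),   # (GB of RAM, $/hr)
    "db.m1.large":   (7.5, 0.44),
    "db.m1.xlarge":  (15.0, 0.88),
    "db.m2.2xlarge": (34.0, 1.55),
    "db.m2.4xlarge": (68.0, 3.10),
}

for name, (ram_gb, usd_hr) in rds.items():
    print(f"{name}: ${usd_hr / ram_gb:.4f} per GB of RAM per hour")
```

By this arithmetic the big m2 classes actually come out cheaper per GB of RAM (about $0.046/GB-hr) than db.m1.small (about $0.065/GB-hr).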
Of course, the higher spatial functionality of PostgreSQL/PostGIS can be installed on any of these high-memory instances as well; it just isn’t packaged by Amazon. The important thing to note is that memory is approaching 100 GB per instance! What does one do with all that memory?
Here is one use:
Having a very large memory footprint makes sense for increasing the performance of a DB application. Even fairly large data tables can reside entirely in memory for optimum performance. Whether a database makes use of the best optimizing compiler for Amazon’s 64-bit instances would need to be explored. Open source options like PostgreSQL/PostGIS would let you experiment with compiling under your choice of compilers, though perhaps not successfully.
Todd Hoff has some insightful analysis in his post, “Are Cloud-Based Memory Architectures the Next Big Thing?”
Here is Todd Hoff’s point about having your DB run entirely in RAM (remember that 68 GB Quadruple Extra Large instance):
Wow! Can that be right? “Memory latency is 2,000 times faster.”

(Hmm… 13 milliseconds = 13,000,000 nanoseconds, so 13,000,000 ns / 5 ns = 2,600,000x? And 5 Gb/s ÷ 100 Mb/s = 50x? Am I doing the math right?)
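Checking that arithmetic mechanically, with the 13 ms seek, an assumed ~5 ns RAM latency, and the two bandwidth figures from my aside above:

```python
# Disk seek (13 ms) vs. an assumed ~5 ns RAM latency, both in nanoseconds.
disk_seek_ns = 13e-3 / 1e-9        # 13 ms expressed in ns = 13,000,000
mem_latency_ns = 5.0
latency_ratio = disk_seek_ns / mem_latency_ns
print(latency_ratio)               # ~2,600,000x, a long way past 2,000x

# Disk streaming bandwidth vs. network bandwidth: 5 Gb/s vs. 100 Mb/s.
bandwidth_ratio = 5e9 / 100e6      # 50x
print(bandwidth_ratio)
```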
The real question, of course, is what will actual benchmarks reveal? Presumably optimized memory caching narrows the gap between disk storage and RAM. Which brings up the problem of configuring a Database to use large RAM pools. PostgreSQL has a variety of configuration settings but to date RDBMS software doesn’t really have a configuration switch that simply caches the whole enchilada.
Here is some discussion of front-ending MySQL with an In-Memory Data Grid (IMDG).
Here is an article on a PostgreSQL configuration to use a RAM disk.
Tuning for large memory is not exactly straightforward. There is no “one size fits all,” and you can quickly get into Managing Kernel Resources. The two most important parameters are shared_buffers and sort_mem:
OK, given this rough guideline on a Quadruple Extra Large Instance 68Gb:
- shared_buffers = 17 GB (25%)
- sort_mem = 2.72 GB (4%)
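As a sketch, that guideline would look something like this in postgresql.conf (note that sort_mem was renamed work_mem in PostgreSQL 8.0, and work_mem is allocated per sort/hash operation, not once globally):

```
shared_buffers = 17GB    # ~25% of RAM
work_mem = 2720MB        # ~4% of RAM, but applied per sort operation
```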
This still leaves plenty of room, 48.28 GB, to avoid the dreaded swap page-in by the OS. Let’s assume a more typical 8 GB for the OS; we still have 40 GB to play with. Looking at sort activity in detail might justify adding some more sort_mem, maybe bumping it to 5 GB. That still leaves an additional 38 GB to drop into shared_buffers, for a grand total of 55 GB. Of course, you have to have a pretty hefty set of spatial tables to use up this kind of space.
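Spelled out, the headroom arithmetic goes like this (same numbers as above, all in GB):

```python
# Memory budget for a 68 GB Quadruple Extra Large instance, in GB.
total = 68.0
shared_buffers = 0.25 * total            # 17.0 GB (25% guideline)
sort_mem = 0.04 * total                  # ~2.72 GB (4% guideline)
headroom = total - shared_buffers - sort_mem
print(round(headroom, 2))                # 48.28 GB left over

os_reserve = 8.0                         # a more typical OS footprint
sort_mem = 5.0                           # bumped for heavy sorting
shared_buffers = total - os_reserve - sort_mem
print(shared_buffers)                    # 55.0 GB grand total
```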
Here is a list of PostgreSQL limitations. As you can see, it is technically possible to run out of even 68 GB.
| Limit | Value |
|---|---|
| Maximum Database Size | Unlimited |
| Maximum Table Size | 32 TB |
| Maximum Row Size | 1.6 TB |
| Maximum Field Size | 1 GB |
| Maximum Rows per Table | Unlimited |
| Maximum Columns per Table | 250 to 1600, depending on column types |
| Maximum Indexes per Table | Unlimited |
Naturally the Obe duo has a useful posting on determining PostGIS sizes: “Determining size of database, schema, tables, and geometry.”
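For a quick look without their helpers, PostgreSQL’s built-in size functions cover the basics; a sketch (the table name below is just a placeholder for one of your spatial tables):

```sql
-- Whole database, human readable
SELECT pg_size_pretty(pg_database_size(current_database()));

-- One table, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('public.planet_osm_point'));
```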
To get some perspective on size, an Open Street Map dump of the whole world fits into a 90 GB EBS Amazon Public Data Set configured for PostGIS with pg_createcluster. It looks like this just happened a couple of weeks ago. Although 90 GB is a little out of reach for even a Quadruple Extra Large, I gather the current size of Planet OSM is still in the 60 GB range, so you might just fit it into 55 GB of RAM. It would be a tad tight. Well, maybe an Octuple Extra Large Instance with 136 GB is not too far off. Of course, who knows how big Planet OSM will ultimately end up being.
Another point to notice is the 8 virtual cores in a Quadruple Extra Large Instance. Unfortunately, running a single-connection query apparently won’t benefit from a multi-CPU virtual system, even though a multi-threaded workload will definitely help with multiple connection pools.
I look forward to someone actually running benchmarks since that would be the genuine reality check.
Scaling up is the least complex way to boost performance on a lagging application. The Cloud offers lots of choices suitable to a range of budgets and problems. If you want to optimize personnel and adopt a decoupled SOA architecture, you’ll want to look at Azure + SQL Azure. If you want the adventure of large scale research problems, you’ll want to look at instance arrays and Hadoop clusters available in Amazon AWS.
However, if you just want a quick fix (maybe not 2000x, but at least some x), take a look at Big RAM. If you do, please let us know the benchmarks!