LiDAR processing with Hadoop on EC2

I was inspired yesterday by a FRUGOS talk from a LiDAR production manager showing how to use some open source tools to make processing LiDAR easier. LiDAR stands for LIght Detection And Ranging. It’s a popular technology these days for developing detailed terrain models and ground classification. The USGS has some information here: USGS CLICK, and the state of PA has an ambitious plan to fly the entire state every three years at a 1.4 ft resolution: PAMAP LiDAR

My first stop was to download some .las data from the PAMAP site. This PA LiDAR has been tiled into 10000 x 10000 ft sections which are roughly 75MB each. First I wrote a .las translator so I could look at the data. The LAS 1.1 specification is about to be replaced by LAS 2.0, but in the meantime most available las data still uses the older spec. The spec contains a header with space for variable length attributes, followed by the point data. Each point record contains 20 bytes (at least in the PAMAP .las) of information, including x, y, z as long values plus classification and reflectance values. The xyz data is scaled and offset by values in the header to obtain double values. The LiDAR sensor mounted on an airplane flies tracks across the area of interest while the LiDAR pulses at a set frequency. The location of each reading can then be determined by post processing against the GPS location of the instrument. The pulse scans result in tracks that angle across the flight path.
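As a rough sketch of the point decoding step, here is how one 20-byte record could be unpacked in Java. The field layout follows the LAS 1.1 point data record format 0; the scale and offset parameters are placeholders for values that would come from the public header block, and this is not the actual translator code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Minimal sketch of decoding one 20-byte LAS 1.1 point record (format 0).
// scaleX/offX etc. are assumed to have been read from the .las header.
public class LasPoint {
    public static double[] decode(byte[] record,
                                  double scaleX, double scaleY, double scaleZ,
                                  double offX, double offY, double offZ) {
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        int rawX = buf.getInt();                   // x, y, z stored as scaled long (4-byte) values
        int rawY = buf.getInt();
        int rawZ = buf.getInt();
        int intensity = buf.getShort() & 0xFFFF;   // reflectance value
        buf.get();                                 // return number / flag bits
        int classification = buf.get() & 0xFF;     // ground, vegetation, etc.
        // actual coordinate = raw * scale + offset, giving double values
        double x = rawX * scaleX + offX;
        double y = rawY * scaleY + offY;
        double z = rawZ * scaleZ + offZ;
        return new double[] { x, y, z, intensity, classification };
    }
}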

Since the PAMAP data is available in PA83-SF coordinates, despite the metric units shown in the metadata, I also added a conversion to UTM83-17. This was a convenience for displaying over Terraserver backgrounds using the available EPSG:26917. Given the high-res aerial data available from PAMAP, this step may be unnecessary in a later iteration.
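For the reprojection itself, a library like GeoTools can do the math. A minimal sketch, assuming EPSG:2272 (NAD83 / Pennsylvania South, ftUS) for the PA83-SF source; the correct source code should be confirmed against the PAMAP metadata.

import org.geotools.geometry.DirectPosition2D;
import org.geotools.referencing.CRS;
import org.opengis.referencing.crs.CoordinateReferenceSystem;
import org.opengis.referencing.operation.MathTransform;

// Sketch: reproject a PAMAP point from PA State Plane South (ft) to UTM83-17.
// EPSG:2272 is an assumption here; verify against the PAMAP metadata.
public class Reproject {
    public static double[] toUtm17(double eastingFt, double northingFt) throws Exception {
        CoordinateReferenceSystem src = CRS.decode("EPSG:2272", true);   // force easting-first axis order
        CoordinateReferenceSystem dst = CRS.decode("EPSG:26917", true);  // NAD83 / UTM zone 17N
        MathTransform tx = CRS.findMathTransform(src, dst, true);        // lenient datum handling
        DirectPosition2D in = new DirectPosition2D(src, eastingFt, northingFt);
        DirectPosition2D out = new DirectPosition2D();
        tx.transform(in, out);
        return new double[] { out.getX(), out.getY() };                  // meters in UTM 17N
    }
}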

Actually, PAMAP has done this step already: TIFF grids are available for download. However, I’m interested in the gridding algorithm and a possible implementation on a Hadoop cluster, so PAMAP is a convenient test set with a solution already available for comparison. It is also nice to think about going directly from raw .las data to my end goal of a set of image and mesh pyramids for the 3D viewer I am exploring for GMTI data.


Fig 1 – LiDAR scan tracks in xaml over USGS DOQ

Although each section is only about 3.5 square miles, the scans generate more than a million points per square mile. The pulse rate achieves roughly 1.4 ft spacing for the PAMAP data collection. My translator produces a xaml file of approximately 85MB, which is much too large to show in a browser. I just pulled some scans from the beginning and the end to show location and get an idea of scale.

The interesting aspect of LiDAR is the large amount of collected data that is accessible to the public. In order to really make use of this in a web viewer I will need to subtile and build sets of image pyramids. The scans are not orthogonal, so the next task will be to grid the raw data into an orthogonal cell set. Actually there will be three grids: one for elevation, one for classification, and one for reflectance. With gridded data sets, I will be able to build xaml 3D meshes overlaid with the reflectance or classification. The reflectance and classification will be translated to some type of image file, probably png.
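A minimal sketch of that gridding step is below. The class and field names are mine, and the "last point in a cell wins" assignment is a simplification; a real implementation would more likely average points or pick the one nearest the cell center.

// Sketch: bin decoded LiDAR points into orthogonal 1m cells for one PAMAP section.
// Three parallel grids are kept: elevation, classification, and reflectance.
public class SectionGrid {
    static final int CELLS = 3048;          // ~3048m x 3048m per section at 1m cells

    double[][] elevation      = new double[CELLS][CELLS];
    byte[][]   classification = new byte[CELLS][CELLS];
    int[][]    reflectance    = new int[CELLS][CELLS];

    private final double originX, originY;  // lower-left corner of the section (m)

    public SectionGrid(double originX, double originY) {
        this.originX = originX;
        this.originY = originY;
    }

    // point = { x, y, z, intensity, classification } as produced by the decoder sketch above
    public void add(double[] point) {
        int col = (int) (point[0] - originX);   // 1m cell size, so no division needed
        int row = (int) (point[1] - originY);
        if (col < 0 || col >= CELLS || row < 0 || row >= CELLS) return;
        elevation[row][col]      = point[2];    // last point in the cell wins (simplification)
        reflectance[row][col]    = (int) point[3];
        classification[row][col] = (byte) point[4];
    }
}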

Replacing the PA83-SF coordinates with a metric UTM base gives close to 3048m x 3048m per PAMAP section. If the base cell starts at 1 meter, using a 50 cell x 50 cell patch, the pyramid will look like this (the short sketch after the table shows the arithmetic):
dim    cells/patch   patch size        approx faces/section
1m     50x50         50m x 50m            9,290,304
2m     50x50         100m x 100m          2,322,576
4m     50x50         200m x 200m            580,644
8m     50x50         400m x 400m            145,161
16m    50x50         800m x 800m             36,290
32m    50x50         1600m x 1600m            9,072
64m    50x50         3200m x 3200m            2,268
128m   50x50         6400m x 6400m              567
256m   50x50         12800m x 12800m            141
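The arithmetic behind the table is simple enough to show directly: the cell dimension doubles at each level, so the face count per section drops by roughly a factor of four.

// Sketch of the pyramid arithmetic: face count per 3048m x 3048m section
// is (3048 / cell dimension)^2, truncated, for each doubling of cell size.
public class PyramidLevels {
    public static void main(String[] args) {
        double section = 3048.0;                 // meters per PAMAP section side
        for (int dim = 1; dim <= 256; dim *= 2) {
            long faces = (long) ((section / dim) * (section / dim));
            System.out.printf("%4dm cells  %5dm x %5dm patches (50x50)  ~%,d faces%n",
                    dim, dim * 50, dim * 50, faces);
        }
    }
}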

There will end up being two image pyramids and a 3D xaml mesh pyramid, with the viewer able to move from one level to the next based on zoom scale. PAMAP also has high resolution aerial imagery which could be added as an additional overlay.

Again this is a large amount of data, so pursuing the AWS cluster thoughts from my previous posting, it will be interesting to build a Hadoop HDFS cluster. I did not realize that there is actually a Hadoop public AMI available with EC2-specific scripts that can initialize and terminate sets of EC2 instances. The Hadoop cluster tools are roughly similar to the Google GFS, with a master node and as many slaves as desired. Hadoop is an Apache project started by the Lucene team and inspired by the Google GFS. The approach is useful for processing data sets larger than available disk space, or 160GB on a small EC2 instance. The default instance limit for EC2 users is currently 20. With permission this can be increased, but going above 100 seems to be problematic at this stage. Using a Hadoop cluster of 20 EC2 instances still provides a lot of horsepower for processing data sets like LiDAR. The cost works out to 20 x $0.10 per hour = $2.00/hr, which is quite reasonable for finite workflows. Hadoop EC2 AMI
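To give a feel for how the gridding might decompose onto Hadoop, here is a sketch using the mapred API of that era. The input is assumed to be text records of already-decoded points, one "x,y,z" per line; this is my own guess at a decomposition, not a working PAMAP job.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: map each decoded point ("x,y,z" per line) to its 1m grid cell,
// then reduce to a mean elevation per cell. The classification and
// reflectance grids would follow the same pattern.
public class GridJob {

    public static class CellMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, DoubleWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, DoubleWritable> out, Reporter rep)
                throws IOException {
            String[] f = value.toString().split(",");
            long col = (long) Double.parseDouble(f[0]);   // 1m cells: truncate x,y to cell indices
            long row = (long) Double.parseDouble(f[1]);
            out.collect(new Text(row + ":" + col),
                        new DoubleWritable(Double.parseDouble(f[2])));
        }
    }

    public static class CellReducer extends MapReduceBase
            implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text cell, Iterator<DoubleWritable> values,
                           OutputCollector<Text, DoubleWritable> out, Reporter rep)
                throws IOException {
            double sum = 0;
            int n = 0;
            while (values.hasNext()) { sum += values.next().get(); n++; }
            out.collect(cell, new DoubleWritable(sum / n));   // mean elevation for the cell
        }
    }
}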

This seems to be a feasible approach to doing gridding and pyramid building on large data resources like PAMAP. As PAMAP paves the way, other state and federal programs will likely follow, with comprehensive sub-meter elevation and classification available for large parts of the USA. Working out the browser viewing requires a large static set of pre-tiled images and xaml meshes. The interesting part of this adventure is building and utilizing a supercomputer on demand. Hadoop makes it easier, but this is still speculative at this point.


Fig 2 – LiDAR scan tracks in xaml over USGS DRG

I still have not abandoned the pipeline approach to WPS either. 52North has an early implementation of a WPS available, written in Java. Since it can be installed as a war, it should be simple to start up a pool of EC2 instances using an AMI with this WPS war installed. The challenge, in this case, is to make a front end in WPF that would allow the user to wire up the instance pool in different configurations. Each WPS would be configured with an atomistic service, which is then fed by a previous service on up the chain to the data source. Conceptually this would be similar to the pipeline approach used in JAI, which builds a process chain but leaves it empty until a process request is initialized. At that point the data flows down through the process nodes and out at the other end. Conceptually this is quite simple: one AMI per WPS process and a master instance to push the data down the chain. The challenge will be a configuration interface to select a WPS atom at each node and wire the processes together.
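To make the wiring idea a little more concrete, here is a purely hypothetical sketch of the JAI-style deferred chain. All of the names are mine; in the real setup each node would sit behind a 52North WPS Execute request rather than a Java interface, but the lazy pull-through-the-chain behavior is the point.

// Hypothetical sketch of a deferred process chain: each node wraps an
// upstream source and does no work until a request pulls data through it.
public interface ProcessNode {
    double[][] pull();   // e.g. a tile of gridded values
}

class SourceNode implements ProcessNode {
    private final double[][] tile;
    SourceNode(double[][] tile) { this.tile = tile; }
    public double[][] pull() { return tile; }
}

class WpsNode implements ProcessNode {
    private final ProcessNode upstream;
    private final java.util.function.UnaryOperator<double[][]> op;  // the atomistic WPS process

    WpsNode(ProcessNode upstream, java.util.function.UnaryOperator<double[][]> op) {
        this.upstream = upstream;
        this.op = op;
    }

    public double[][] pull() {
        return op.apply(upstream.pull());   // lazy: runs only when the request arrives
    }
}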

I am open to suggestions on a reasonably useful WPS chain to use in experimentations.

AWS is an exciting technology and promises a lot of fun for computer junkies who have always had a hankering to play with supercomputing, but didn’t quite make the grade for a Google position.

Amazon SQS: the poor man's supercomputer

EC2 and S3 are not the only AWS services of interest to the geospatial community. Amazon SQS (Simple Queue Service) is also quite interesting. I haven’t looked into it too far, but unlimited locking message queues combined with large instance arrays amount to a poor man’s supercomputer. For problems of a certain scale that can be broken recursively into multiple subsets, parallel computing techniques have long been used. Numerous distributed computing projects come to mind: Active Distributed Computing Projects.

Perhaps AWS can be configured for short-burst supercomputer problems in an economical fashion. By breaking a problem into enough small chunks and adding them to a set of SQS queues pointed at a configurable array of AMI instances, voila, we have an AWS supercomputer! The EC2 instance array would pull data chunks out of a queue, process them, and queue the results back to an aggregator instance. An interesting problem might be to determine whether such a scenario would be queue constrained or processing instance constrained. Amazon resources are not infinite: “If you wish to run more than 20 instances, please contact us at aws@amazon.com ” However, let’s imagine a utility computing environment of the future.
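A minimal sketch of the queue plumbing, using the AWS SDK for Java as an illustration. The queue names and the chunk encoding are placeholders, and the SQS API of that era differed in detail; the sketch is only meant to show the shape of the worker loop.

import java.util.List;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

// Sketch: a master pushes work chunks into a queue; each EC2 worker in the
// array pulls a chunk, processes it, and pushes the result to an aggregator queue.
public class ChunkWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String workQueue   = sqs.getQueueUrl("lidar-chunks").getQueueUrl();   // hypothetical queue names
        String resultQueue = sqs.getQueueUrl("lidar-results").getQueueUrl();

        while (true) {
            List<Message> msgs = sqs.receiveMessage(workQueue).getMessages();
            for (Message m : msgs) {
                String result = process(m.getBody());                 // the actual gridding/convolution step
                sqs.sendMessage(resultQueue, result);                 // hand off to the aggregator instance
                sqs.deleteMessage(workQueue, m.getReceiptHandle());   // remove the chunk once processed
            }
        }
    }

    static String process(String chunk) { return chunk; /* placeholder */ }
}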

In the AWS of the future an instance array can be more like Deep Blue. A modest 32×32 array provides 1024 discrete process instances which is possibly within current limits, but a more ambitious 256×256 array at 65536 distinct instances would not be out of the question on the five year horizon.

In the geospatial arena there are numerous problems amenable to distributed processing. With the massive collection of geospatial imagery presently underway, collection and storage are already a large problem for NASA, NOAA, JPL, USGS etc. Add to this problem the issue of scientific exploration of these massive data sets and distributed computing may have a large role to play within the same 5 year horizon.

This week OGC announced the final release of the Web Processing Service, WPS. OGC WPS press release The Web Processing Service spec provides a blueprint for services that ask higher level questions like why?, how much?, and what if? The goal is to provide interchangeable service process algorithms that can potentially be chained into answers to these types of higher level questions. For example, a LiDAR scene can be processed into a roughness measure using a convolution kernel. When the result is compared with other bands from hyperspectral sensors in some Boolean operation, the output could be used to answer the question: “how many acres of drought tolerant grassland lie within Kit Carson county?” There are at least two distinct functions: 1) roughness calculation and 2) Boolean combination, with possibly a third to sum all pixels in the expected range for a final area measure.
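A toy sketch of those functions follows. The roughness measure here is just the deviation of each cell from its 3x3 neighborhood mean, and the thresholds and the second band are invented for illustration only.

// Sketch: 1) a roughness measure from a 3x3 neighborhood, 2) a Boolean
// combination with another band, 3) summing qualifying cells into an area.
public class Roughness {

    // deviation of each cell from the mean of its 3x3 neighborhood
    static double[][] roughness(double[][] elev) {
        int rows = elev.length, cols = elev[0].length;
        double[][] out = new double[rows][cols];
        for (int r = 1; r < rows - 1; r++) {
            for (int c = 1; c < cols - 1; c++) {
                double sum = 0;
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++)
                        sum += elev[r + dr][c + dc];
                out[r][c] = Math.abs(elev[r][c] - sum / 9.0);
            }
        }
        return out;
    }

    // Boolean AND of "smooth enough" with a vegetation index band, then count
    // the qualifying cells and convert to acres (cellArea in square meters).
    static double grasslandAcres(double[][] rough, double[][] ndvi, double cellArea) {
        double sqMeters = 0;
        for (int r = 0; r < rough.length; r++)
            for (int c = 0; c < rough[0].length; c++)
                if (rough[r][c] < 0.5 && ndvi[r][c] > 0.3)   // invented thresholds
                    sqMeters += cellArea;
        return sqMeters / 4046.86;    // square meters per acre
    }
}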

Now add a distributed compute model. The simplest is one process per instance. In this approach each analysis request gets its own EC2 instance, and all processes run sequentially in that single dedicated instance. This is of course a big help and far different from the typical one-server, multi-request model. But now we can move down this stream another step or two.

Next, why not one instance for each process step? In this case a queue connects to a downstream instance. Process one performs the convolution and, as chunks/cells/tiles become available, pushes them into SQS. Process two, the Boolean union, picks chunks from the other end of the queue to build the end result from a series of Boolean tile operations. The queue decouples the two processes so that asynchronous operation is possible. If the first process proceeds at twice the speed of the second, simply add another instance to the other end of the queue. In this scenario we have one request, two WPS processes, and perhaps 3 AMI instances. This improves things a bit, actually quite a bit. The cost per request has at least tripled, but throughput has also been increased by close to the same factor.

Now comes a full blown distributed model. Like most array objects, geospatial processes can be broken into smaller subsets with the same process replicated over an array of subsets in a parallel fashion. Now each step in the process chain can have an array of instances, each working on a small chunk. These chunks feed into multiple queues directed downstream to process two, which is also an array of instances. We now have supercomputing potential: a 32×32 array pool of instances for process one feeding a set of queues connecting to a second 32×32 array pool of instances working on process two. At 1024 instances per process we can quickly see the current AWS is not going to be happy. The cost is now magnified by a factor of a thousand, but only if the instance pools are maintained continuously. If the pools are only in use for the duration of the request, the cost could potentially be of the same magnitude as the one-process-per-instance architecture, while throughput is increased by the same factor of a thousand. Short burst supercomputing inside utility computing warehouses like AWS could be quite cost effective.

It is conceivable that some analysis chains will involve dozens of process steps over very large imagery sets. Harnessing the ephemeral instance creation of utility computing points toward solutions to complex WPS process chains in near real time all on the internet cloud. So SQS does have some interesting potential in the geospatial analysis arena.

A quick look at GoGrid


Fig 1 – a sample ASP.NET 3.5 website running on a GoGrid server instance

GoGrid ( http://www.gogrid.com ) is a cloud service similar to AWS. Just like Amazon's AWS EC2, the user starts a virtual server instance from a template and then uses the instance like a dedicated server. The cost is similar to AWS, starting at about $0.10 per hour for a minimal server. The main difference from a user perspective is the addition of Windows servers and an easy to use control panel. The GoGrid control panel provides point and click setup of server clusters, even with a hardware load balancer.

The main attraction for me is the availability of virtual Windows Servers. There are several Windows 2003 configuration templates as well as sets of RedHat or CentOS Linux templates:
· Windows 2003 Server (32 bit)/ IIS
· Windows 2003 Server (32 bit)/ IIS/ASP.NET/SQL Server 2005 Express Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Express Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Workgroup Edition
· Windows 2003 Server (32 bit)/ SQL Server 2005 Standard Edition

The number of templates is more limited than EC2's and I did not see a way to create custom templates. However, this limitation is offset by ease of management.
For my experiment I chose the Windows 2003 Server (32 bit)/ IIS/ASP.NET/SQL Server 2005 Express Edition. This offered the basics I needed to serve a temporary ASP web application.

After signing up, I entered my GoGrid control panel. Here I can add a service by selecting from the option list.


Fig 2 – GoGrid Control Panel

Filling out a form with the basic RAM, OS, and image lets me add a WebApp server to my panel. I could additionally add several WebApp servers and configure a load balancer along with a backend database server by similarly filling out Control Panel forms. This appears to take the AWS EC2 service a step further by letting typical scaling workflows be part of the front end GUI. Although scaling in this manner can be done in AWS, it requires installation of a software load balancer on one of the EC2 instances and a manual setup process.


Fig 3 – example of a GoGrid WebAPP configuration form

Once my experimental server came on line I was able to Remote Desktop into the server and begin configuring my WebApp. I first installed the Microsoft .NET 3.5 framework so I could make use of some of its new features. I then copied up a sample web application showing the use of a GoogleMap Earth mode control in a simple ASP interface. This is a display interface connected to a different database server for displaying GMTI results out of a PostGIS table.

Since I did not want to point a domain at this experimental server, I simply assigned the GoGrid IP to my IIS website. I ran into a slight problem here because the sample webapp was created using .NET 3.5 System.Web.Extensions. The webapp was not able to recognize the extension configurations in my web.config file. I tried copying the System.Web.Extensions dlls into my webapp bin folder, but I was still getting errors. I then downloaded the ASP.NET AJAX control and installed it on the GoGrid server but still was unable to get the website to display. Finally I went back to Visual Studio and remade the webapp using the ASP.NET Web App template without the extensions. I was then able to upload to my GoGrid server and configure IIS to see my website as the default http service.

There was still one more problem. I could see the website from the local GoGrid system but not from outside. After contacting GoGrid support I was quickly in operation, with a pointer to the Windows Firewall which GoGrid Support kindly fixed for me. The problem was that the Windows 2003 template I chose does not open port 80 by default. I needed to use the Firewall manager to open port 80 for the http service. For those wanting to use ftp, the same would be required for port 21.

I now had my experimental system up and running. I had chosen a 1GB memory server, so my actual cost on the server is $0.19/hour, which gets a little less for the money than the AWS EC2 small instance:

Small Instance (Default) – $0.10/hour
1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform

But again, running ASP.NET 3.5 is much more complex on EC2, requiring a Mono installation on a Linux base. I have not yet tried that combination and somehow doubt that it would work with a complex ASP.NET 3.5 website, especially with Ajax controls.

The GoogleMap Control with the Earth mode was also interesting. I had not yet embedded this into an ASP website. It proved to be fairly simple. I just needed to add an <asp:ScriptManager ID="mapscriptmanager" runat="server"/> to my site Master page and then the internal page JavaScript used to create the GoogleMap Control worked as normal.

I had some doubts about accessing the GMTI points from the webapp since there are often restrictions on cross-domain XMLHttpRequests. There was no problem. My GMTI to KML servlet produces the KML MIME type "application/vnd.google-earth.kml+xml", which is picked up in the client JavaScript using the Google API:
geoXml = new GGeoXml(url);

Evidently cross-domain restrictions did not apply in this case, which made me happy, since I didn't have to write a proxy servlet just to access the GMTI points on a different server.

In summary, GoGrid is a great cloud service which finally opens the cloud to Microsoft shops. The GUI control panel is easy to use, and configuring a fully scalable, load balanced cluster can be done right from the control panel. GoGrid fills a big hole in the cloud computing world.