LiDAR processing with Hadoop on EC2

I was inspired yesterday by a FRUGOS talk from a LiDAR production manager showing how to use some open source tools to make processing LiDAR easier. LiDAR stands for LIght Detection And Ranging. It's a popular technology these days for developing detailed terrain models and ground classification. The USGS has some information here: USGS Click, and the state of PA has an ambitious plan to fly the entire state every three years at a 1.4 ft resolution: PAMAP LiDAR

My first stop was to download some .las data from the PAMAP site. This PA LiDAR has been tiled into 10000x10000 ft sections which are roughly 75MB each. First I wrote a .las translator so I could look at the data. The LAS 1.1 specification is about to be replaced by LAS 2.0, but in the meantime most available las data still uses the older spec. The spec contains a header with space for variable length attributes, followed by the point data. Each point record contains 20 bytes (at least in the PAMAP .las) of information including x, y, z as long values, plus a classification and a reflectance value. The xyz data is scaled and offset by values in the header to obtain double values. The LiDAR sensor mounted on an airplane flies tracks across the area of interest while the LiDAR pulses at a set frequency. The location of each reading can then be determined by post processing against the GPS location of the instrument. The pulse scans result in tracks that angle across the flight path.
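For reference, here is a minimal Java sketch of reading the 20-byte point format 0 records, with byte offsets taken from the LAS 1.1 spec (the five-point limit and the absence of error handling are just for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LasReader {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            // Fixed 227-byte LAS 1.1 header, little-endian.
            ByteBuffer hdr = ByteBuffer.allocate(227).order(ByteOrder.LITTLE_ENDIAN);
            ch.read(hdr, 0);
            long offsetToPoints = hdr.getInt(96) & 0xFFFFFFFFL;  // bytes 96-99
            int recordLength = hdr.getShort(105) & 0xFFFF;       // bytes 105-106
            long pointCount = hdr.getInt(107) & 0xFFFFFFFFL;     // bytes 107-110
            double xScale = hdr.getDouble(131), yScale = hdr.getDouble(139), zScale = hdr.getDouble(147);
            double xOff = hdr.getDouble(155), yOff = hdr.getDouble(163), zOff = hdr.getDouble(171);

            // Point format 0: x/y/z stored as scaled 32-bit integers.
            ByteBuffer rec = ByteBuffer.allocate(recordLength).order(ByteOrder.LITTLE_ENDIAN);
            for (long i = 0; i < Math.min(pointCount, 5); i++) {
                rec.clear();
                ch.read(rec, offsetToPoints + i * recordLength);
                double x = rec.getInt(0) * xScale + xOff;
                double y = rec.getInt(4) * yScale + yOff;
                double z = rec.getInt(8) * zScale + zOff;
                int intensity = rec.getShort(12) & 0xFFFF;       // the "reflectance" value
                int classification = rec.get(15) & 0xFF;
                System.out.printf("%.2f %.2f %.2f cls=%d int=%d%n", x, y, z, classification, intensity);
            }
        }
    }
}
```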

Since the PAMAP data is available in PA83-SF coordinates, despite the metric units shown in the metadata, I also added a conversion to UTM83-17. This was a convenience for displaying over Terraserver backgrounds using the available EPSG:26917. Given the high res aerial data available from PAMAP this step may be unnecessary in a later iteration.
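For the reprojection itself, a library such as proj4j is one option in Java. A minimal sketch, assuming EPSG:2272 (NAD83 / Pennsylvania South, ftUS) is the right match for the PA83-SF data, which is worth verifying against the PAMAP metadata:

```java
import org.locationtech.proj4j.*;

public class Pa83ToUtm {
    public static void main(String[] args) {
        CRSFactory crsFactory = new CRSFactory();
        // EPSG:2272 = NAD83 / Pennsylvania South (ftUS) -- assumed match for PA83-SF
        CoordinateReferenceSystem src = crsFactory.createFromName("EPSG:2272");
        // EPSG:26917 = NAD83 / UTM zone 17N, for the Terraserver overlays
        CoordinateReferenceSystem dst = crsFactory.createFromName("EPSG:26917");
        CoordinateTransform xform = new CoordinateTransformFactory().createTransform(src, dst);

        ProjCoordinate in = new ProjCoordinate(2_400_000.0, 250_000.0); // sample easting/northing in ft
        ProjCoordinate out = new ProjCoordinate();
        xform.transform(in, out);
        System.out.printf("UTM83-17: %.2f E, %.2f N%n", out.x, out.y);
    }
}
```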

Actually PAMAP has done this step already. TIFF grids are available for download. However, I'm interested in the gridding algorithm and a possible implementation on a Hadoop cluster, so PAMAP is a convenient test set with a solution already available for comparison. It is also nice to think about going directly from raw .las data to my end goal of a set of image and mesh pyramids for the 3D viewer I am exploring for GMTI data.


Fig 1 – LiDAR scan tracks in xaml over USGS DOQ

Although each section is only about 3.5 square miles, the scans generate more than a million points per square mile. The pulse rate achieves roughly 1.4 ft spacing for the PAMAP data collection. My translator produces a xaml file of approximately 85MB, which is much too large to show in a browser. I just pulled some scans from the beginning and the end to show location and get an idea of scale.

The interesting aspect of LiDAR is the large amount of data collected which is accessible to the public. In order to really make use of this in a web viewer I will need to subtile and build sets of image pyramids. The scans are not orthogonal, so the next task will be to grid the raw data into an orthogonal cell set. Actually there will be three grids: one for the elevation, one for the classification, and one for the reflectance value. With gridded data sets I will be able to build xaml 3D meshes overlaid with the reflectance or classification. The reflectance and classification will be translated to some type of image file, probably png.
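As a first cut at the gridding, a simple binning pass would do: drop each point into its cell and keep the mean elevation. A minimal single-machine sketch (the cell size handling and NaN marker for empty cells are my assumptions, not a settled design):

```java
// Minimal point-to-grid binning sketch: average z per cell.
// Assumes points are already reprojected to a metric CRS (e.g. UTM83-17).
public class Gridder {
    final double minX, minY, cellSize;
    final int cols, rows;
    final double[] zSum;
    final int[] count;

    Gridder(double minX, double minY, double cellSize, int cols, int rows) {
        this.minX = minX; this.minY = minY; this.cellSize = cellSize;
        this.cols = cols; this.rows = rows;
        this.zSum = new double[cols * rows];
        this.count = new int[cols * rows];
    }

    void add(double x, double y, double z) {
        int c = (int) ((x - minX) / cellSize);
        int r = (int) ((y - minY) / cellSize);
        if (c < 0 || c >= cols || r < 0 || r >= rows) return; // outside this section
        zSum[r * cols + c] += z;
        count[r * cols + c]++;
    }

    // Mean elevation per cell; NaN marks empty cells to be filled by interpolation later.
    double elevation(int c, int r) {
        int n = count[r * cols + c];
        return n == 0 ? Double.NaN : zSum[r * cols + c] / n;
    }
}
```

The classification grid would get a similar pass, but with a majority vote per cell instead of a mean, since classification codes don't average.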

Replacing the PA83-SF with a metric UTM base gives close to 3048m x 3048m per PAMAP section. If the base cell starts at 1 meter, and using a 50 cell x 50 cell patch, the pyramid will look like this (the short loop after the table reproduces the numbers):
dim    cells/patch   patch size         approx faces/section
1m     50x50         50m x 50m             9,290,304
2m     50x50         100m x 100m           2,322,576
4m     50x50         200m x 200m             580,644
8m     50x50         400m x 400m             145,161
16m    50x50         800m x 800m              36,290
32m    50x50         1600m x 1600m             9,072
64m    50x50         3200m x 3200m             2,268
128m   50x50         6400m x 6400m               567
256m   50x50         12800m x 12800m             141
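The faces/section column is just (3048 / dim)^2 rounded down; a throwaway loop reproduces the table:

```java
public class PyramidLevels {
    public static void main(String[] args) {
        double section = 3048.0; // 10000 ft PAMAP section in meters
        for (int dim = 1; dim <= 256; dim *= 2) {
            double cellsPerSide = section / dim;
            long faces = (long) (cellsPerSide * cellsPerSide);
            System.out.printf("%4dm  50x50  %6dm x %-6dm  %,12d%n",
                    dim, dim * 50, dim * 50, faces);
        }
    }
}
```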

There will end up being two image pyramids and a 3D xaml mesh pyramid, with the viewer able to move from one level to the next based on zoom scale. PAMAP also has high resolution aerial imagery which could be added as an additional overlay.

Again this is a large amount of data, so pursuing the AWS cluster thoughts from my previous posting, it will be interesting to build a Hadoop HDFS cluster. I did not realize that there is actually a public Hadoop AMI available with EC2 specific scripts that can initialize and terminate sets of EC2 instances. The Hadoop cluster tools are roughly similar to the Google GFS, with a master node and as many slaves as desired. Hadoop is an Apache project started by the Lucene team, inspired by the Google GFS. The approach is useful for processing data sets larger than available disk space, or 160GB on a small EC2 instance. The default instance limit for EC2 users is currently 20. With permission this can be increased, but going above 100 seems to be problematic at this stage. Using a Hadoop cluster of 20 EC2 instances still provides a lot of horsepower for processing data sets like LiDAR. The cost works out to 20 x $0.10 per hour = $2.00/hr, which is quite reasonable for finite workflows. Hadoop EC2 AMI
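To sketch how the gridding might map onto a Hadoop cluster: the mapper emits a (cell id, z) pair for each point and the reducer collapses each cell to a mean elevation. A rough sketch using the Hadoop mapreduce API, assuming the points have already been dumped as one x,y,z line per record (real input would be the binary .las records behind a custom InputFormat):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one "x,y,z" line in, (cellId, z) out. CELL_SIZE and the
// section origin are placeholders for illustration.
public class GridMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    static final double CELL_SIZE = 1.0, MIN_X = 0, MIN_Y = 0;

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        double x = Double.parseDouble(f[0]);
        double y = Double.parseDouble(f[1]);
        double z = Double.parseDouble(f[2]);
        long col = (long) ((x - MIN_X) / CELL_SIZE);
        long row = (long) ((y - MIN_Y) / CELL_SIZE);
        ctx.write(new Text(row + ":" + col), new DoubleWritable(z));
    }
}

// Reducer: mean z per cell.
class GridReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text cell, Iterable<DoubleWritable> zs, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0;
        long n = 0;
        for (DoubleWritable z : zs) { sum += z.get(); n++; }
        if (n > 0) ctx.write(cell, new DoubleWritable(sum / n));
    }
}
```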

This seems to be a feasible approach to gridding and pyramid building on large data resources like PAMAP. As PAMAP paves the way, other state and federal programs will likely follow, with comprehensive sub-meter elevation and classification available for large parts of the USA. Working out the browser viewing requires a large static set of pre-tiled images and xaml meshes. The interesting part of this adventure is building and utilizing a supercomputer on demand. Hadoop makes it easier, but this is still speculative at this point.


Fig 2 – LiDAR scan tracks in xaml over USGS DRG

I still have not abandoned the pipeline approach to WPS either. 52North has an early implementation of a WPS written in Java. Since it can be installed as a war, it should be simple to start up a pool of EC2 instances using an AMI with this WPS war installed. The challenge, in this case, is to make a front end in WPF that would allow the user to wire up the instance pool in different configurations. Each WPS would be configured with an atomistic service which is then fed by a previous service on up the chain to the data source. Conceptually this would be similar to the pipeline approach used in JAI, which builds a process chain but leaves it empty until a process request is initialized. At that point the data flows down through the process nodes and out at the other end. Conceptually this is quite simple: one AMI per WPS process and a master instance to push the data down the chain. The challenge will be a configuration interface to select a WPS atom at each node and wire the processes together.
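To make the chaining idea concrete, the master instance could simply forward each WPS response body into the next node's Execute request. A bare-bones sketch (the node URLs are placeholders, and passing the raw XML straight through is my assumption, not the 52North API):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class WpsChain {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical pool of single-process WPS nodes, one EC2 instance each.
        List<String> nodes = List.of(
                "http://node1.example/wps", "http://node2.example/wps");
        HttpClient client = HttpClient.newHttpClient();

        String payload = "<wps:Execute .../>"; // initial Execute request (elided)
        for (String node : nodes) {
            HttpRequest req = HttpRequest.newBuilder(URI.create(node))
                    .header("Content-Type", "text/xml")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            // Each node's response becomes the next node's input document.
            payload = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }
        System.out.println(payload); // final result out the end of the pipe
    }
}
```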

I am open to suggestions on a reasonably useful WPS chain to use in experimentations.

AWS is an exciting technology and promises a lot of fun for computer junkies who have always had a hankering to play with supercomputing, but didn’t quite make the grade for a Google position.
