Visualizing large data sets with maps is an ongoing concern these days. Just ask the NSA, note this federal vehicle tracking initiative reported in the LA Times, or consider this SPD mesh network for tracking any MAC address wandering by.
“There was of course no way of knowing whether you were being watched at any given moment. How often, or on what system, the Thought Police plugged in on any individual wire was guesswork. It was even conceivable that they watched everybody all of the time. But at any rate they could plug in your wire whenever they wanted to.”
George Orwell, 1984
On a less intrusive note, large data visualization is also of interest to anyone dealing with BI, or to anyone simply fascinated with massive public data sets such as the Twitter universe. Web maps are the way to go for public distribution, and all web apps face the same set of issues when dealing with large data sets:
1. Latency of data storage queries, typically SQL
2. Latency of services for mediating queries and data between the UI and storage.
3. Latency of the internet
4. Latency of client side rendering
Web Mapping Limitation
Although Bing Maps is a powerful mapping API, rendering performance degrades with the number of vector entities in an overlay. Zoom and pan navigation performs smoothly on a typical client up to a couple of thousand points or a few hundred complex polylines and polygons. Beyond these limits, other approaches are needed for visualizing geographic data sets. This client side limit is necessarily fuzzy, as there is a wide variety of client hardware out in user land, from older desktops and mobile phones to powerful gaming rigs.
Large data Visualization Approaches
1) Tile Pyramid – The Bing Maps Ajax v7 API offers a tileLayer resource that handles overlays of tile pyramids using a quadkey nomenclature. Data resources are precompiled into sets of small images called a tile pyramid which can then be used in the client map as a slippy tile overlay. This is the same slippy tile approach used for serving base Road, Aerial, and AerialWithLabels maps, similar to all web map brands.
Pro: Fast performance
- Server side latency is eliminated by pre-processing tile pyramids
- Internet streaming is reduced to a limited set of png or jpg tile images
- Client side rendering is reduced to a small set of images in the overlay
Con: Static data – tile pyramids are pre-processed
- data cannot be real time
- Permutations limited – storage and time limitations apply to queries that have large numbers of permutations
- Storage capacity – tile pyramids require large storage resources when provided for worldwide extents and full 20 zoom level depth
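The quadkey nomenclature mentioned above is straightforward to compute. As a sketch, following the published Bing Maps tile system algorithm, each quadkey digit interleaves one bit of the tile x and y coordinates, and a lat/lon can be mapped to a tile via the Web Mercator projection:

```javascript
// Convert tile x, y and zoom level to a Bing Maps quadkey.
// Each digit (0-3) combines one bit of tileX and one bit of tileY.
function tileXYToQuadKey(tileX, tileY, level) {
  let quadKey = "";
  for (let i = level; i > 0; i--) {
    let digit = 0;
    const mask = 1 << (i - 1);
    if ((tileX & mask) !== 0) digit += 1;
    if ((tileY & mask) !== 0) digit += 2;
    quadKey += digit;
  }
  return quadKey;
}

// Map lat/lon to tile coordinates at a zoom level (256px tiles,
// map width = 256 * 2^level pixels, Web Mercator projection).
function latLonToTileXY(lat, lon, level) {
  const sinLat = Math.sin(lat * Math.PI / 180);
  const x = (lon + 180) / 360;
  const y = 0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI);
  const mapSize = 256 * Math.pow(2, level);
  return {
    tileX: Math.floor(Math.min(Math.max(x * mapSize, 0), mapSize - 1) / 256),
    tileY: Math.floor(Math.min(Math.max(y * mapSize, 0), mapSize - 1) / 256)
  };
}
```

For example, tile (3, 5) at level 3 yields quadkey "213", the example used in the Bing Maps tile system documentation.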
2) Dynamic tiles – this is a variation of the tile pyramid that creates tiles on demand at the service layer. A common approach is to provide dynamic tile creation with SQL or file based caching. Once a tile has been requested it is then available for subsequent queries directly as an image. This allows lower levels of the tile pyramid to be populated only on demand reducing the amount of storage required.
Pro:
- Can handle a larger number of query permutations
- Server side latency is reduced by caching tile pyramid images (only the first request requires generating the image)
- Internet streaming is reduced to a limited set of png tile images
- Client side rendering is reduced to a small set of images in the overlay
Con:
- Static data – dynamic data must still be refreshed in the cache
- Tile creation performance is limited by server capability and can be a problem with public facing high usage websites.
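A minimal sketch of the caching logic behind a dynamic tile service (the `renderTile` function here is a hypothetical stand-in for querying the data store and rasterizing a 256x256 tile image; it is injected so the cache itself stays generic):

```javascript
// Dynamic tile service sketch: tiles are keyed by quadkey and rendered
// only on first request; subsequent requests are served from the cache.
function makeTileCache(renderTile) {
  const cache = new Map(); // quadkey -> tile image bytes
  let renders = 0;
  return {
    getTile(quadKey) {
      if (!cache.has(quadKey)) {           // first request: render and store
        cache.set(quadKey, renderTile(quadKey));
        renders++;
      }
      return cache.get(quadKey);           // later requests hit the cache
    },
    renderCount: () => renders             // for observing cache behavior
  };
}
```

A production version would typically back the Map with file or SQL storage and add an expiry policy so refreshed data eventually replaces stale tiles.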
3) Hybrid – This approach splits the zoom level depth into at least two sections. The lowest levels, with the largest extents, contain the majority of a data set’s features and are provided as a static tile pyramid. The higher zoom levels, comprising smaller extents with fewer points, can utilize the data as vectors. A variation of the hybrid approach is a middle level populated by a dynamic tile service.
Pro:
- Fast performance – although not as fast as a pure static tile pyramid, it offers good performance through the entire zoom depth.
- Allows fully event driven vectors at higher zoom levels on the bottom end of the pyramid.
Con:
- Static data at larger extents and lower zoom levels
- Event driven objects are only available at the bottom end of the pyramid
4) Heatmaps – Heatmaps use color gradient or opacity overlays to display data density. Their advantage is the data reduction performed by the aggregating algorithm. To determine the color/opacity of a data set at a location, the data is first aggregated by either a polygon or a grid cell. The sum of the data in a given grid cell is then applied to the color gradient for that cell. If heatmaps are rendered client side, performance is good only up to the latency limits of server side queries, internet bandwidth, and local rendering.
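A minimal sketch of the grid-cell aggregation step, with the cell size in degrees for simplicity (a production overlay would typically grid in pixel or mercator space):

```javascript
// Aggregate points into fixed-size grid cells and normalize each cell's
// count to 0..1 so a color gradient or opacity ramp can be applied.
function gridHeatmap(points, cellSize) {
  const cells = new Map(); // "col:row" -> point count
  for (const p of points) {
    const key = Math.floor(p.lon / cellSize) + ":" + Math.floor(p.lat / cellSize);
    cells.set(key, (cells.get(key) || 0) + 1);
  }
  const max = Math.max(...cells.values());
  return new Map([...cells].map(([k, n]) => [k, n / max]));
}
```

The data reduction is the point: however many raw records fall in a cell, only one normalized intensity per cell reaches the renderer.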
Grid Pyramids – Server side gridding
Hybrid server side gridding offers significant performance advantages when coupled with pre-processed grid cells. One technique of gridding processes a SQL data resource into a quadkey structure. Each grid cell is identified by its unique quadkey and contains the data aggregate at that grid cell. A grid quadkey sort by length identifies all of the grid aggregates at a specific quadtree level. This allows the client to efficiently download the grid data aggregates at each zoom level and render locally on the client in an html5 canvas over the top of a Bing Maps view. Since all grid levels are precompiled, resolution of cells can be adjusted by Zoom Level.
Pro:
- Efficient display of very large data sets at wide extents
- Can be coupled with vector displays at higher zoom levels for event driven objects
Con: gridding is pre-processed
- real time data cannot be displayed
- storage and time limitations apply to queries that have large numbers of permutations
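The quadkey-based gridding described above can be sketched as follows: each point's deepest quadkey is truncated to every level, so selecting aggregate keys of a given length yields the pre-computed grid for that zoom level. Names here are illustrative:

```javascript
// Build a grid pyramid: count points per quadkey at every level by
// truncating each point's deepest quadkey to a prefix.
function buildGridPyramid(quadKeys, maxLevel) {
  const agg = new Map(); // quadkey -> aggregate count at that cell
  for (const qk of quadKeys) {
    for (let level = 1; level <= maxLevel; level++) {
      const key = qk.substring(0, level);
      agg.set(key, (agg.get(key) || 0) + 1);
    }
  }
  return agg;
}

// Selecting by key length is the "sort by length" step: it pulls out
// all grid aggregates for one quadtree level in a single pass.
const gridForLevel = (agg, level) =>
  [...agg].filter(([key]) => key.length === level);
```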
5) Thematic Maps – Thematic maps use spatial regions such as states or zipcodes to aggregate data into color coded polygons. Data is aggregated for each region and color coded to show its value. A hierarchy of polygons allows zoom levels to switch to more detailed regions at closer zooms. An example hierarchy might be Country, State, County, Sales Territory, Zipcode, Census Block.
Pro:
- Large data resources are aggregated into meaningful geographic regions.
- Analysis is often easier using color ranges for symbolizing data variation
Con:
- Rendering client side is limited to a few hundred polygons
- Very large data sets require pre-processing data aggregates by region
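A sketch of the region aggregation and color classing; the region codes, class breaks, and color ramp below are all illustrative:

```javascript
// Aggregate a measure by region code, then bucket each region's total
// into a color class for a thematic (choropleth) map.
function thematicColors(records, breaks, colors) {
  const totals = new Map(); // region code -> summed value
  for (const r of records) {
    totals.set(r.region, (totals.get(r.region) || 0) + r.value);
  }
  // A value falls into the first class whose break it does not exceed.
  const colorFor = (v) => {
    let i = 0;
    while (i < breaks.length && v > breaks[i]) i++;
    return colors[i];
  };
  return new Map([...totals].map(([region, v]) => [region, colorFor(v)]));
}
```

Each region polygon is then filled with its assigned class color; only one aggregate per region crosses the wire, regardless of how many raw records it summarizes.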
6) Future trends
Big Data visualization is an important topic as the web continues to generate massive amounts of data useful for analysis. There are a couple of technologies on the horizon that help visualization of very large data resources.
A. Leverage of client side GPU
This sample shows the speed of pan/zoom rendering with 30,000 random points, which would overwhelm typical JS shape rendering. Performance remains good up to about 500,000 points, per Brendan Kenny. Complex shapes need to be built up from triangle primitives. Tessellation rates for polygon generation approach 1,000,000 triangles per 1000ms using libtess. Once tessellated, the immediate mode graphics pipeline can navigate at up to 60fps. Sample code is available on GitHub.
This performance is achieved by leveraging the client GPU. Because immediate mode graphics is a powerful animation engine, time animations can be used to uncover data patterns and anomalies, as well as to make some really impressive dynamic maps like this Uber sample. Unfortunately all the upstream latency remains: collecting the data from storage and sending it across the wire. Since we’re talking about larger sets of data, this latency is more pronounced. Once data initialization finishes, client side performance is amazing. Just don’t go back to the server for new data very often.
Pro:
- Good client side navigation performance up to about 500,000 points
Con:
- requires a WebGL enabled browser
- requires GPU on the client hardware
- subject to latency issues of server query and internet streaming
- WebGL tessellation triangle primitives make display of polylines and polygons complex
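The CPU-side preparation for this kind of WebGL point rendering can be sketched as below: project lat/lon into Web Mercator once, pack the result into a typed array, and hand it to the GPU. Per-frame pan/zoom then reduces to updating a matrix uniform, which is why navigation over hundreds of thousands of points stays cheap. This is a sketch of the general technique, not the sample's actual code:

```javascript
// Project lat/lon to Web Mercator [0,1] space and pack into a
// Float32Array ready for gl.bufferData as a vertex buffer.
function packPoints(points) {
  const buf = new Float32Array(points.length * 2);
  points.forEach((p, i) => {
    const sinLat = Math.sin(p.lat * Math.PI / 180);
    buf[i * 2] = (p.lon + 180) / 360;                                             // x
    buf[i * 2 + 1] = 0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI); // y
  });
  return buf;
}
// Upload once, then draw every frame with gl.drawArrays(gl.POINTS, 0, n):
//   gl.bufferData(gl.ARRAY_BUFFER, packPoints(points), gl.STATIC_DRAW);
```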
Note: IE11 added WebGL capability, which is a big boost for the web. There are still some glitches, however, and gl_PointSize in a shader is broken for simple points like this sample.
B. Leverage of server side GPU
MapD – Todd Mostak has developed a GPU based spatial query system called MapD (Massively Parallel Database), a new database in development at MIT. The system uses graphics processing units (GPUs) to parallelize computations; some statistical algorithms run 70 times faster compared to CPU-based systems like MapReduce. A MapD server costs around $5,000 and runs on the same power as five light bulbs. MapD runs at between 1.4 and 1.5 teraflops, roughly equal to the fastest supercomputer in 2000, and uses SQL to query data. Mostak intends to take the system open source sometime in the next year.
Bing Test shows an example of tweet points over Bing Maps and illustrates the performance boost from the MapD query engine. Each zoom or pan results in a GetMap request to the MapD engine that queries millions of tweet point records (81 million tweets Oct 19 – Oct 30), generating a viewport png image for display over Bing Map. The server side query latency is amazing considering the population size of the data. Here are a couple of screen capture videos to give you the idea of the higher fps rates:
Interestingly, IE and Firefox handle the cache in such a way that animations up to 100fps are possible. I can set a play interval as low as 10ms and the player appears to do nothing; however, 24hr x 12 days = 288 images are all downloaded in just a few seconds. Consequently, the next time through the play range the images come from cache and the animation is very smooth. Chrome on Windows 8 handles the local cache differently and won’t pull from cache the second time. In the demo case the sample runs at 500ms, or 2fps, which is kind of jumpy, but at least it works in Windows 8 Chrome with an ordinary internet download speed of 8Mbps.
Demo site for MapD: http://mapd.csail.mit.edu/
Pro:
- Server side performance up to 70x
- Internet stream latency reduced to just the viewport image overlay
- Client side rendering as a single image overlay is fast
Con:
- Source code not released, and there may be proprietary license restrictions
- Most web servers do not include GPU or GPU clusters – especially cloud instances
Note: Amazon AWS offers GPU Clusters but not cheap.
Cluster GPU Quadruple Extra Large – 22 GiB memory, 33.5 EC2 Compute Units, 2 x NVIDIA Tesla “Fermi” M2050 GPUs, 1690 GB of local instance storage, 64-bit platform, 10 Gigabit Ethernet ($2.10 per hour)
NVidia Tesla M2050 – 448 CUDA Cores per GPU and up to 515 Gigaflops of double-precision peak performance in each GPU!
C. Spatial Hadoop – http://spatialhadoop.cs.umn.edu/
Spatial Hadoop applies the parallelism of Hadoop clusters to spatial problems using the MapReduce technique made famous by Google. In the Hadoop world a problem space is distributed across multiple CPUs or servers. Spatial Hadoop adds a nice collection of spatial objects and indices. Although Azure Hadoop supports .NET, there doesn’t seem to be a spatial Hadoop in the works for .NET projects. Apparently MapD as open source would leapfrog Hadoop clusters, at least in performance per dollar.
D. In-Memory database (SQL Server 2014 Hekaton, in preview release) – Microsoft plans to enhance the next version of SQL Server with in-memory options. SQL Server 2014 in-memory options allow high speed queries for very large data sets when deployed to high memory capacity servers.
Current SQL Server In-Memory OLTP CTP2
Specifying that the table is a memory-optimized table is done using the MEMORY_OPTIMIZED = ON clause. A memory-optimized table can only have columns of these supported datatypes:
- All integer types: tinyint, smallint, int, bigint
- All money types: money, smallmoney
- All floating types: float, real
- date/time types: datetime, smalldatetime, datetime2, date, time
- numeric and decimal types
- All non-LOB string types: char(n), varchar(n), nchar(n), nvarchar(n), sysname
- Non-LOB binary types: binary(n), varbinary(n)
Since geometry and geography data types are not supported with the next SQL Server 2014 in-memory release, spatial data queries will be limited to point (lat,lon) float/real data columns. It has been previously noted that for point data, float/real columns have equivalent or even better search performance than points in a geography or geometry form. In-memory optimizations would then apply primarily to spatial point sets rather than polygon sets.
“Natively Compiled Stored Procedures The best execution performance is obtained when using natively compiled stored procedures with memory-optimized tables. However, there are limitations on the Transact-SQL language constructs that are allowed inside a natively compiled stored procedure, compared to the rich feature set available with interpreted code. In addition, natively compiled stored procedures can only access memory-optimized tables and cannot reference disk-based tables.”
SQL Server 2014 natively compiled stored procedures will not include any spatial functions. This means optimizations at this level will also be limited to float/real lat,lon column data sets.
For fully spatialized in-memory capability we’ll probably have to wait for SQL Server 2015 or 2016.
Pro:
- Reduces server side latency for spatial queries
- Enhances performance of image based server side techniques
- Dynamic Tile pyramids
- images (similar to MapD)
- Heatmap grid clustering
- Thematic aggregation
Con:
- Requires special high memory capacity servers
- It’s still unclear what performance enhancements can be expected from spatially enabled tables
The trends point to a future hybrid solution that addresses both the server side query bottleneck and the client side navigation/rendering bottleneck.
Server side –
a. In-Memory spatial DB
b. Or GPU based parallelized queries
Client side – GPU enhanced rendering with some version of WebGL type functionality that makes use of the client GPU
Techniques are available today that can accommodate large data resources in Bing Maps. Trends indicate that near-future technology can greatly increase performance and flexibility. Perhaps the sweet spot for Big Data map visualization over the next few years will look like a MapD or GPU Hadoop engine on the server communicating with a WebGL UI over 1 Gbps fiber internet.
Orwell feared that we would become a captive audience. Huxley feared the truth would be drowned in a sea of irrelevance.
Amusing Ourselves to Death, Neil Postman
Of course, in America, we have to have the best of both worlds. Here’s my small contribution to irrelevance: