Preparation for US 2020 Census is underway at this mid-decennial point and we’ll see activity ramping up over the next few years. Will 2020 be the last meaningful decennial demographic data dump? US Census has been a data resource since 1790. It took a couple centuries for Census data to migrate into the digital age, but by Census 2000, data started trickling into the internet community. At first this was simply a primitive ftp data dump, ftp2.census.gov, still very useful for developers, and finally after 2011 exposed as OGC WMS, TigerWeb UI, and ESRI REST.
However, static data in general, and decennial static data in particular, is fast becoming anachronistic in the modern era. Surely the NSA data tree looks something like phone number JOIN Facebook account JOIN Twitter account JOIN social security id JOIN bank records JOIN IRS records JOIN medical records JOIN DNA sequence….. Why should this data access be limited to a few black budget bureaus? Once the data tree is altered a bit to include mobile devices, static demographics are a thing of the past. Queries in 2030 may well ask “how many 34 year old male Hispanic heads of households with greater than 3 dependents with a genetic predisposition to diabetes are in downtown Denver Wed at 10:38AM, at 10:00PM?” For that matter let’s run the location animation at 10 minute intervals for Tuesday and then compare with Sat.
“Privacy? We don’t need no stinking privacy!”
I suppose Men and Black may find location aware DNA queries useful for weeding out hostile alien grays, but shouldn’t local cancer support groups also be able to ping potential members as they wander by Star Bucks? Why not allow soda vending machines to check for your diabetic potential and credit before offering appropriate selections? BTW How’s that veggie smoothie?
Back to Old School
By late 2011 census OGC services began to appear along with some front end data web UIs, and ESRI REST interfaces. [The ESRI connection is a tightly coupled symbiotic relationship as the Census Bureau, like many government bureaucracies, relies on ESRI products for both publishing and consuming data. From the outside ESRI could pass as an agency of the federal government. For better or worse “Arc this and that” are deeply rooted in the .gov GIS community.]
For mapping purposes there are two pillars of Census data, spatial and demographic. The spatial data largely resides as TIGER data while the demographic data is scattered across a large range of products and data formats. In the basement, a primary demographic resource is the SF1, Summary File 1, population data.
“Summary File 1 (SF 1) contains the data compiled from the questions asked of all people and about every housing unit. Population items include sex, age, race, Hispanic or Latino origin, household relationship, household type, household size, family type, family size, and group quarters. Housing items include occupancy status, vacancy status, and tenure (whether a housing unit is owner-occupied or renter-occupied).”
The intersection of SF1 and TIGER is the base level concern of census demographic mapping. There are a variety of rendering options, but the venerable color themed choropleth map is still the most widely recognized. This consists of assigning a value class to a color range and rendering polygons with their associated value color. This then is the root visualization of Census demographics, TIGER polygons colored by SF1 population classification ranges.
Unfortunately, access to this basic visualization is not part of the 2010 TigerWeb UI.
There are likely a few reasons for this, even aside from the glacially slow adoption of technology at the Bureau of the Census. A couple of obvious reasons are the sheer size of this data resource and the range of the statistics gathered. A PostGIS database with 5 level primary spatial hierarchy, all 48 SF1 population value files, appropriate indices, plus a few helpful functions consumes a reasonable 302.445 GB of a generic Amazon EC2 SSD elastic block storage. But, contained in those 48 SF1 tables are 8912 demographic values which you are welcome to peruse here. A problem for any UI is how to make 8912 plus 5 spatial levels usable.
Filling a gap
Since the Census Bureau budget did not include public visualization of TIGER/Demographics what does it take to fill in the gap? Census 2010 contains a large number of geographic polygons. The core hierarchy for useful demographic visualization is state, county, tract, block group, and block.
Loading the data into PostGIS affords low cost access to data for SF1 Polygon value queries such as this:
-- block tabblock
SELECT poly.GEOM, geo.stusab, geo.sumlev, geo.geocomp, geo.state, geo.county, geo.tract, geo.blkgrp, geo.block, poly.geoid10, sf1.P0010001, geo.stlogrecno
FROM tabblock poly
JOIN sf1geo geo ON geo.geoid10 = poly.geoid10
JOIN sf1_00001 sf1 ON geo.stlogrecno = sf1.stlogrecno
WHERE geo.geocomp='00' AND geo.sumlev = '101' AND ST_Intersects(poly.GEOM, ST_GeometryFromText('POLYGON ((-104.878035974004 38.9515291859429,-104.721023973742 38.9515291859429,-104.721023973742 39.063158980149,-104.878035974004 39.063158980149,-104.878035974004 38.9515291859429))', 4269))
ORDER BY geo.state, geo.county, geo.tract, geo.blkgrp, geo.block
Returning 1571 polygons in 1466 ms. Not too bad, but surely there’s room for improvement. Where is Paul Ramsey when you need him?
Really Old School – WMS
Some considerations on the data:
A. Queries become unwieldy for larger extents with large numbers of polygons
These polygon counts rule out visualizations of the entire USA, or even moderate regions, at tract+ levels of the hierarchy. Vector mapping is not optimal here.
B. The number of possible image tile pyramids for 8912 values over 5 polygon levels is 5 * 8192 = 44,560. This rules out tile pyramids of any substantial depth without some deep Google like pockets for storage. Tile pyramids are not optimal either.
C. Even though vector grid pyramids would help with these 44,560 demographic variations, they suffer from the same restrictions as A. above.
One possible compromise of performance/visualization is to use an old fashioned OGC WMS GetMap request scheme that treats polygon types as layer parameters and demographic types as style parameters. With appropriate use of WMS <MinScaleDenominator> <MaxScaleDenominator> the layers are only rendered at sufficient zoom to reasonably limit the number of polygons. Using this scheme puts rendering computation right next to the DB on the same EC2 instance, while network latency is reduced to simple jpeg/png image download. Scaling access to public consumption is still problematic, but for in-house it does work.
There are still issues with a scale rendering approach. Since population is not very homogenous over US coverage extent, scale dependent rendering asks to be variable as well. This is easily visible over population centers. Without some type of pre-calculated density grid, the query is already completed prior to knowledge of the ideal scale dependency. Consequently, static rendering scales have to be tuned to high population urban regions. Since “fly over” US is generally less interesting to analysts, we can likely live with this compromise.
Dividing a value curve to display adequately over a viewport range can be accomplished in a few different ways: equal intervals, equal quantile, jenks natural break optimization, K-means clustering, or “other.” Leaning toward the simpler, I chose a default quantile (guarantees some color) with a ten class single hue progression which of course is not recommended by color brewer. However 10 seems an appropriate number for decennial data. I also included a jenks classifier option which is considered a better representation. The classifier is based only on visible polygons rather than the entire polygon population. This means comparisons region to region are deceptive, but after all this is visualization of statistics.
“There are three kinds of lies: lies, damned lies, and statistics.” Mark Twain
In order to manage Census data on a personal budget these compromises are involved:
1. Only expose SF1 demographic data for 2010 i.e. 8912 population value types
2. Only provide primary level polygon hierarchy – state, county, tract, blockgroup, block
3. Code a custom OGC WMS service – rendering GetMap image on the server
4. Resolution scale rendering to limit polygon counts down the polygon hierarchy
5. Provide only quantile and Jenks classifier options
6. Classifier applied only to viewport polygon selection
This is a workable map service for a small number of users. Exposing as an OGC WMS service offers some advantages. First there are already a ton of WMS clients available to actually see the results. Second, the Query, geometry parsing, and image computation (including any required re-projection) are all server side on the same instance reducing network traffic. Unfortunately the downside is that the computation cost is significant and discouraging for a public facing service.
Scaling could be accomplished in a few ways:
1. Vertical scaling to a high memory EC2 R3 instance(s) and a memory tuned PostGIS
2. Horizontal auto scaling to multiple instances with a load balancer
3. Storage scaling with pre-populated S3 tile pyramids for upper extents
Because this data happens to be read only for ten years, scaling is not too hard, as long as there is a budget. It would also be interesting to try some reconfiguration of data into NoSQL type key/value documents with perhaps each polygon document containing the 8912 values embedded along with the geometry. This would cost a bit in storage size but could decrease query times. NoSQL also offers some advantages for horizontal scaling.
The Census Bureau and its census are obviously not going away. The census is a bureaucracy with a curious inertial life stretching back to the founding of our country (United States Constitution Article 1, section 2). Although static aggregate data is not going to disappear, dynamic real time data has already arrived on stage in numerous and sundry ways from big data portals like Google, to marketing juggernauts like Coca Cola and the Democratic Party, to even more sinister black budget control regimes like the NSA.
Census data won’t disappear. It will simply be superseded.
The real issue for 2020 and beyond is, how to actually intelligently use the data. Already data overwhelms analytic capabilities. By 2030, will emerging AI manage floods of real time data replacing human analysts? If Wall Street still exists, will HFT algos lock in dynamic data pipelines at unheard of scale with no human intervention? Even with the help of tools like R Project perhaps the human end of data analysis will pass into anachronism along with the decennial Census.