Census data

I have been busy for a while now working with demographic data. The US Census Bureau (http://www.census.gov/) collects an extensive amount of population data in the decennial census, which last occurred in 2000, and is currently preparing for the 2010 census. The challenge in my case is to provide an interface to the demographic data in thematic form for use against an OWS background. I am using an SVG client, so I have full access to vector polygonal entities.

My original thought was to use WMS SLD <FeatureTypeStyle> to color fill polygons by calculated population range. Although this is possible with simple queries, it becomes cumbersome when dealing with SQL queries that JOIN tables. The census demographic data is organized into several tables of statistical fields described in a catalog matrix. The front part of each record includes the geography header, which keys to a special table of geography records. The geography table includes the geography hierarchy fields but not the actual geometry. An additional JOIN is required from concatenated geography fields to a polygon geometry table. In essence, a double JOIN is used to get from a population statistic in the catalog matrix to the actual points of the associated geometry.

Here is a simple PostGIS SQL select on a single matrix element, H004002, at the county level of geography within a spatial bbox for Summary File SF1 Census demographics:

SELECT AsSVG(g.the_geom,1,6) as geom, sf1."H004002" as field, g."NAME" as label, geo."STATE"||geo."COUNTY" as id
FROM "SF10037" sf1
JOIN "SF1geo" geo ON geo."LOGRECNO"=sf1."LOGRECNO"
JOIN "county" g ON geo."STATE"||geo."COUNTY"=g."STATE"||g."COUNTY"
WHERE geo."SUMLEV"='050' AND geo."GEOCOMP"='00'
AND g.the_geom && GeometryFromText('LINESTRING(-107.11538505554199 38.072115898132324, -104.59134674072266 40.524038314819336)',4269)

Resulting in this view

The difficulty in using an OWS service from something like GeoServer is that the database query is automatically built from an XML filter definition. I did not see a way to build the complex join query using the current filter implementation. Short of diving into the GeoServer code, I could not use an OGC standard approach to my query, so I decided to bypass the OWS approach and go directly against PostgreSQL/PostGIS. There is, however, an effort in the GeoServer community to add a complex feature capability which may address this type of issue. http://docs.codehaus.org/display/GEOTOOLS/Community+Schema+Support+and+Complex+Types

Census geography follows a tree hierarchy which subdivides geography into smaller and smaller polygonal entities. Unfortunately the tree is not homogeneous across all of the US. For example, 'Native American Lands', 'Populated Places', 'Metropolitan Areas', etc. each have a different tree. The vast majority of the US, though, follows a simple 'state -> county -> tract -> blockgroup -> block' hierarchy.

The census does not necessarily collect/publish data to the full block depth for all demographic data. For example, using Summary File SF2 with a characteristic iterator of ‘Pakistani alone’ and total population in occupied housing units, no records are available below the county geography. On the other hand using SF1 vacancy status for housing, records are available all the way down to the block level:

The thematic legend color scheme was chosen using the excellent information by Cynthia Brewer for sequential data representation: http://www.personal.psu.edu/faculty/c/a/cab38/ColorBrewer/ColorBrewer.html

I kept the class categories and color range simple by using a small 5 step interval across the data range. The interesting problem with any type of sequential thematics is choosing appropriate class ranges. The simplest approach is to take the total range and divide it by the selected number of intervals to obtain the class extents. This approach can often result in large numbers of homogeneously colored polygons, since many polygons have zero or small populations while a few have large populations. The majority of polygons will then fall into the same color class with only a few color coded further up the range. In fact, if the range spread is significant, there may be several classes with no polygons at all.
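As a rough illustration of the equal interval approach, here is a minimal Python sketch. The function names and the sample population values are made up for the example and are not taken from the census tables or from my client code.

```python
from bisect import bisect_right

def equal_interval_breaks(values, n_classes=5):
    """Return the interior break values that split [min, max] into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    return [lo + width * i for i in range(1, n_classes)]

def class_index(value, breaks):
    """Return the 0-based class of a value; a value equal to a break goes to the upper class."""
    return bisect_right(breaks, value)

# Hypothetical populations: mostly zero or small, with one large outlier.
populations = [0, 0, 0, 2, 3, 5, 8, 12, 40, 1000]
breaks = equal_interval_breaks(populations)

# Count how many polygons land in each of the 5 classes.
counts = [0] * 5
for p in populations:
    counts[class_index(p, breaks)] += 1

print(breaks)
print(counts)
```

With this sample, nine of the ten polygons share the lowest class and the three middle classes are empty, which is the skew described above.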

Another approach is to sort the polygons by population and divide the sorted list into classes containing equal numbers of polygons, which are then assigned colors. This approach classifies across the number of polygons represented instead of across the population range. The result is a more pleasing gradation of polygon color. However, problems appear immediately when the data clusters at one end of the range. Several classes end up covering nearly the same narrow band of values, while only one or two classes span all of the remaining data at the other end of the histogram. This can be seen above, where all four of the lower classes are taken up by values between 0 and 1 while only the last class picks up the high end of the population data range. The maps may be more pleasing, but the information conveyed is again skewed.
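The equal count idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name and the sample values are hypothetical, and ties are split purely by sort position.

```python
def equal_count_classes(values, n_classes=5):
    """Sort the values and cut the sorted list into n_classes groups of
    (nearly) equal size. Returns the (low, high) value range of each group."""
    ordered = sorted(values)
    count = len(ordered)
    groups = [[] for _ in range(n_classes)]
    for rank, value in enumerate(ordered):
        # The position in the sorted list, not the value, decides the class.
        groups[rank * n_classes // count].append(value)
    return [(g[0], g[-1]) for g in groups if g]

# Hypothetical populations clustered at zero with a single large value.
populations = [0, 1, 0, 0, 250, 0, 0, 1, 0, 0]
class_ranges = equal_count_classes(populations)
print(class_ranges)
```

Each class holds two polygons, but the first four classes all cover values between 0 and 1, and the last class alone stretches from 1 to 250.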

The last approach can be tweaked a bit by subdividing only the polygons containing populations above zero. By lumping all of the zero values into a single class and then subdividing the remainder of the data, where there are interesting things going on, the view is a bit less skewed for zero-dominant characteristics. Of course, the data may be dominated by some value slightly above 0, in which case the tweak is not especially helpful.
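A small Python sketch of the zero lumping tweak follows. It assumes non-negative population counts; the function name and sample data are invented for the illustration.

```python
def zero_lumped_classes(values, n_classes=5):
    """Reserve one class for zero values, then split the positive values
    into the remaining classes by equal count.
    Returns the (low, high) value range of each class."""
    has_zero = any(v == 0 for v in values)
    positives = sorted(v for v in values if v > 0)
    ranges = [(0, 0)] if has_zero else []
    n_rest = n_classes - len(ranges)
    groups = [[] for _ in range(n_rest)]
    for rank, value in enumerate(positives):
        groups[rank * n_rest // len(positives)].append(value)
    ranges.extend((g[0], g[-1]) for g in groups if g)
    return ranges

# Hypothetical zero-dominant data set.
populations = [0, 0, 0, 0, 0, 0, 3, 7, 12, 20, 45, 80, 150, 250]
class_ranges = zero_lumped_classes(populations)
print(class_ranges)
```

Here the six zero polygons share one class and the remaining four classes spread across the positive values. If the data were dominated by, say, a value of 1 instead of 0, the same clustering problem would reappear in the positive classes.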

The issues of choropleth mapping are quite complex and even of interest to mathematicians. There are some in-depth mathematical texts analyzing the intricacies of the problem, which I am avoiding for the moment. My ultimate goal is to take advantage of SVG to render the actual histogram chart of the polygon selection sorted into population order. This chart would be overlaid with a set of simple intervals drawn as low opacity elements with draggable edges. The idea is to allow the user to carefully select non-symmetric classes from the histogram view. The histogram with its class selection is retained as an enhancement to the legend, so the viewer knows exactly the class extents shown in the thematic view. In essence this passes the complexity of classification off to the user, but also retains a record for the viewer.
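Once the user has dragged the interval edges, the classification itself reduces to a lookup against the chosen break values. The Python sketch below shows that step only; the break values stand in for whatever the draggable edges would report, and all names are hypothetical rather than part of an existing client.

```python
from bisect import bisect_right

def classify_by_breaks(values, breaks):
    """Assign each value a 0-based class index from user-chosen break values.
    A value equal to a break goes to the upper class."""
    edges = sorted(breaks)
    return [bisect_right(edges, v) for v in values]

def legend_ranges(values, breaks):
    """Return the (low, high) extent of every class for display in the legend."""
    edges = [min(values)] + sorted(breaks) + [max(values)]
    return list(zip(edges[:-1], edges[1:]))

# Hypothetical selection: populations plus four edges dragged by the user.
populations = [0, 1, 4, 9, 30, 120, 900]
user_breaks = [1, 10, 100, 500]

classes = classify_by_breaks(populations, user_breaks)
legend = legend_ranges(populations, user_breaks)
print(classes)
print(legend)
```

The legend ranges are what would be kept alongside the histogram so that a later viewer can see exactly which class extents produced the map.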

The need for a user profile and a way to save view setups is also apparent since the idea ultimately is to create some interesting analysis and make it available to other viewers anywhere in range of the internet.

Here is a screen shot of a similar interface developed a while ago for imagery clamping and thresholding, using an interactive range setting on the RGB histograms.