Get your act together, Data.gov
- Written by
- Clay Johnson
- Date
- 11/13/2009 2:38 p.m.
On May 21st, we launched Apps for America 2: the Data.gov Challenge-- the very same day that Federal CIO Vivek Kundra & Company launched data.gov. On May 26th, Kundra announced that there were hundreds of thousands of data sources just around the corner.
It is now November 13th, 2009. Right now the Raw Data Catalog in data.gov stands at an even 600 feeds. What's worse, the data is chunked up into little bits, making 600 not a particularly exciting number. For instance, nearly half the datasets (293/600) in the raw data catalog are Toxics Release Inventory datasets, broken up by individual state and outlying territory and then further broken up by individual year, from 2005 through 2008. This isn't living up to expectations, or even keeping in line with public statements. This needs to be fixed.
They've broken Geodata out into its own section-- and it contains 110,076 datasets. The same problem exists in the Geodata catalog, though. For instance, here are 387 data sets regarding the Shapefile of Adams County, broken up into years, then address ranges, blocks and county subdivisions.
The amount of public data that government has is unbelievable. Like cash in a stimulus package, public data can create not just jobs but entire industries. For example, just a week after Data.gov was launched, I wrote a post about what I'd change about it, including a list of data that could be added to the catalog. I'm saddened that much of it has not been added. What's weird is that Data.gov is a catalog, not a repository-- so adding the data isn't a huge technical burden, but an editorial one. Why isn't this happening faster-- why isn't data.gov living up to its challenges?
I can think of a few reasons:
Politics
I'd imagine that the Data.gov team cannot just link to the data, and that they're working with the different agencies in the executive branch to add data themselves, rather than running an internal editorial team. It may be that different agencies are just not embracing the program, and without significant pressure or incentive, they just won't.
Budget
The Data.gov team may not have enough budget to maintain even consistent operations or attention to it. For all we know, the only person in government who is paying consistent attention to Data.gov could be Kundra himself. Though it seems like George Thomas, the Cloud Computing Technical Architect for HHS and former office of the CIO's Enterprise Chief Architect, is paying attention to it lately.
A Pending Overhaul
Perhaps, like Recovery.gov, the first stab at Data.gov was a proof of concept-- to put something out in order to justify a larger, bigger play. One can hope. But knowing that there's a second iteration of a "Concept of Operations" (what Mr. Thomas was referring to in his tweet) is encouraging.
Whatever the case may be, work on Data.gov needs to continue, and it isn't going far enough. While I applaud the administration for at least launching a data catalog, Data.gov needs to step up its game if it is to really be considered a success. It may be that the UK shows us how it's done.
Discussion
The problem with data.gov and sites like it is that they are built on faulty premises about data:
Fiction: Data doesn't require lots of work to make it useable, so we can just upload whatever we have and it will be useful to somebody. Fact: the big useable datasets (Census, IPUMS, NLSY, all the private marketing datasets) have armies of people cleaning and integrating them. It costs money, it takes time, and it is easy to screw up.
Fiction: Links are worth something. Fact: links are worthless.
Fiction: XML adds value. Fact: ASCII tab-delimited files in consistent formats add value, while XML SUBTRACTS value.
Fiction: a good dataset is easy to use. Fact: even a good dataset (google IPUMS for an example) takes a lot of work to get to know how to manipulate, presuming one can use some sort of statistical programming language in the first place.
Fiction: simple summaries of common data are useful. Fact: everybody has already done the simple summaries. (This is just a bonus item, and doesn't apply to data.gov, but does apply to faulty thinking about data in general.)
Fiction: Federated data is just fine. Fact: Data that is curated, cleaned, and integrated into one big monolithic package is FAR better, because an analyst can then learn the conventions and names and such in one piece, and parallel categories are more likely to align.
Fiction: Good data is easy for a layperson to use. Fact: good data still requires a lot of skill. Well, maybe in nations with decent public schools a layperson can do something with data, but not in the US.
Hear, hear... Pretty disappointing.
The geodata situation is particularly bad. They used the huge geodata dump from effectively 2 or 3 data sets to be able to announce they'd hit their 100k goal. And much of that content is virtually inaccessible due to the lack of a downloadable catalog for this part of the site coupled with the weak search engine. Hope you know the name of the set you're looking for ahead of time. Eventually I wasn't even able to scrape the catalog for geo because they have a hidden max limit of pages in results... Response to my emails on getting a published list of ids or names? None.
The fact that segments of a single data set are effectively only grouped by partial name presented me with a needless roadblock as well.
I can sympathize with how difficult it must be to get the data files from all these agencies cataloged with little money and lots of red tape, but it's hard to understand why the data that is there is made so difficult to access and use.
Another thing, I searched for "Florida" and a result for toxic ... 2006 came up. Under Citation, an exe file from epa.gov is linked. This file extracted its contents to /AppData/Local/Temp and nothing else. (Prompt: 7 files unzipped successfully) ... really?
This sucks.
The same problems you're seeing with data.gov are endemic to grants.gov and many other government efforts; we do grant writing for nonprofit and public agencies and described some of the problems we're familiar with in Grants.gov Lurches Into the 21st Century.
Different part of the government, same problems.
I didn't realize HTML isn't part of this blog -- you can find the link above at http://blog.seliger.com/2008/03/27/grantsgov-lurches-into-the-21st-century/ . Sorry for posting twice.
A different discussion on the same blog post: http://news.ycombinator.com/item?id=940625
Beth Noveck, United States Deputy Chief Technology Officer for Open Government, is taking voted-on questions from Tim O'Reilly at the Web 2.0 conference next week in NYC - you can post questions here: http://w2e.crowdcampaign.com/ if you want to ask something like: "What would it take to release cohesive data sets?" or "Why can't we get all datasets in ASCII tab delimited format?"
Full disclosure: I'm a developer for the company that wrote the question-taking software (which is how I knew this was going on), but speaking just for myself I would like to see some questions about the formats and structure of the data released on Data.gov.
Are any of you actually surprised at this? If history has shown us anything, it's that government, when left to its own devices, can't produce anything viable or useful to its constituents. This is pretty much what I expected out of Uncle Sam. Which is why I'm a firm believer that it's a private company that will succeed in this endeavor.
"it's that government, when left to its own devices, can't produce anything viable or useful to its constituents. "
Some of the issues with chunking relate to file size limits for txt/csv (e.g. the historic 65k row limit in Excel, et cetera).
Some agency datasets are hundreds of MB in size, some are even larger than that - a user might not want to sit and wait for the entire national dataset to download if he is only interested in a region or state. However, Data.gov lacks any ability to perform a clip-and-ship to extract or subset data.
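To make the clip-and-ship idea concrete, here is a minimal sketch of subsetting a tab-delimited file by one column while streaming it, so the whole national file never has to sit in memory. Everything in it is an assumption for illustration: the function name `clip_rows`, the column names, and the sample rows are made up and do not come from any actual Data.gov file or service.

```python
import csv
import io


def clip_rows(source, column, wanted, delimiter="\t"):
    """Yield the header row, then only the rows whose `column` equals `wanted`."""
    reader = csv.reader(source, delimiter=delimiter)
    header = next(reader)
    index = header.index(column)  # raises ValueError if the column is missing
    yield header
    for row in reader:
        if row[index] == wanted:
            yield row


# Made-up sample standing in for a large national file.
sample = io.StringIO(
    "state\tyear\tfacility\n"
    "FL\t2006\tPlant A\n"
    "WI\t2006\tPlant B\n"
    "FL\t2007\tPlant C\n"
)

florida = list(clip_rows(sample, "state", "FL"))
```

With a real file, `source` would be an open file handle rather than the in-memory sample, and the rows could be written straight back out with `csv.writer` instead of being collected in a list.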
Further, Data.gov does not have any coherent way to relate data - e.g. a more coherent ontology that allows users to easily and transparently navigate/drilldown all of the varying related datasets, e.g. annual data series, state-by-state data series, different formats of the same data and so on.
And relating to the descriptive info - here, their metadata is severely constrained and lacking.
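As a rough sketch of what relating the chunks could look like, the per-state, per-year entries can be folded into one index per data series, so a user drills down from series to state to year instead of paging through hundreds of separate listings. The field names (`series`, `state`, `year`) and the entries themselves are hypothetical; this is not how Data.gov's metadata is actually structured.

```python
from collections import defaultdict

# Hypothetical catalog entries; the field names are illustrative only.
entries = [
    {"series": "Toxics Release Inventory", "state": "FL", "year": 2006},
    {"series": "Toxics Release Inventory", "state": "FL", "year": 2005},
    {"series": "Toxics Release Inventory", "state": "WI", "year": 2006},
    {"series": "TIGER/Line Shapefile", "state": "WI", "year": 2008},
]


def build_index(entries):
    """Group entries as series -> state -> sorted list of years."""
    index = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        index[entry["series"]][entry["state"]].append(entry["year"])
    for states in index.values():
        for years in states.values():
            years.sort()
    return index


index = build_index(entries)
```

Here the four entries collapse into two series, and `index["Toxics Release Inventory"]["FL"]` lists the years available for Florida in one place.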
It's a mixed bag. The concept and initial implementation are well founded. It's moving beyond simple registries with multiple deep links into directly linked data. (Data.govt.nz has the same old problem of linking to subsequent forms and pages instead of to data).
Building data catalogs and repositories has never been easy. For proof look at the hundreds of failed efforts across the world over the past few decades.
There are still very active efforts to make both the general data better gathered, and especially the geographic data more prevalent, findable, usable, and updated in Data.gov.
We can all help - highlight what's working, keep building mashups, applications and the like using the current data. Build comparative repositories and links to data that demonstrate what Data.gov could do and how it could do it.
We're all figuring this out together, the more we support and collaborate, the better the chances we'll end up with a success for everyone.
Making requests public, and hopefully being able to see progress and current status, would help for dataset suggestions: http://www.data.gov/suggestdataset
These are some good points, and I agree with @Jake Seliger that these problems seem to form a pattern across government web sites.
I'd love to hear your reaction to a blog post I wrote in reply here:
http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/
The TIGER example of 378 data sets - they're actually shapefiles for Adams County in lots of different states, for different map layer features, for various release years. Yes, data.gov could think of Adams County, WI 2008 TIGER as a single dataset. Disclaimer - I work at the Census but not in the GEO division.
Good story, and great comments. I posted on this and a related study out of Harvard School of Public Health suggesting that electronic health records are not yet making a difference to performance... http://bfc.ashinstitute.harvard.edu/columns/?id=52