Everything We Know About Data.gov

Now that Data.gov's out, I thought I'd take a look under the hood to see what's in there and what's missing, and to try to figure out what's coming.

First off, searching Twitter for the phrase "Data.gov congratulations" turns up enough evidence that hmiller23 and Jerad Speigel of the Phase One Consulting Group built the site. I asked them on Twitter, and they said, "It uses LAMP."

Right now the site is short on data. Federal CIOs: There are hundreds of us waiting to do interesting things with your data. Invest in putting it up on Data.gov now. You will be rewarded.

Right now the breakdown of the files looks like this:

[Chart: Data.gov Format Breakdown]

In terms of number of datasets per agency, here's what we're looking at:

[Chart: number of datasets per agency]

So the US Geological Survey represents roughly half the data (which may also be why the available datasets are in KML or ESRI formats).
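For anyone who wants to reproduce tallies like these, here is a minimal sketch in Python. It assumes the catalog has been exported to a CSV with `agency` and `format` columns. Those column names, the `tally` and `shares` helpers, and the sample rows are all hypothetical and only there for illustration; they are not real Data.gov counts or a layout Data.gov is known to publish.

```python
import csv
import io
from collections import Counter

# Made-up sample rows for illustration only; NOT real Data.gov figures.
# The column names (title, agency, format) are assumptions as well.
SAMPLE_CATALOG = """title,agency,format
Sample dataset A,US Geological Survey,KML
Sample dataset B,US Geological Survey,ESRI
Sample dataset C,National Weather Service,XML
Sample dataset D,US Geological Survey,KML
"""


def tally(catalog_file, column):
    """Count how many catalog rows share each value in the given column."""
    reader = csv.DictReader(catalog_file)
    return Counter(row[column].strip() for row in reader)


def shares(counts):
    """Convert raw counts into fractions of the total, largest first."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.most_common()}


by_format = tally(io.StringIO(SAMPLE_CATALOG), "format")
by_agency = tally(io.StringIO(SAMPLE_CATALOG), "agency")

for name, share in shares(by_agency).items():
    print(f"{name}: {share:.0%}")
```

To run it against a real export, replace the `io.StringIO(...)` arguments with an open file handle, for example `open("catalog.csv", newline="")`, and adjust the column names to match whatever the export actually uses.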

That's the thing that really must change now, and it's what will determine the success of Data.gov. There are a lot of datasets that the federal government has that have not been included: big datasets like the FACA Database and the FARA Database. And what about OMB's own Federal Budget?

But that's not stopping us. Already, in less than 24 hours, we have one entry in the contest. Go ahead and play FBI Fugitive Concentration!

Discussion

  1. Andrew Turner 05/22/2009 10:44 a.m. (permalink)

    These are very interesting - thanks for compiling.

    It's disconcerting the large number of Shapefiles - considering it's not an open format, just a well reverse-engineered one.

    However, the alternatives such as KML will have very large file sizes, and CSVs lack rich geometry for visualization.

    We need to push formats such as SQLite as an alternative for compact, rich, open data formats and tools to support their use.

  2. Alan Howlett 05/22/2009 12:36 p.m. (permalink)

    I don't understand your comment. Shapefiles can be easily loaded into Oracle and ESRI products. KML and CSV can easily store the same data as shapefiles. They are just simple formats. SQLite is a database product. Sort of comparing apples to toaster ovens.

  3. Shaun Farrell 05/22/2009 12:38 p.m. (permalink)

    So the next graph that is missing should be the raw data vs tools. Most of the data is in tool form. These tools don't allow developers to do anything. It's far from transparency. The only raw data is USGS, NWS, and NOAA. I sure the heck don't want to mash up earthquake patterns and how they affect the migration of birds!

    Why isn't any of the USA Spending data in here!

    RAW data is transparency. I can make a tool that makes bad data look good any day of the week.

  4. joe mclambert 05/22/2009 2:33 p.m. (permalink)

    i like alan howlett

  5. Ian 05/22/2009 3:18 p.m. (permalink)

    OpenStreetMap is already using quite a bit of the US government's spatial data. I would say that it is a pretty darn good visualization of government data :).

  6. Andy Cavatorta 05/22/2009 10:49 p.m. (permalink)

    Not to be nitpicky, but I'd like to hear more about how it uses LAMP. The HTTP responses identify the server as Microsoft-IIS/6.0. I checked because the URLs look RESTful and I wondered if it was built w/ Django. I was hoping to see some open technologies in the .gov domain.

  7. Marten Hogeweg 05/24/2009 11:32 p.m. (permalink)

    The shapefile format has been published by ESRI since 1998. Get the spec at: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf

  8. James Standen 05/25/2009 11:02 p.m. (permalink)

    I think having the raw data is outstanding, but agree 100% that I'm more interested in program spending than the soil samples.

    Let's hope they put lots more up in the coming months.

    They also need to get the metadata more standardized (but then again, don't we all).

    http://www.datamartist.com/datagov-looking-at-the-us-governments-data

  9. Mike Chelen 05/27/2009 7:46 p.m. (permalink)

    This post inspired another look at Data.gov collection statistics by file size, in addition to agency and format. It uses Google Docs so that it updates when new sets are added and the source spreadsheet is available.

  10. Alistair Conditioning 12/03/2009 8:51 a.m. (permalink)

    Hi Clay, great breakdown of information, hopefully they will keep adding information over the next 12 months.

