What I'd Change about Data.gov

I think Data.gov is pretty awesome. I'm generally a fan of what Vivek Kundra & Team are trying to do inside the government to make our country more transparent. Heck, we're so excited about it that we're running our own contest with cash prizes to celebrate.

But I do have a few gripes. So in the interest of full transparency, and in the hope that this will create change, here are my complaints for all to see:

1. Half the data is from the USGS.

No offense to our hard-working geologists, but seriously, copper smelters? Really? Why is the first dataset on Data.gov about copper smelters? And more importantly, every piece of data on the front catalog page of Data.gov has a 1 by it. Is that because they wanted them to appear at the top of the list? So these four datasets (Smelters, Hydrologic Remote Sensing Center, Patent Grants, and Residential Energy Consumption from 2005) were editorially chosen to lead the pack?

I want better data. There's a lot of it out there, and there's no excuse for it not to be in Data.gov. It is data that's already being maintained by the feds. The datasets I'd particularly like to see, in no particular order:

  1. How about the data in Data.gov? Put Data.gov's catalog online in a bulk format for all to see and play with.
  2. FARA
  3. FEC
  4. FACA
  5. Personal Financial Disclosure Statements for Cabinet and Key Government Employees.
  6. USASpending.gov Downloads
  7. The Federal Register -- this one's special, and a little political. But the Government shouldn't be charging $17,250 for an electronic copy of the Federal Register.
  8. Census: all of it. In something other than PDF files too, please.
  9. Bureau of Labor Statistics: all of it.
  10. Bulk data from FedBizOpps
  11. Of course, all the data on Recovery.gov

I'm sure there are more than these 11 datasets. According to the feds, there are 200,000+ more coming, so get on with it, hurry up!

2. It is a data catalog, not a data repository.

This isn't just semantics: the data on Data.gov links out to external sources that are not standardized. This means it is very hard to wrap programmatically. For instance, if you go check out the Patent Grant Bibliographic Data, you'll see that you can download the file as an XML file from uspto.gov. This means Data.gov is merely linking off to another site, rather than serving as a single source for the data.

Fine, cool, I can think of a million reasons to do that, especially that whole Separation of Powers bit. Linking out means Data.gov could maybe point to congressional information without the Executive Branch having to compel Congress to do something, or having to wait on legislation (maybe). But the problem is, even the links are non-standardized and not RESTful. What we want is to be able to presume:

  a. The Patent Data has an ID number of 3.
  b. It has XML data.
  c. Therefore, to get the XML data, we can go to data.gov/data/3/xml

And have the software point us to the data we want. This kind of REST-ish interface for the website would be particularly useful. That way we could build software similar to RubyGems for Data.gov. How cool would that be? My dream? To be able to type in:

datagov install census.economic -y 2007 -v csv

And see my terminal download that information directly onto my hard drive in a format that I, as well as my trusty computer, can understand. Data.gov can lead us there. Where we need to head is for the data to all be in the same place, in standard formats, with the assurance that it will always be there.
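To make the idea a bit more concrete, here is a rough Python sketch of how such a tool might turn a command line into one of those predictable URLs. None of this exists: the data.gov/data/<id>/<format> scheme is only the one wished for above, the catalog names and the ID for census.economic are made up for illustration (only the patent ID of 3 comes from the example above), and passing the year as a query parameter is my own assumption. It builds the URL and stops there, since there is no real endpoint to download from.

```python
import argparse

# Hypothetical base URL, following the data.gov/data/3/xml pattern wished for above.
BASE_URL = "http://data.gov/data"

# Hypothetical name-to-ID catalog. Only the patent ID (3) comes from the post;
# the names and the census ID are invented for illustration.
CATALOG = {
    "patent.grants": 3,
    "census.economic": 42,
}


def dataset_url(dataset_id, fmt):
    """Build the predictable URL for a dataset ID and a format such as 'xml'."""
    return f"{BASE_URL}/{dataset_id}/{fmt.lower()}"


def build_parser():
    """Parser for the imagined command: datagov install NAME [-y YEAR] [-v FORMAT]."""
    parser = argparse.ArgumentParser(prog="datagov")
    commands = parser.add_subparsers(dest="command", required=True)
    install = commands.add_parser("install", help="resolve a dataset by name")
    install.add_argument("dataset", help="catalog name, e.g. census.economic")
    install.add_argument("-y", "--year", type=int, help="year of the data")
    install.add_argument("-v", "--format", dest="fmt", default="xml",
                         help="file format, e.g. csv or xml")
    return parser


def resolve(argv):
    """Turn command-line arguments into the URL the tool would fetch."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.dataset not in CATALOG:
        parser.error(f"unknown dataset: {args.dataset}")
    url = dataset_url(CATALOG[args.dataset], args.fmt)
    if args.year is not None:
        # Assumption: the year travels as a query parameter.
        url += f"?year={args.year}"
    return url
```

Calling resolve(["install", "census.economic", "-y", "2007", "-v", "csv"]) hands back the URL for dataset 42 in CSV form. The point is that once the IDs and formats are predictable, the whole "package manager" is little more than a lookup table and a string template.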

3. It doesn't engage us directly.

I don't just want you to put links to the data up there. This is the biggest technical transparency and openness initiative the Government has undertaken in a long time. It is also going to be a hub for developers. So talk to us, engage us, have a blog, tell us what's going on and what to expect.

So much of dealing with data is narrative, and telling the story of Data.gov on an ongoing basis has so much value to it. We want to know what's going on on the inside: who is working on it, what the process is, and who is building it. How are you talking Federal Agencies into putting their data online? What software challenges are you facing? When there's new data, how will we know? (Here at Sunlight we built our own RSS feed for it.)

Those are my three biggest gripes. But all in all, it is a great contribution to society that I think will make amazing things happen for years to come. Heaps of praise, appreciation and gratitude for the sleepless nights that went into building this site. What would you change?

Discussion

  1. Robert Kosara 05/28/2009 3:02 p.m. (permalink)

    Good points. However, I think part of the reason data.gov works the way it does right now is so they could get it done with a reasonable amount of resources. Sure, it would be great to have all the data hosted in one place, ftp access, etc., but that would require a lot more infrastructure and people to set up and run. Given the choice between the perfect system and what we have now, I'll take the status quo.

    I also believe that we'll see many of the simple things addressed soon. RSS feeds are surely among them. I also think they'll have a blog sooner or later, but probably a read-only one - they won't be able to deal with the barrage of comments, criticisms, ideas, etc. they'll be getting.

    Without trying to be apologetic here, I think they've made some good decisions. It's not a perfect system by any means, but it's a very good start. And once they add those promised hundreds of thousands of additional datasets, I think it will look a lot better and prove a lot more useful than it is right now. Those copper smelters really aren't all that fascinating, I'll give you that.

  2. Jeff Walpole 05/28/2009 6:39 p.m. (permalink)

    something I was going to tweet about before I googled first and found this post. I think they will get there, thanks for watching out for it.

    Also, what do you guys know about plans for RDFa?

    rockstarin86: While I am a huge fan of data.gov, I agree with @sunfoundation on the usability of it at this point http://bit.ly/fVGOK

  3. Pete Skomoroch 05/28/2009 11:23 p.m. (permalink)

    Well said Clay. I think bulk.resource.org is the model to follow. Provisioning servers is a problem? Upload the raw data to Amazon S3 and provide a simple page of links to the buckets. S3 buckets can automatically be exposed as torrents as well. That is cheap and takes much less effort than the current data.gov site, so I don't buy Robert's argument. At this rate, Amazon Public Datasets will have more raw government data than data.gov, which seems odd.

    People tend to get caught up with window dressing, when all developers really need is the raw data - Web 1.0 style. Make it available and we'll clean it up. Start with raw database dumps of each table (every legacy system can at least do that), then we can go from there.

  4. Carl Malamud 05/29/2009 4:46 a.m. (permalink)

    Nice post. I'd add two items, one trivial, one real.

    Trivial: the HTML doesn't validate. Grr. 66 errors last time I looked. There ought to be a law.

    Non-trivial: we need more data. Federal Register source is a good suggestion, but I'd go beyond that to include all the Official Journals of Government (of which the Register is one of 11 pubs).

    And, while we're on the subject of more data, I would definitely include full text of all patents. Right now, PTO makes the cover pages available, but you have to pay for text. This particular database is called out in the U.S. Constitution, so it seems deserving of release.

  5. Mike Chelen 06/01/2009 5:35 p.m. (permalink)

    2 would be less of an obstacle if the collection metadata, including links to off site data sets, were available in a machine readable format. Scraping the HTML to generate an RSS feed helps, and highlights the lack of such features built in.

  6. Chris Amico 06/02/2009 11:08 a.m. (permalink)

    BLS data alone would make my life easier. If ever there were a government agency in need of an easy and well-documented API...
