New Data on Data.gov

Good news and Bad news from Data.gov

Looks like Data.gov has added a whole bunch of new feeds: they're up from 47 to 87 in two weeks, which is not a bad start. Most of the new feeds come from the IRS, and the data looks interesting: 990 forms from 501(c)(3-9) organizations.

That's the good news.

The bad news? It's pretty bad, so hold on to your britches. All the data from this data source is labeled as CSV files. But when you look closely, they're not. They're .exe files. See the Tax Year 2005 SOI Exempt Organization Study, for instance. Pesky .exe files! This isn't any good: data that comes from data.gov ought to be, at the very least, compressed in an open standard format. Plain .zip, .gz or even .bz2 files are fine. The problem is that those of us without Windows (we here in the Labs operate on Macs and Ubuntu boxes) really can't get at this data easily.

So, in short: Yay, new data! Booo .exe format!

Even worse, the data, once extracted, doesn't even seem to be CSV: it comes as .flat files with custom documentation. At least the documentation bundled with them explains how to parse them.
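For anyone who wants to turn those .flat files into real CSV, here is a minimal Python sketch. It assumes the files are fixed-width text, which we have not verified, and the field names and column offsets in `LAYOUT` are made up for illustration; the real ones would have to come from the documentation that ships with the data.

```python
import csv
import io

# Hypothetical layout: (field name, start, end), 0-based offsets, end exclusive.
# These are NOT the real IRS offsets; take them from the bundled documentation.
LAYOUT = [
    ("ein", 0, 9),
    ("name", 9, 39),
    ("total_revenue", 39, 51),
]

def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into named, whitespace-stripped fields."""
    return {name: line[start:end].strip() for name, start, end in layout}

def parse_flat(lines, layout=LAYOUT):
    """Yield one dict per non-blank line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip():
            yield parse_record(line, layout)

def flat_to_csv(lines, out, layout=LAYOUT):
    """Write the parsed records to `out` as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=[name for name, _, _ in layout])
    writer.writeheader()
    for record in parse_flat(lines, layout):
        writer.writerow(record)

# Made-up sample line matching the hypothetical layout above.
sample = "123456789" + "EXAMPLE FOUNDATION".ljust(30) + "1500000".rjust(12)
record = parse_record(sample)

buffer = io.StringIO()
flat_to_csv([sample + "\n", "\n"], buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

If the files turn out to be delimited rather than fixed-width, the slicing in `parse_record` is the only part that would need to change.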

In addition, it looks like the Data.gov team took at least one of our suggestions and added the data.gov catalog itself as a new source. Though the description says it is an XML file, that's not true: it comes in only one format, CSV. Interestingly enough, this means that a contest entry could build a better data.gov from the data catalog itself, and maybe expand it to include data sources from states or other branches of government.
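As a starting point for that kind of entry, here is a rough Python sketch that reads a catalog CSV and groups dataset titles by agency. The column names (`title`, `agency`, `format`) and the sample rows are assumptions for illustration only; check the header row of the actual catalog file before relying on them.

```python
import csv
import io
from collections import defaultdict

def index_by_agency(csv_file, agency_col="agency", title_col="title"):
    """Group dataset titles by agency. Column names are hypothetical defaults."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        index[row[agency_col].strip()].append(row[title_col].strip())
    return dict(index)

# Made-up rows standing in for the real catalog file.
sample_catalog = io.StringIO(
    "title,agency,format\n"
    "Example Dataset A,Agency One,CSV\n"
    "Example Dataset B,Agency Two,XML\n"
    "Example Dataset C,Agency One,CSV\n"
)

index = index_by_agency(sample_catalog)
for agency, titles in sorted(index.items()):
    print(f"{agency}: {len(titles)} dataset(s)")
```

With the real file you would pass an open file handle, for example `open("catalog.csv", newline="")`, where the filename is whatever you saved the download as.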

P.S. For you Mac users, StuffIt Expander can open WinZip-compressed .exe files.
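If you would rather script it, self-extracting archives of this kind are usually ordinary ZIP data with an extractor program in front, and Python's standard `zipfile` module can generally read ZIP data that has other bytes prepended to it. The sketch below is a minimal illustration under that assumption. We have not confirmed it against every file in the IRS feed, and the fake archive it builds is a stand-in, not a real WinZip executable.

```python
import io
import zipfile

def list_sfx_members(path_or_file):
    """Return member names if the file contains readable ZIP data, else None."""
    if not zipfile.is_zipfile(path_or_file):
        return None
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.namelist()

def build_fake_sfx():
    """Build an in-memory stand-in: junk 'stub' bytes followed by a ZIP archive."""
    zip_bytes = io.BytesIO()
    with zipfile.ZipFile(zip_bytes, "w") as zf:
        zf.writestr("data.flat", "123456789EXAMPLE\n")
    stub = b"MZ" + b"\x00" * 62  # placeholder bytes, not a real program
    return io.BytesIO(stub + zip_bytes.getvalue())

fake = build_fake_sfx()
names = list_sfx_members(fake)

# Read one member without ever executing the stub.
with zipfile.ZipFile(fake) as archive:
    content = archive.read("data.flat")

not_an_archive = list_sfx_members(io.BytesIO(b"just some text, no ZIP data here"))
print(names, content, not_an_archive)
```

For a downloaded file you would pass its path instead of the in-memory object, and call `extractall()` on the `ZipFile` to unpack everything into a directory of your choosing.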

Discussion

  1. Robert Kosara 06/02/2009 3:10 p.m.

    Works with the command-line unzip utility on the Mac too, so it should work in Linux as well. These .exe files are usually just ZIP files with a little expander program tacked on; they're not as bad as they seem. Yeah, I'm the apologist again, sorry ;) But this is actual data, not PDFs or anything.

  2. Robert Kosara 06/02/2009 3:26 p.m.

    The much bigger problem is that most of the data inside these ZIPs/EXEs is .doc and .xls files, rather than .pdf and .csv.

  3. Michael E. Driscoll 06/02/2009 4:03 p.m.

    I think this points to the difference between making data available versus making it accessible.

    Making raw data available isn't hard -- you simply upload whatever bundle of bits (in this case, a bunch of .exe files) and voila, you're finished.

    But making data accessible has a tried and true name: publishing. Publishing is hard, and publishing data is arguably harder, as anyone who has ever prepared a data set to share with colleagues, collaborators, or just to post on a web site can tell you.

    Data.gov is a good idea, and I'm a big fan of the initiative, but it will have to overcome some of these burdens if it's going to gain traction.

  4. Pete Skomoroch 06/02/2009 5:54 p.m.

    I'm going to side with Data.gov on this one.

    The last thing we should do is discourage them from putting more data up quickly... just label it as "other" instead of CSV. If someone writes a parser, they should be able to submit it, and data.gov can make csv, xml, json available when they are ready.

    Raw, ugly data first, nice clean data later ;)

    That said, we will likely need an army of COBOL programmers and Binary file archaeologists to clean things up.
