Recent Posts tagged 'data'
We Don't Need a GitHub for Data
- Written by
- Tom Lee
- Date
- 08/05/2010 12:04 p.m.
- Comments:
- 7
There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a "GitHub for data":
It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.
With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.
[...]
Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?
On his own blog, Derek pushed back a bit:
[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.
[...]
What I’m saying is that the very act of what Clay describes as a hassle:
A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.
Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
I think there's a lot to what Derek is saying. Understanding what an MSA (metropolitan statistical area) is, or how to match Census data up against information that's been geocoded by ZIP code -- these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American FactFinder tool, but if that's not up to telling you what you want, you're going to have to spend some time hunting down the appropriate FTP site and an explanation of its organization -- Clay's right that this is a pain. But it's nothing compared to the challenge of figuring out how to use the data properly. It can be daunting.
But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.
Explore the House's Expenditures
- Written by
- Eric
- Date
- 04/28/2010 3:54 p.m.
- Comments:
- 2
We've updated our House disbursement data to include a "bioguide ID" for each row pertaining to a legislator's office. For more information on why we did that, and how you can use it, read on.
Some of you may know that the House began posting its statements of disbursements online in November of last year. You can find them at disbursements.house.gov in PDF form. We at Sunlight parsed these PDFs and published the data ourselves in a structured format, for easy searchability.
It still hasn't been easy to link this dataset up to others.
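As a rough illustration of why a shared identifier helps, here is a minimal Python sketch of joining disbursement rows to a second dataset by bioguide ID. The column names, the sample rows, and the IDs are all invented for the example; none of them come from the published data.

```python
import csv
import io

# Hypothetical sample rows; column names and values are invented for illustration.
DISBURSEMENTS_CSV = """bioguide_id,office,amount
X000001,HON. EXAMPLE ONE,1200.50
X000002,HON. EXAMPLE TWO,830.00
X000001,HON. EXAMPLE ONE,99.50
"""

LEGISLATORS_CSV = """bioguide_id,name,state
X000001,Example One,VA
X000002,Example Two,OH
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_by_legislator(disbursements, legislators):
    """Sum disbursement amounts per legislator, joined on bioguide_id."""
    by_id = {row["bioguide_id"]: row for row in legislators}
    totals = {}
    for row in disbursements:
        legislator = by_id.get(row["bioguide_id"])
        if legislator is None:
            continue  # row is not tied to a known legislator's office
        key = (legislator["name"], legislator["state"])
        totals[key] = totals.get(key, 0.0) + float(row["amount"])
    return totals


totals = total_by_legislator(load_rows(DISBURSEMENTS_CSV), load_rows(LEGISLATORS_CSV))
for (name, state), amount in sorted(totals.items()):
    print(f"{name} ({state}): {amount:.2f}")
```

Without a common key like this, the join would have to fall back on matching office names as free text, which is far more error-prone.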
Government Data Sets - Managing Expectations
- Written by
- Jessy Cowan-Sharp
- Date
- 04/07/2010 5:51 p.m.
- Comments:
- 9
US Open Government plans were released today. As part of this process, federal agencies are beginning to release data sets publicly in ways they never have before. Some substantial and thought-provoking blog posts over the last few weeks have discussed how government can do open data well.
There are significant cultural and social sticking points that have yet to be addressed in releasing data openly. A discussion with a colleague from NASA last week confirmed how far away most agencies are from the luxury of considering the innovative ideas for data set management available to them. Here's why:
Open Data We're Thankful For
- Written by
- Clay Johnson
- Date
- 11/29/2009 1:01 p.m.
- Comments:
- 0
This is a little late, but late is better than never for giving thanks. And this year, we've got a lot to be thankful for. Open data in open government is advancing by leaps and bounds. The Vice President is talking about data quality in government on The Daily Show. ABC News, along with the controversy over Recovery.gov, has brought government data into prime time. It's been a long time since transparency has seen this kind of attention.
At this time of Thanksgiving here in the United States I wanted to give thanks for the new and changing government datasets that we have now. Some are truly amazing.
Recovery.gov's Systemic Failure
- Written by
- Clay Johnson
- Date
- 11/10/2009 1:55 p.m.
- Comments:
- 5
The new Recovery.gov, which we've written about and even nearly bid on, has certainly taken the government huge steps forward in disclosing information, but it is not without controversy. The press is questioning the program, pointing to wasteful spending or bad data. The White House fired back with a "reality check" (their words), saying that few of the reports have gone through the "extensive three-week review" and that the data might be particularly misleading at this point.
Data Commons Matchbox
- Written by
- Jeremy
- Date
- 09/30/2009 2:19 p.m.
- Comments:
- 1
Earlier this year we started on the Data Commons, a project to merge open government data sets to make them more searchable and usable. Our goal for the initial release is to load state and federal campaign contribution data from The Center for Responsive Politics and The National Institute for Money in State Politics. Along with the raw transactional records, we will be taking the additional step of matching the entities (people, organizations, corporations, etc.) across the data sets. We'll have more posts soon with details about the Data Commons.
To assist us in this effort, we are developing Matchbox, a toolkit for the merging and matching of entities. We have big plans for Matchbox, but want to get feedback from the community as we improve it over the next few months.
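To give a sense of what matching entities across data sets involves, here is a deliberately naive Python sketch that groups records whose names are identical after normalization. This is not how Matchbox works; the suffix list, the source labels, and the sample names are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Suffixes dropped during normalization; this list is an assumption for the example.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common corporate suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def group_entities(records):
    """Group (source, name) records whose normalized names are identical."""
    groups = defaultdict(list)
    for source, name in records:
        groups[normalize(name)].append((source, name))
    return dict(groups)


# Hypothetical records; the names and source labels are invented.
records = [
    ("federal", "Acme Widgets, Inc."),
    ("state", "ACME WIDGETS INC"),
    ("state", "Example Holdings LLC"),
]

groups = group_entities(records)
for key, members in groups.items():
    print(key, "->", members)
```

Exact matching on normalized names misses misspellings and abbreviations, and it can wrongly merge distinct entities that share a name, which is part of why entity matching needs a dedicated toolkit and human review.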
Your Input Wanted on Recovery.gov Data
- Written by
- Luigi Montanez
- Date
- 09/24/2009 2:30 p.m.
- Comments:
- 1
Here at the Sunlight Labs, we've focused a lot on the recent bid on version 2.0 of Recovery.gov. This morning on the Labs mailing list, Rusty Talbot of Synteractive, one of the winning contractors, asked for input on the best way for Recovery.gov to publish its data.
Rusty wrote:
The Recovery, Accountability, & Transparency Board wishes to have an open discussion with all interested developers about how data should be made available via Recovery.gov.
As you are all aware, a new version of Recovery.gov will be released soon. From a data standpoint, the initial release of the new site will replicate existing functionality. However, the Board aims to set a new standard of transparency with this site and would therefore like to make the data available in the most convenient and straightforward way (or ways) possible so you can use and analyze official, up-to-date Recovery Act data. We need your input to achieve this goal.
Please let us know how the site could best meet your needs in terms of machine-readable data format(s) and standards, APIs, guidance, training, etc.
This is a great opportunity for all of us who work hard to make government data more open and accessible.
Dealing with Inaccurate Government Data
- Written by
- Clay Johnson
- Date
- 08/27/2009 12:56 p.m.
- Comments:
- 8
Developers are good at getting bits to line up, importing data and getting great conclusions out of it. Designers are great at making things look great and making those conclusions and bits easily digestible. But all the apps I've seen ultimately suffer from the same fatal flaw: inaccurate data.
So Many Earmarks
- Written by
- Eric
- Date
- 08/21/2009 1:01 p.m.
- Comments:
- 5
When we launched TransparencyCorps at the end of June, we ran a few small earmark campaigns to digitize little batches of earmark request letters that legislators had posted on their websites. These campaigns wrapped up very quickly. At the same time, the House decided to release earmark request letters en masse, so we no longer had to run our campaigns per legislator.
Given the demonstrated interest in earmarks, we decided to run a much larger campaign, for all the earmarks released by the House Appropriations Committee, starting with those for the Commerce, Justice, and Science Subcommittee. These were released in a single massive PDF, which I split up into individual 1- or 2-page request letters.
This campaign involved 1,183 letters and ran for 5,537 completed tasks. Total volunteer time, as measured on TransparencyCorps: over 472 hours. That's nearly 20 full days of round-the-clock work. Here are the results.
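For a sense of scale, the campaign figures above can be turned into a few derived numbers. This is a quick back-of-the-envelope sketch in Python; it treats "over 472 hours" as exactly 472, so the time-based results are lower bounds.

```python
# Figures quoted in the post.
letters = 1183
completed_tasks = 5537
volunteer_hours = 472  # reported as "over 472 hours", so treat as a floor

# Derived figures.
days_round_the_clock = volunteer_hours / 24
tasks_per_letter = completed_tasks / letters
minutes_per_task = volunteer_hours * 60 / completed_tasks

print(f"24-hour days of work: {days_round_the_clock:.1f}")
print(f"Completed tasks per letter: {tasks_per_letter:.2f}")
print(f"Average minutes per task: {minutes_per_task:.1f}")
```

The tasks-per-letter ratio shows that each letter was processed more than once on average, the kind of redundancy that lets volunteer answers be checked against each other.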
Is Government a Data Wholesaler or Retailer?
- Written by
- Clay Johnson
- Date
- 08/14/2009 1:22 p.m.
- Comments:
- 4
Imagine if Costco announced that they were going to take the Costco experience to Manhattan and open up convenience stores across the island. Further, imagine shopping at these new Costco bodegas, all of 500 square feet, with your giant cart, selecting from what the Costco bodega has to offer in this limited amount of space! At your local Costco bodega you have to choose from 400 rolls of toilet paper, 70 lbs of dehydrated mashed potatoes, or a 6-pack of giant boxes of cereal. That's pretty much all they could store in inventory at the Costco bodega because they wouldn't have room in 500 square feet for anything else. And good luck carrying all that home!
Sounds absurd, doesn't it?
