Redesigning the Government: Data.gov

One thing we’ve been most excited about here at the Sunlight Foundation is the concept of Data.gov. Later this year, new federal CIO Vivek Kundra is due to release a central repository for government data and research. While in this series we traditionally redesign existing federal websites, we thought we’d take the opportunity to design Data.gov right off the bat to show you all what we’d like to see happen.

Here's what we came up with:


Why we did it:

Providing access to government data is one of the clearest ways to be more transparent, and it is our hope that Kundra and his team nail this with Data.gov. To do so, we’re looking for these things:


  1. Bulk access to data

  2. Accountability for Data Quality

  3. Clear, understandable language

  4. Service- and developer-friendly file formats

  5. Comprehensiveness

Only raw, bulk access to data can be completely transparent. So we’re looking for a http://bulk.data.gov akin to Carl Malamud’s bulk.resource.org. This will allow developers to browse through a raw directory listing of the judicial, executive, and legislative branches, as well as independent, miscellaneous, and joint agencies, and get compressed, bulk files of data via direct download. Getting FEC data, for example, should be as easy as clicking on “Other” -> “FEC” -> “Contributions” -> 2008_summary.tar.gz. This first and arguably most important part of Data.gov doesn’t need any design. It needs to look like this:

Bulk Data Screenshot
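
For a developer, the click path described above maps directly onto a download URL. Here is a minimal Python sketch of that mapping. The base URL is the hypothetical bulk.data.gov proposed in this post, and the directory names are simply the ones from the FEC example, so treat both as assumptions and not as a real endpoint.

```python
from urllib.parse import quote

# Hypothetical base URL: the post only proposes that something like it should exist.
BULK_BASE = "http://bulk.data.gov"


def bulk_url(*parts: str) -> str:
    """Join directory names and a file name into one direct-download URL."""
    # Percent-encode every segment so a name containing spaces or slashes
    # stays inside its own path segment.
    encoded = [quote(part, safe="") for part in parts]
    return BULK_BASE + "/" + "/".join(encoded)


# The click path from the FEC example above.
print(bulk_url("Other", "FEC", "Contributions", "2008_summary.tar.gz"))
```

Because the result is a plain URL, any download tool or script could fetch the file without going through a web form.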

Secondly, we want the public to be able to comment on and rate the quality of the data government provides. The public should be able to rate, review, or comment on the data sets Data.gov publishes, just as it does books on Amazon.com. This will help Vivek Kundra and his team find slow patches and erroneous data faster than any form of government quality assurance process could. So take an Amazon.com-style approach to data:

Ratings Shot

Cataloging the data sources inside the Federal Government is not good enough. Some data sources are simply not up to par. Data sets like FARAdb are unusable in the form in which the government provides them to citizens. But we also understand that change cannot happen overnight. To make this process as efficient as possible, Government should rely on the customers of its data to pinpoint where the problems are. A reviewing system for the provided data sets does just that.
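
To make the idea concrete, here is a rough Python sketch of how star ratings could surface the data sets that most need attention. The 1-to-5 scale, the function name, and the sample reviews are all made up for illustration; nothing here describes an actual Data.gov feature.

```python
from collections import defaultdict
from statistics import mean


def worst_rated(reviews, limit=3):
    """Return (dataset, average rating) pairs, lowest average first."""
    by_dataset = defaultdict(list)
    for dataset, stars in reviews:
        by_dataset[dataset].append(stars)
    averages = [(name, mean(stars)) for name, stars in by_dataset.items()]
    # Sort by average rating, then by name so ties come out in a stable order.
    averages.sort(key=lambda pair: (pair[1], pair[0]))
    return averages[:limit]


# Made-up reviews on an assumed 1-5 star scale.
sample_reviews = [
    ("dataset-a", 1), ("dataset-a", 2),
    ("dataset-b", 4), ("dataset-b", 5),
]
print(worst_rated(sample_reviews, limit=1))
```

A list like this would give an editorial or quality team a starting queue driven by the people who actually use the data.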

We also don’t think that Data.gov can exist without an editorial staff. You need people to write about and explain the data that's being provided. Let’s face it: traditionally, the federal government has mostly written in a voice that only lawyers and government officials can understand. Take a look at Data.gov’s closest equivalent right now: Fido. Looking at the different data samples there, can you tell what any of them actually do? Could your mother? Of course not. The very language that government uses is the antithesis of transparency, so use something like this to make it more friendly and understandable:

Editorial Shot

Data.gov should build real, practical descriptions for the data that it provides. It should speak to why each data set is important and go beyond the non-transparent federal-speak that is so often used. It should feature data, blog about data, and perhaps even link off to interesting things that other people are doing with the data that comes from Data.gov. But at the heart of this, at bare minimum, Data.gov has to do a better job of explaining the data than Kundra’s first attempt at this, the DC Data Catalog.

Human understanding isn’t enough, though. The data also needs to be understood by machines, in formats that are common not only to developers but also to outside services like Google Earth or Microsoft Excel. Data.gov should make it easy for everyone to get to its data in the format that they want.

File Formats

That’s the hardest part about building a real data catalog for the Federal Government. You have databases out there that range from the 30-year-old COBOL format at the FEC to the binary Access databases that the FCC has been providing! But for Data.gov to truly be successful, it has to take these different data sources and make them available in modern data formats that developers and machines can make sense of.
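
Once a legacy source has been parsed into plain records, offering it in several modern formats is the easy part. The Python sketch below assumes that parsing has already happened and shows the same records serialized as CSV (for Excel) and JSON (for developers). The field names and the sample record are invented for illustration.

```python
import csv
import io
import json


def to_csv(records):
    """Serialize a list of dicts as CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records as indented JSON text."""
    return json.dumps(records, indent=2)


# Invented record standing in for one row parsed out of a legacy file.
records = [{"committee": "Example Committee", "year": 2008, "amount": 250}]
print(to_csv(records))
print(to_json(records))
```

The hard work stays in the step this sketch skips: reading the COBOL-era and binary sources reliably in the first place.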

If Government makes these file formats standardized, and makes the forms that request them standardized too, then groups like Sunlight Labs can create helper classes that help developers automatically browse and interact with the data on a programmatic level rather than just browsing through a web interface. Imagine if this is how you, the developer, interact with Data.gov:

Code Interface
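
As a rough idea of what such a helper class might look like, here is a Python sketch. Everything in it is hypothetical: the `DataGovClient` name, the `http://data.gov/api` base URL, the dataset path, and the `format` and `year` parameters are assumptions made for illustration, since no such API has been announced.

```python
from urllib.parse import urlencode


class DataGovClient:
    """Hypothetical helper that turns a standardized request into a URL."""

    def __init__(self, base_url="http://data.gov/api"):
        # Assumed base URL, not a real endpoint.
        self.base_url = base_url.rstrip("/")

    def dataset_url(self, dataset, fmt="csv", **filters):
        """Build the request URL for a dataset in the requested file format."""
        # Sort the filters so the same request always produces the same URL.
        params = sorted(filters.items()) + [("format", fmt)]
        return f"{self.base_url}/{dataset}?{urlencode(params)}"


client = DataGovClient()
print(client.dataset_url("fec/contributions", fmt="json", year=2008))
```

The point is that if both the formats and the request forms are standardized, one small class like this could cover every data set on the site.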

Data.gov has to be comprehensive and timely. While the Constitution calls for separation of powers, we do not believe that Data.gov, run by the Executive Branch of Government, should be limited to only Executive Branch information. It should encompass all branches of Government and every independent agency. (P.S. An OPML-based list of all government agencies, represented in the natural hierarchy of Government, should be a data feed!) And it should constantly be growing. When data isn’t available, people should be able to ask for it straight off the website. And obviously those requests should be a data feed in and of themselves.
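
An OPML feed of agencies would be simple to produce, because OPML is just nested `outline` elements. The Python sketch below builds one from a nested dictionary. The tiny hierarchy is a placeholder containing only the two agencies named in this post, not a complete or official list.

```python
import xml.etree.ElementTree as ET


def agencies_to_opml(title, hierarchy):
    """Render a nested {name: {child: {...}}} dict as an OPML 2.0 string."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")

    def add_outlines(parent, children):
        # Each agency becomes an <outline>; sub-agencies nest inside it.
        for name, sub in children.items():
            node = ET.SubElement(parent, "outline", text=name)
            add_outlines(node, sub)

    add_outlines(body, hierarchy)
    return ET.tostring(opml, encoding="unicode")


# Placeholder hierarchy for illustration only.
hierarchy = {
    "Independent Agencies": {
        "Federal Election Commission": {},
        "Federal Communications Commission": {},
    },
}
print(agencies_to_opml("Government Agencies", hierarchy))
```

Any outliner or feed reader that understands OPML could then browse the hierarchy without custom code.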

Because this is a government site, we also had to think about how the general public would interact with it. We made the navigation and search simple, hiding the more complicated requests under an advanced search button, and made the home page consumer-friendly by adding a description and a dashboard of the newest and most recently updated data available. Having these things on the home page makes the site browsable and might help users discover data, even if they weren’t searching for anything in particular.

In the end, the purpose of the site should predominantly be about the data itself, and not about conclusions that may be drawn from it. It should be clear, organized, and easy to use for anyone visiting the site.

So, here you have it, the big reveal of Data.gov:



Discussion

  1. Louise 04/16/2009 12:19 p.m. (permalink)

    Do you envision the Executive branch providing data obtained through research contractors and from the various government evaluations? Or will this only include data collected directly by the Federal government?

  2. Mary Maher 04/17/2009 11:59 a.m. (permalink)

    Flat-out inspiring!

  3. Pito Salas 04/17/2009 12:09 p.m. (permalink)

    And how about if data.gov also supports something like this: Data RSS

    That blog post describes something that I've been working on to make it much easier for information providers (like data.gov, but many others too) to make their information much more discoverable, and by doing that to remove speed bumps in connecting software apps and widgets to data providers.

    N.B. The old name for this approach was "Data RSS" but I've found this is not a helpful name, so I am in the process of renaming it to "Distributed Data Discovery."

  4. Mike Mathieu 04/17/2009 12:24 p.m. (permalink)

    I like your preemptive-design approach! A few points: 1) Bulk data access is nice if you're a well-funded foundation with a good data infrastructure for data manipulations, but a web service that data visualization people or analysts can build on without needing backend help will lead to more new visualizations and apps.

    2) Amazon-style reviews won't work unless they're tied to performance appraisals of gov't officials. If I were covering up some scam I was running in the gov't, I'd put out the worst possible data set and scare off anyone who thought they could easily use computing power to uncover me.

    3) What's the mechanism for getting data corrected or improved? Lots of commercial data providers take one gov't data source or another, fix the bugs, make it more useful for a specific purpose, and resell it. But the improvements never make it back into the public domain a la some data wiki mechanism.

    4) You make a good point that the value of data can be obscured just by having opaque descriptions of the data. That's why the descriptions of the data and possible uses should be in public hands. It's an essential watchdog function. Otherwise, again, if I'm covering up my scam, I'll just try to hide it with obscure descriptions.

  5. Daniel Gehant 04/17/2009 1:19 p.m. (permalink)

    This looks fantastic! I'm curious how this will evolve... Will data.gov prioritize government areas where no data is currently available? Or will departments that already disseminate data, albeit in a very cumbersome way, also be on the to-do list?

    Also, I've noticed that in some cases, easy one-click access to data might be too open in terms of the resources required to satisfy increasing demand (and potential misuse/abuse).

  6. Susi Alger 04/17/2009 3:31 p.m. (permalink)

    Very nice job, Clay. I think that you've laid this out clearly and concisely with an intuitive and logical system that should work very nicely. The foundation pieces are there and the functionality and details can be built out as resources allow. And I thank our lucky stars that we have someone like Kundra who can run with this.

    Thanks Sunlight!

  7. Sylvia Webb 04/17/2009 9:09 p.m. (permalink)

    This is an excellent start!

    In keeping with the E-Gov Act of 2002 and the FEA TRM, the next step should be the design and implementation of a syntax-neutral metadata model based on open standards like ANSI X12's CICA Framework or ISO 15000-5.

  8. rick 04/20/2009 7 p.m. (permalink)

    I highly recommend Tim Berners-Lee's TED 2009 talk. In this talk Tim introduces the Web of Data and Linked Open Data.

    http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html

    There's an opportunity to complement the approach to data.gov shown here with just a small part of what Tim talks about in the Ted video that can act as a catalyst for open government data.

    And the good news is that by publishing with just a few well known vocabularies data.gov can incorporate an open design principle that supports serendipitous linking of government data later.

    Dublin Core is usually a good place to start. I can talk more about some of the other vocabularies once folks check out Tim's video.

  9. J. Paul Duplantis 04/26/2009 1:56 a.m. (permalink)

    The Tim Berners Lee TED talk was excellent.

    The pursuit of moving government data from behind the curtain will provide transparency to the citizenry and increase efficiencies amongst staff.

    Access to information is one thing. Facilitating understanding through relating data is what mashed up linked data can offer and will be transformative if embraced.

    Buy in on the input will be the most important aspect and the most difficult to achieve.

  10. Christine Pierpoint 04/29/2009 11:34 a.m. (permalink)

    One other major consideration for Mr. Kundra to consider is the governance of this cross-agency site. It's one thing to design and launch a site, but quite another to ensure that agency data is effectively integrated.

    The administration will need to define, implement and enforce policies and standards so that content on data.gov meets user requirements. Then, on top of that, resources need to be allocated to address the challenges agencies will face with finding, converting, migrating and maintaining the data. Agencies currently struggle to maintain the volumes of content on their individual sites, so an unfunded mandate to standardize and migrate legacy content will ultimately fail.

    To make data.gov a real success, the Federal CIO needs to ensure that both the front-end design and the back-office operations are well planned and executed.

  11. JessicaCync 05/10/2009 10:26 a.m. (permalink)

    Wow! Thank you! I always wanted to write in my blog something like that. Can I take part of your post to my site? Of course, I will add backlink?

  12. Spoilmmon 05/20/2009 10:10 a.m. (permalink)

    I have found a webhost review site with top 5 ratings. I wonder if they provide good hosting? If anyone has heard anything negative about these host please let me know. Here is the site. webhosting

  13. Mauritius 06/10/2009 6:10 a.m. (permalink)

    Nice Job Clay. I think there is a clear and precise definition for intuition and logical system. This principle will provide us with a better understanding and facilitate our opinion making beyond the obscure description behind which many hide their scams.

  14. Scott Bryan 10/01/2009 8:19 p.m. (permalink)

    Have you considered getting the folks behind wolframaplha.com involved? They are building the tools needed to harvest the information out of the entire world's data using a model that genuinely understands the semantics of each item. They also most likely have the best tools and strategy for manipulating and representing vast collections of data simply because to those folks the entire universe is just an expression.

