Kickoff: The National Data Catalog
- Written by
- Clay Johnson
- Date
- 07/15/2009 11:24 a.m.
Sometimes you can get inspired by government. In our field it happens more than you'd think. Obviously all our new tools, like TransparencyCorps, Congrelate, and CapitolWords, have been inspired by government to a degree, but there aren't many ideas that we've actually stolen from government.
Today I'm happy to announce we're stealing an idea from our government. Data.gov is an incredible concept, and the implementation of it has been remarkable. We're going to steal that idea and make it better. Because of politics and scale, there's only so much the government is going to be able to do. There are legal hurdles and boundaries the government can't cross that we can. For instance, there's no legislative or judicial branch data inside Data.gov, and while Data.gov links off to state data catalogs, those entries aren't in the same place or format as the rest of the catalog. Community documentation and collaboration are virtual impossibilities because of the regulations that govern the way government interacts with people on the web.
We think we can add value on top of things like Data.gov and the municipal data catalogs by automatically bringing them into one system, manually curating and adding other data sources, and providing features that, well, government just can't offer. There'll be community participation so that people can submit their own data sources, and we'll also catalog non-commercial data that is derived from government data, like OpenSecrets. We'll make it so that people can create their own documentation for much of the undocumented data that government puts out, and link to external projects that work with the data being provided.
We're starting this project today, now, and will be building it out in public. Two developers here will be working on it: Luigi and David. We've set up PivotalTracker for the project and of course you can find the source. This project has three major components, and there are three separate repositories for them: the API, the web catalog, and a Ruby library for the API. These things will work in symphony. We're building our API first, and our data catalog website will run on top of it, using the ruby-datacatalog client library.
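To make that layering concrete, here is a minimal sketch of an API-first design in which the website only ever reads through a client library. It is written in Python for brevity and is not the project's Ruby code. Every name in it (`Source`, `CatalogAPI`, `CatalogClient`, `render_catalog_page`, and the fields on each entry) is hypothetical, and an in-memory list stands in for what would really be an HTTP API.

```python
from dataclasses import dataclass, asdict


@dataclass
class Source:
    """One catalog entry. Fields are illustrative, not the project's schema."""
    title: str
    url: str
    branch: str  # e.g. "executive", "legislative", "judicial"


class CatalogAPI:
    """Stand-in for the API layer: the only component that owns the data."""

    def __init__(self):
        self._sources = []

    def create_source(self, title, url, branch):
        source = Source(title, url, branch)
        self._sources.append(source)
        return asdict(source)

    def list_sources(self, branch=None):
        # Return plain dicts, as a JSON API would, optionally filtered by branch.
        return [asdict(s) for s in self._sources
                if branch is None or s.branch == branch]


class CatalogClient:
    """Stand-in for a client library: a thin wrapper around API calls."""

    def __init__(self, api):
        self._api = api

    def add(self, title, url, branch):
        return self._api.create_source(title, url, branch)

    def sources(self, branch=None):
        return self._api.list_sources(branch)


def render_catalog_page(client, branch=None):
    """Stand-in for the website: it reads only through the client."""
    lines = [f"{s['title']} <{s['url']}>" for s in client.sources(branch)]
    return "\n".join(lines)


# Placeholder entries; the titles and example.org URLs are made up.
client = CatalogClient(CatalogAPI())
client.add("Example executive dataset", "https://example.org/a", "executive")
client.add("Example legislative dataset", "https://example.org/b", "legislative")
page = render_catalog_page(client, branch="legislative")
print(page)
```

The point of the arrangement is that anything the website can do, an outside developer using the same client can do too, because neither has a private path to the data.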
In terms of timeline, we're ruthlessly ambitious, hoping to have something up after the contest ends. That's not set in stone, but we'll do our best to get there. The catalog is going to have three components to start with: an API, a web interface, and a command-line interface. If you're interested in helping out with this project, please join our Google Group. If you just want to help the Sunlight Foundation fund this project, please consider a contribution.
Both David and Luigi will be blogging updates periodically throughout the process, and we'd appreciate any feedback or help you can give. You can also submit data sources you'd like to see added to help us get started.
For updates on what's going on in the labs, you should follow me on Twitter.
Discussion
Great initiative. I agree that there are many things outside organizations can do to augment and extend the offerings of the government. It's fortunate that the government has evolved in such a way as to be just such an inspiration or platform.
Another valuable piece is federation of data. There isn't any one place where all data should come together, but rather many domain-specific repositories and toolsets. They will cultivate different communities, some about environment, or economics, or personal safety, and each has a different need and desire to fulfill.
However, each of these sites and repositories can leverage one another by sharing data, profiles, and toolsets.
I work on GeoCommons, a similar open data repository where anyone can contribute, find, and download data in a variety of formats (for example, upload a CSV and download a KML or RSS feed), along with mapping, visualization, and analysis tools.
So it would be great to collaborate on how to share information between ours and the many other systems out there, using open formats and user-centric principles.
So this is similar to something I tried to start back in the day (=2005) by encouraging everyone to get on the same page using RDF. Not that RDF doesn't have problems, but it still seems to me to be a leading approach.
More here: http://razor.occams.info/blog/2009/03/02/civic-hacking-the-semantic-web-and-visualization/
I'd like to know more about how you would mesh all the data together under the hood.
Josh:
The short answer is: We don't. Not for now-- we're more interested in cataloging the data than mashing it together. Once we have a catalog of what's where, then we can start looking for ways to fit the pieces together. Probably starting with the metadata on each data set.
Andrew: I like what you wrote: "each of these sites and repositories can leverage one another by sharing data, profiles, and toolsets." We're going to encourage community documentation of data sets so that people who use these data sources can learn from each other and build on each other's work.
Hi, I'm from the OECD. We have started to think about "publishing metadata for data", that is, the same kind of information librarians manipulate for books or journals, only applicable to data. My boss wrote a white paper about this: http://dx.doi.org/10.1787/603233448430
That's the infrastructure of our new content distribution platform, http://www.oecdilibrary.org.
Like Data.gov's model, which is based on Dublin Core, this is also an adaptation of existing standards.
But like Data.gov, we are a governmental agency and there is only so much we can do. At least we can take part in the discussion.
What I found hardest in cataloguing our own data (which is a very small collection compared to all of the USA's federal data, let alone all available data) is coming up with a universal vocabulary to describe data: what is a dataset, what is it made of, what is it part of, etc.
I really like your thoughts and I definitely agree with you. It's good that there is some place where we can comment on this topic :)
Ah crap... This is more or less exactly what I was trying to accomplish for my Apps for America 2 entry. Guess I should have gotten it up and running faster!
Oh well at least I know it was a decent idea. Perhaps I can contribute...
@Brian: Hey, an entry is an entry. Go ahead and do it anyway. There's nothing in the rules that says that you can't implement an idea that Sunlight is also implementing.
Yeah, Luigi said something similar. No worries. I may go ahead and enter it anyway (if I can actually meet the deadline). Or if I can contribute something to your project, that's good too. I was just having a bit of a face-meet-palm moment there. :)
Regardless, I'm glad something like this is going to be made available. Naturally I think it's a good idea, and I'm glad you guys are doing it even if I was surprised! The barrier to entry for playing around with public datasets (or building an app with them) is still way too high for most people. The easier the data is to access and understand, the better it will be for everyone.
@cjoh Ahh, I misunderstood. Anyway, neat.
Wow, look at the announcements I miss when I'm on vacation! I too have been scheming an almost identical website, but focused on the municipal level.