Recent Posts tagged 'data'

The Physics of the Corporate Universe

Today we're launching 6° of Corporations, a new micro-site that provides some insight into the complicated area of corporate identity. It may sound trivial, but uniquely identifying a corporate entity is not easy. For federal contracting data (as in USASpending.gov), DUNS numbers are used to (supposedly) uniquely identify a contractor. However, there are problems not only in how DUNS numbers are issued and maintained, but also in how agencies use them. To help illustrate this, we’ve created a visualization that shows the relationship between company names and company DUNS numbers in USASpending.gov.

Sarah's Inbox: The Agony and the .tgz

Many of you have probably already seen that earlier today we stood up a copy of the Elena's Inbox code for the Sarah Palin email collection. You can find the site here. I think that by most reasonable standards, Sarah Palin is currently a less newsworthy figure than Justice Kagan was at the time of her confirmation. But there's no question that many people find her fascinating, and folks seem to really enjoy having this sort of interface available -- the response has been overwhelmingly positive, even in spite of its horrifying Gmail 1.0 look (for what it's worth, Sunlight's design team deserves absolutely none of the blame for this one!).

It's worth taking a moment to reflect on what it took to get this site online. The state of Alaska released Governor Palin's email records on paper. News organizations had to have people on the ground to collect, scan and OCR these documents. Our thanks go out to Crivella West, msnbc.com, Mother Jones and ProPublica, whose incredibly quick and high-quality work provided us with the baseline data that powers the site.

But it wasn't yet structured data. It was easy enough to convert the PDFs into text, though this introduced some errors -- dates from the year "20Q7", for instance. Then we had to parse the text into documents, each with recipients, a subject line, and a sender. This is trickier than it might seem. Consider the following recipient list:

To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson

It's parseable... sort of. It turns out that, in this case, "Andy Anderson" should be treated as a single entity. In this dataset, portions of names are delimited by semicolons, but so are whole names. It's a bit of a mess. Sunlight staff spent the better part of Monday performing a manual merge of the detected entities, collapsing over 6,000 automatically-captured people to fewer than half that number. I won't pretend that the dataset is now spotless, but it's considerably more structured than it used to be.
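To make the ambiguity concrete, here is a minimal sketch of one possible heuristic for the recipient line above. It is not the code behind Sarah's Inbox: the function names and the rule itself (a single-word chunk is assumed to be a surname whose given name is in the next chunk) are assumptions made for illustration.

```python
import re


def normalize(name):
    """Drop tags like "(GOV)" and turn "Last, First" into "First Last"."""
    name = re.sub(r"\s*\([^)]*\)", "", name).strip()
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name


def parse_recipients(line):
    """Split a semicolon-delimited recipient line into a list of names.

    Assumed heuristic: a chunk with no space and no comma is a bare
    surname, and the chunk after it holds the given name.
    """
    body = line.split(":", 1)[1] if line.lower().startswith("to:") else line
    chunks = [c.strip() for c in body.split(";") if c.strip()]
    names = []
    i = 0
    while i < len(chunks):
        chunk = chunks[i]
        if " " not in chunk and "," not in chunk and i + 1 < len(chunks):
            # "Anderson" followed by "Andy (GOV)" becomes "Anderson, Andy (GOV)"
            chunk = f"{chunk}, {chunks[i + 1]}"
            i += 1
        names.append(normalize(chunk))
        i += 1
    return names


print(parse_recipients("To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson"))
```

A rule like this will wrongly glue together any genuine one-word recipient and its neighbor, which is one reason a manual merge pass was still needed.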

And that structure makes possible not only novel interfaces like Sarah's Inbox, but also novel analyses. Consider this graph of how often the word "McCain" appears in the emails:

total emails mentioning 'mccain' by week

Interesting, right? More substantively, consider the efforts of Andree McCloud, who's raising questions about an apparent gap in the Palin emails near the beginning of the governor's term. With the data captured, it's easy to visualize this -- here's a graph of the total email volume in the system by week, beginning with the first week of December 2006, when Palin took office:

total released email volume by week

(To be clear, I don't think you can necessarily conclude from this graph that there's anything nefarious about that period's low email volume -- there are plenty of potential explanations. Still, it's useful to be able to understand the outlier period in the larger context of the document corpus.)
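For anyone who wants to reproduce this kind of weekly tally, a rough sketch follows. It assumes you already have each email's sent date as a Python `date`; the function name is made up for this example, and weeks are taken to start on Monday. Weeks with no email are filled in as zero so that a quiet stretch shows up as a gap rather than disappearing from the series.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_counts(sent_dates):
    """Count emails per week (weeks start on Monday), including empty weeks."""
    counts = Counter(d - timedelta(days=d.weekday()) for d in sent_dates)
    if not counts:
        return {}
    week, last = min(counts), max(counts)
    series = {}
    while week <= last:
        series[week] = counts.get(week, 0)
        week += timedelta(weeks=1)
    return series


# Illustrative dates only, not taken from the real corpus.
sample = [date(2006, 12, 4), date(2006, 12, 6), date(2006, 12, 19)]
for week_start, total in weekly_counts(sample).items():
    print(week_start, total)
```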

Of course, these analyses and interfaces could be even better if Alaska had just released the files digitally. In fact, if they had, we might be able to draw some more solid conclusions: as our sysadmin Tim pointed out, message headers' often-sequential IDs could conceivably show whether there actually are missing emails from those first few weeks.
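Tim's idea can be sketched in a few lines. This is purely hypothetical: the paper release has no such IDs, and the sketch assumes each message would carry an integer ID assigned in sequence, which real mail systems do not guarantee.

```python
def find_gaps(ids):
    """Return (first_missing, last_missing) ranges absent from a run of IDs."""
    seen = sorted(set(ids))
    gaps = []
    for prev, cur in zip(seen, seen[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


# Made-up IDs for illustration.
print(find_gaps([101, 102, 105, 106, 110]))
```

A gap here would only say that an ID was not among the released messages; it would not say why.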

It's a shame that that didn't happen -- and not just because it meant my weekend was spent parsing PDFs. Releasing properly structured data ultimately allows everyone to do better work in less time. It's unfortunate that the authorities in Alaska introduced such a substantial and unnecessary roadblock.

But we at Sunlight can at least share what we've done to improve the situation. If you're interested in running your own analysis, you can find our code here, and the data to power it here (12M). At the moment it's in the form of a Django project -- if you need it in a different format, don't hesitate to ask on our mailing list. If you do something neat with it, please tell us!

Influence Data APIs

Followers of this blog are probably already aware of two of the main sites developed by our Data Commons team: TransparencyData.com and InfluenceExplorer.com. Both sites present a variety of influence-related data sets, such as campaign finance, federal lobbying, earmarks and federal spending. Influence Explorer provides easy-to-use overview information about politicians, companies, industries and prominent individuals, while Transparency Data allows users to search and download detailed records from various influence data sets.

In this blog post I want to show how easy it can be to use the public APIs for both sites to integrate influence data into your own projects. I'll walk through a couple of examples and show how to use both the RESTful API and the new Python wrapper.

The Market For Government Data Heats Up

Those interested in the business potential of government data will definitely want to check out Washingtonian's story about Bloomberg Government. It's a good introduction to what really does seem to be the D.C. media landscape's newest 800 lb. gorilla (albeit a very quiet and well-behaved one so far).

Readers of this site will probably be most intrigued by these two paragraphs:

[...] BGov subscribers, of whom there are currently fewer than 2,000 individuals, get something potentially more valuable than news. BGov’s “killer app”—the feature that sets it so far apart from its competition that prospective customers will feel compelled to buy it—is a database that lets users track how much money US government agencies spend on contracts, something no other media organization in Washington offers. Users can break down the spending by agency, company, amount, or congressional district; they can track the money over time; and with a single mouse click, they can call up news associated with the companies and the type of work they do. They can also see which contractors are giving money to elected officials.

All that information is extraordinarily hard to gather, largely because the government doesn’t store it in one place. But when it’s collected, and explained by journalists, the data has the potential to give businesses an inside track on winning government deals. It shows where spending trends are heading and thus where the next business opportunity lies.

Data quality problems aside, this is true as far as it goes -- I've seen a demo of the BGov interface, and it really is quite impressive. But in fact the data isn't that spread out. Between Sunlight's APIs, bulk data from USASpending.gov, GIS data from Census and the admittedly hard-to-scrape Regulations.gov, any startup with enough time and technical talent could replicate the majority of the site's functionality (the business intelligence data provided by Bloomberg Financial is an admittedly tougher nut to crack). That's the great thing about public sector information: it's there for the taking. Anyone can use it.

I've written about this before, and generally argued that government data is a tough thing to create a business around because there's no way to prevent competitors from undercutting you. But there's money to be made in the undercutting. Mike Bloomberg thinks it's worthwhile to bet $100 million on reselling government data. He's made some pretty good business decisions in the past. A smart startup might want to take the hint.

(Of course, nobody will be building businesses on this data if it goes offline -- please don't forget to support our work to save the data)

Announcing Clearspending -- and Why It's Important

Clearspending logo

Today we're launching Clearspending -- a site devoted to our analysis of the data behind USASpending.gov. Ellen's already written about this project over on the main foundation blog, and you should certainly check out her post. But I wanted to talk about it a little bit here, too, because this project is near & dear to my heart, having grown out of work that Kaitlin, Kevin and I did together before I stepped into the role of Labs Director.

The three of us had been working with the USASpending database for a while, and in the course of that work we began to realize some discouraging things. The data clearly had some problems. We did some research and wrote some tests to quantify those problems -- that effort turned into Clearspending. The results were unequivocal: the data was bad -- really bad. Unusably bad, in fact. As things currently stand, USASpending.gov really can't be relied upon.

You can read all about it over at the Clearspending site, and I hope you will -- in addition to an analysis that looked at millions of rows of data and found over a trillion dollars' worth of messed-up spending reports, we spent a lot of time talking to officials at all levels of the reporting chain. I don't think you're likely to find a better discussion of these systems and their problems.

And make no mistake, these systems are important.

We Don't Need a GitHub for Data

picture of Lt. Commander Data standing in front of a screen with the GitHub logo

There was an interesting exchange this past weekend between Derek Willis of the New York Times and Sunlight's own Labs Director emeritus, Clay Johnson. Clay wrote a post arguing that we need a "GitHub for data":

It's too hard to put data on the web. It’s too hard to get data off the web. We need a GitHub for data.

With a good version control system like Git or Mercurial, I can track changes, I can do rollbacks, branch and merge and most importantly, collaborate. With a web counterpart like GitHub I can see who is branching my source, what’s been done to it, they can easily contribute back and people can create issues and a wiki about the source I’ve written. To publish source to the web, I need only configure my GitHub account, and in my editor I can add a file, commit the change, and publish it to the web in a couple quick keystrokes.

[...]

Getting and integrating data into a project needs to be as easy as integrating code into a project. If I want to interface with Google Analytics with ruby, I can type gem install vigetlabs-garb and I’ve got what I need to talk to the Google Analytics API. Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?

On his own blog, Derek pushed back a bit:

[...] The biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.

[...]

What I’m saying is that the very act of what Clay describes as a hassle:

A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.

Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.

I think there's a lot to what Derek is saying. Understanding what an MSA is, or how to match Census data up against information that's been geocoded by zip code -- these are bigger challenges than figuring out how to get the Census data itself. The documentation for this stuff is difficult to find and even harder to understand. Most users are driven toward the American Factfinder tool, but if that's not up to telling you what you want, you're going to have to spend some time hunting down the appropriate FTP site and an explanation of its organization -- Clay's right that this is a pain. But it's nothing compared to the challenge of figuring out how to use the data properly. It can be daunting.

But I think there are problems with the "GitHub for data" framing that go beyond the simple fact that the problems GitHub solves aren't the biggest problems facing analysts.

Explore the House's Expenditures

We've updated our House disbursement data to include a "bioguide ID" for each row pertaining to a legislator's office. For more information on why we did that, and how you can use it, read on.

Some of you may know that the House began posting its statements of disbursements online in November of last year. You can find them at disbursements.house.gov in PDF form. We at Sunlight parsed these PDFs and published the data ourselves in a structured format, for easy searchability.

It still hasn't been easy to link this dataset up to others.
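Once each legislator row carries a bioguide ID, linking becomes an ordinary keyed join. The sketch below is only an illustration: the column names (`bioguide_id`, `amount`, `name`) and the sample rows are assumptions, not the actual schema of our disbursement files.

```python
from collections import defaultdict


def total_by_legislator(disbursements, legislators):
    """Sum disbursement amounts per legislator, joined on bioguide ID."""
    names = {row["bioguide_id"]: row["name"] for row in legislators}
    totals = defaultdict(float)
    for row in disbursements:
        bioguide_id = row.get("bioguide_id")
        if not bioguide_id:
            continue  # assumed: rows for non-legislator offices have no ID
        # Fall back to the raw ID when the other dataset has no match.
        totals[names.get(bioguide_id, bioguide_id)] += float(row["amount"])
    return dict(totals)


# Made-up rows for illustration; real files would be read with csv.DictReader.
disbursements = [
    {"bioguide_id": "X000001", "amount": "100.00"},
    {"bioguide_id": "X000001", "amount": "50.50"},
    {"bioguide_id": "", "amount": "999.00"},
    {"bioguide_id": "X000002", "amount": "25.00"},
]
legislators = [{"bioguide_id": "X000001", "name": "Example Member"}]
print(total_by_legislator(disbursements, legislators))
```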

Government Data Sets - Managing Expectations



US Open Government plans were released today. As part of this process, federal agencies are beginning to release data sets publicly in ways they never have before. Some substantial and thought-provoking blog posts over the last few weeks have discussed how government can do open data well.

There are significant cultural and social sticking points that have yet to be addressed in releasing data openly. A discussion with a colleague from NASA last week confirmed how far away most agencies are from the luxury of considering the innovative ideas for data set management available to them. Here's why:

Open Data We're Thankful For

This is a little late, but late's better than never for giving thanks. And this year, we've got a lot to be thankful for. Open Data in Open Government is making great strides. The Vice President is talking data quality in government on the Daily Show. ABC News along with Recovery.gov's controversy have brought government data into prime time. It's been a long time since transparency like this has seen this kind of attention.

At this time of Thanksgiving here in the United States I wanted to give thanks for the new and changing government datasets that we have now. Some are truly amazing.

Recovery.gov's Systemic Failure

The new Recovery.gov -- which we've written about and even nearly bid on -- has certainly taken the government huge steps forward in terms of disclosing information, but it is not without controversy. The press is questioning the program, pointing to wasteful spending or bad data. The White House fired back with a "reality check" (their words) saying that few of the reports have gone through the "extensive three-week review" and that the data might be particularly misleading at this point.
