Recent Posts by Tom
Shouldn't Robots Be Doing My Taxes By Now?
- Written by
- Tom Lee
- Date
- 04/17/2012 10:39 a.m.
It's Tax Day, and if you're a software developer, I'll bet you find it as mystifying as I do. Not the actual tax preparation (mine are still pleasantly straightforward, I'm happy to say), but the general awfulness of the experience. Why am I responsible for collecting PDFs (or worse, paper) from a half-dozen institutions, then manually reentering that data? Why am I paying a vendor $50 for what amounts to some unit tests and an electronic transaction or two?
It makes no sense. Government uses technology for a lot of things, and some of those things are very hard [insert requisite reference to the Apollo Program here]. But filling out forms is not a hard thing. In fact, it's one of the problems that web technology has tackled first and most comprehensively. The first thing you learn in most web frameworks is how to make forms! It's hard to think of any other part of the government's mission that affects so many people negatively and could so easily and obviously be improved by better technology.
The IRS is trying to make progress on this score, of course. E-Filing has been with us since 1986. And they seem excited about the new version of their IRS2Go mobile app. But why on earth would I want a mobile app to help me find the IRS's YouTube channel?
Here's a better idea: instead of assuming I want to learn more about how to do my taxes, why not make it so that I can afford to know less about the process? Five minutes in a text editor tells me that my W-2 can be represented in less than 300 bytes -- a fraction of a QR code's capacity. How about promulgating some data standards that would make it easier for me to digitize all those 1099-INTs saying that I earned thirty cents on a checking account? Surely TurboTax or H&R Block would be willing to create some mobile apps that let me input my information by scanning a matrix barcode with my phone.
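To make the size claim concrete, here is a minimal sketch of a W-2 squeezed into compact JSON. The field names, the short-key scheme, and every value are made up for illustration; this is not an IRS format, and a real record might carry more fields (addresses, for instance).

```python
import json

# Hypothetical W-2 record. Keys and values are illustrative assumptions,
# not an official schema: "b1" stands in for box 1 (wages), and so on.
w2 = {
    "yr": 2011,
    "ein": "12-3456789",
    "emp": "Example Corp",
    "ssn": "000-00-0000",
    "nm": "Jane Q. Public",
    "b1": 52000.0,   # wages, tips, other compensation
    "b2": 6400.0,    # federal income tax withheld
    "b3": 52000.0,   # Social Security wages
    "b4": 2184.0,    # Social Security tax withheld
    "b5": 52000.0,   # Medicare wages
    "b6": 754.0,     # Medicare tax withheld
    "st": "DC",
    "b16": 52000.0,  # state wages
    "b17": 3100.0,   # state income tax withheld
}


def encode_w2(record):
    """Serialize the record as UTF-8 JSON with no optional whitespace."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


payload = encode_w2(w2)
print(len(payload), "bytes")
```

Even without a purpose-built binary encoding, a record like this one stays under the 300-byte mark.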
Better yet: since the agency is already receiving that data from all those financial institutions through a separate stream, how about organizing the data for me and simply letting me sign off on my automatically-generated return? I suspect that a lot of people would like that, given that the alternative is spending a spring day doing paperwork.
Naturally, this is not an original idea. As you'll see in these fine pieces from United Republic and the New York Times, many people feel that lobbying by firms like Intuit (the makers of TurboTax) has stopped efforts to make filing your taxes less unbearable.
Is this a case of malign influence peddling to prop up an industry that should be partially automated away, or is it just another example of government technology badly lagging behind that of the private sector? Whatever the case might be, here's hoping something changes soon. The fact that we're still doing our taxes this way is ridiculous.
Three Cheers for the CFPB
- Written by
- Tom Lee
- Date
- 04/09/2012 6:49 p.m.
The financial watchdog agency announced an ambitious open source policy today, and we couldn't be more pleased at the news. The CFPB's announcement post does a great job of explaining their rationale: open source makes innovation easier, lock-in harder, and delivers value to taxpayers both by keeping procurements competitive and making sure their outcomes can be broadly shared.
It wasn't too long ago that government was scared of even using open source code, much less publishing its own. Its growing embrace by agencies like the CFPB and NASA is a testament to the hard work of organizations like Open Source for America. But it's also reflective of a long-established US norm that's only now being translated into the digital age: the federal government belongs to all of us. That's why our country's publications aren't copyrighted; it should be why its code is freely licensed, too.
At any rate, it goes without saying that Sunlight loves open source technology -- it's something we believe in and enjoy using. It's great to see that the CFPB feels the same way.
Sunlight at Southby (and PyCon!)
- Written by
- Tom Lee
- Date
- 03/08/2012 3:36 p.m.
In a few short hours I and much of the rest of the internet will be descending on Austin, TX for SXSW Interactive. If you're among the folks who'll be attending, I hope you'll consider coming by one or more of the panels and events we'll be doing:
- Drew will be talking about corporate (and other) identifiers
- I'll be on a panel with Sarah Cohen and Vivek Kundra, where we'll discuss the successes and shortfalls of Gov 2.0
- And Ellen will be helping to judge SXSW Accelerator
But even if you can't make it to the panels, we hope you'll say hello -- just drop either Drew or me an email (tlee/dvogel (at) sunlightfoundation.com) or tweet at the @sunlightlabs account.
For those of you headed to California instead of Texas, note that an even bigger contingent of labs staffers is currently winging its way toward PyCon. They'll be leading our now-traditional open government code sprint, looking for folks who want to contribute to Open States and/or a new, super-secret (well, not really) community project.
Merry conference-going to all -- we're looking forward to seeing some old friends, and to making some new ones.
Don't Use Zip Codes Unless You Have To
- Written by
- Tom Lee
- Date
- 01/19/2012 11:19 a.m.
Many of us in the labs found it thrilling to watch the internet community unite around opposition to the SOPA and PIPA bills yesterday. Even more gratifying was seeing how many participating websites used our APIs to help visitors find their elected representatives. This kind of use is exactly why we built those tools, and why we'll always make them freely available to anyone who wants to make government more accessible to its citizens.
Still, I'd be lying if I said we don't occasionally wince when we see someone using our services in a less-than-ideal way. It's completely understandable, mind you: the problem of figuring out who represents a given citizen is tougher than you might think. But we hate to think that anyone is getting bad information about which office to call -- talking to the people who represent you should be simple and easy! Since this comes up with some frequency, it's probably worth talking about the nature of these problems and how to avoid them.
TL;DR: Looking up congressional districts by zip code is inherently problematic. Our latitude/longitude-based API methods are much more accurate, and should be used whenever possible.
The first complication is probably obvious: zip codes and congressional districts aren't the same thing. A zip code can span more than one district (or even more than one state!), so if you want to support zip lookups for your users, you'll have to support cases where more than one matching district is returned. Our API accounts for this, but it's important that your code do so, too. We err on the side of returning inclusive results when a zip might belong to multiple congressional districts.
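Handling the multiple-match case might look something like the sketch below. The lookup table is a stand-in with placeholder zips and districts, not real data and not the shape of our actual API responses; the point is only that calling code should expect a list rather than a single answer.

```python
# Hypothetical zip-to-district table. Zips, states, and district numbers
# are placeholders invented for this sketch.
ZIP_TO_DISTRICTS = {
    "00001": [("XX", 1)],
    "00002": [("XX", 1), ("XX", 2)],  # one zip, two districts
    "00003": [("XX", 3), ("YY", 1)],  # one zip, two states
}


def districts_for_zip(zip_code):
    """Return every (state, district) pair that may contain the zip."""
    return list(ZIP_TO_DISTRICTS.get(zip_code, []))


def describe(zip_code):
    """Summarize a lookup, making ambiguity explicit instead of guessing."""
    matches = districts_for_zip(zip_code)
    if not matches:
        return "no match"
    if len(matches) == 1:
        state, number = matches[0]
        return f"{state}-{number}"
    options = ", ".join(f"{state}-{number}" for state, number in matches)
    return f"ambiguous: {options}"


print(describe("00002"))
```

When the result is ambiguous, the safest move is usually to show every candidate or ask the user for a full address, rather than silently picking the first entry.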
Unfortunately, things are actually more complicated than that. Most people don't realize it, but zip codes describe postal delivery routes -- the actual routes that mail carriers travel -- not geographically bounded areas. Zip codes are lines, in other words, while congressional districts are polygons. This means that mapping zips to congressional districts is an inherently imperfect process. The government uses something called a zip code tabulation area (ZCTA) to approximate the geographic footprint of a given zip as a polygon, and this is what we use to map zip codes to congressional districts. But it really is just an approximation -- it's far from perfect.
It's much better to skip the zip code step entirely and simply look up your location against the congressional district shapefiles published by the Census Bureau using a precise geographic coordinate pair instead of a hazy, vague zip code. Thanks to the Chicago Tribune News App Team's excellent Boundary Service project, we offer exactly this capability. If you can, we strongly encourage you to get a precise latitude/longitude pair from your users (either by geolocating them or geocoding their full address), then use it to determine their representatives.
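For a rough sense of what a coordinate-based lookup involves, here is a minimal point-in-polygon sketch using the standard ray-casting test. The two square "districts" are invented, the function names are hypothetical, and the code treats longitude and latitude as flat x/y values; it is not how the Boundary Service or our API is implemented, and real shapefiles call for a proper GIS library.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test. `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Made-up districts: two adjacent squares, purely for illustration.
DISTRICTS = {
    "district-A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "district-B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def district_for(lon, lat):
    """Return the first district containing the point, or None."""
    for name, polygon in DISTRICTS.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


print(district_for(1.0, 1.0))
```

Unlike a zip lookup, a single coordinate pair lands in at most one district (setting aside points exactly on a boundary), which is why this approach avoids the ambiguity described above.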
"But what about house.gov's ZIP+4 congressional lookup tool?" I hear you asking. It's true, many House offices use this tool to determine who your representative is (and whether you're allowed to email them). Unfortunately, just because this tool is on an official site doesn't mean it's perfect. Here in the Labs, Kaitlin (who lives in Maryland) can't write her representative because the ZIP+4 tool gives incorrect results. Besides, not that many people know their full nine-digit ZIP+4 code.
So if you can, use latitude/longitude pairs. If you can't, and have to depend on zips, we'll supply results that are very, very good -- but not as good as real coordinates would allow.
Broadcasters' Public Files Should Be Published Online (and it's absurd that we're even having this conversation)
- Written by
- Tom Lee
- Date
- 01/17/2012 12:11 p.m.
Luigi passed along a couple of links to a great/infuriating On the Media segment about the new rules the FCC is considering related to the online disclosure of political ad purchases.
To run through the issue quickly: every broadcast station is required to keep a "public file" of paper records related to campaign ad purchases. These records show basic information about how an ad was purchased, who bought it and when it aired. As the name implies, the file is available for public inspection, but only if you show up at the station and ask for it.
The FCC has proposed a rule that would require the public file to be posted online. We feel that this is an obvious and overdue step, and have submitted comments to the rulemaking saying as much. After all, it's 2012--it's absurd to claim that information is "public" if it isn't also online. And this information is particularly important: with Citizens United enabling a new flood of money into our political system--with less accountability!--keeping track of the ways in which wealth is deployed to move political opinion is more important than ever. The public file is a vital source of this kind of information.
The first OTM segment, which features Steven Waldman, does a good job of explaining all of this. The second one mostly just makes your blood boil. In it, Jack Goodman, a lobbyist for the National Association of Broadcasters, makes the case that posting the public file online would represent an onerous burden on broadcast stations.
Clearly, this is nonsense. As Waldman notes, Goodman is claiming that his would be "the first industry to use the internet to become less efficient." I've seen what the public file looks like. Yeah, there's a bunch of stuff in there, but obviously not too much to fax to the FCC once a day (or, preferably, enter into a modern electronic record-keeping system--perhaps one supplied by the FCC--instead of continuing to record everything on paper like it's 1970).
But forget for a moment how ridiculous Goodman's argument is. Consider how outrageous it is that he's even making it. This is one of the underappreciated pathologies that lobbying produces. If you're an organization like the NAB and you have a staff lobbyist, whenever an issue comes along--however minor--your lobbyist can be counted on to make a fuss about it. That's what they're paid to do, right? Here we have a disclosure burden that is basically the bureaucratic equivalent of your office manager announcing that expense reports have to be filed using a webform. Yet for some reason we're now having a national conversation about it.
It's absolutely dumbfounding to have an effort to make money in politics more transparent weighed against someone not wanting to use the fax machine. And yet here we are. That's the magic of the lobbying industry.
The FEC's New Mobile Site Could Use Some Work
- Written by
- Tom Lee
- Date
- 01/03/2012 5:10 p.m.
Last Friday the Federal Election Commission announced the launch of a new mobile interface. You should try it for yourself at http://fec.gov/mobile/. The site declares itself to be a beta, which I suspect you'll agree is something of an understatement.
Let's call a spade a spade: there's no use pretending this is good. To begin with, there are obvious superficial problems: graphs lack units, graphics have been resized in a lossy way, and the damn thing doesn't work on most Android devices.
Worse, there are substantive errors. Look at Herman Cain's cash on hand. Why are debts listed as a share of positive assets? Look at the Bachmann campaign's receipts. Why is "total contributions"--which should reflect the entire pie--just a slice? (It's not 50% because other slices seem to have incorrectly counted overlap, too.) Why don't any of the line items below the graphs reflect the fact that some are components of others?
We asked the FEC for comment, but so far they've declined. Once the powers that be over there have a closer look, I'm confident they'll agree that the mobile site is a mess.
It's hard to know what to say about all of this. Part of Sunlight's mission is to encourage government agencies to embrace technology more fully. We don't want to send mixed messages by jumping down their throats when they actually try to do so. Sure, we gave FAPIIS a hard time, but that was because the site's creators were obviously and deliberately undermining the idea of public oversight. By contrast, I don't think anyone who worked on the FEC Mobile site intended to do a bad job.
And of course there's a fundamental question. Obviously the bits that are relaying incorrect information are a problem. But assuming those get fixed, is a half-hearted attempt like this better than nothing? I suppose there might be some poor, twisted soul who will enjoy listening to FEC meeting audio while they're at the gym (though frankly, if such a person existed I suspect they'd already be working here). But as a general matter it's difficult to imagine anyone needing a mobile interface to a set of campaign finance data that's as narrowly conceived as this one.
To their credit, it doesn't seem as if this mobile interface was created at the expense of the organization's much more important responsibility to publish data--a mission that, by and large, the FEC fulfills ably and with steadily increasing sophistication. There's always room for improvement, but the truly pressing needs, like reliable identifiers for contributors and meaningful enforcement of campaign finance law, are beyond the reach of the organization's technical staff.
Still, it's a bit amazing to see obviously wrong numbers attached to a product that Chairperson Bauerly has been quoted as endorsing appreciatively. Among those of us concerned about America's campaign finance system and the effect it has on our democracy, there is a sense that the FEC's leadership does not take its mission particularly seriously. The release of shoddy work like this mobile site does little to dispel that impression.
Don't Forget: Our Open House is Tomorrow!
- Written by
- Tom Lee
- Date
- 10/24/2011 12:26 p.m.
A gentle reminder: our open house is tomorrow starting at 6, and we'd love to see as many of you here as can make it. Beer has been ordered, candy is being acquired, and plans are afoot for a Kinect-powered haunted painting. In short: it's going to be great. RSVP, why doncha?
Save the Date: Labs Open House October 25
- Written by
- Tom Lee
- Date
- 10/07/2011 2:06 p.m.
Jeremy mentioned it in this week's labs update, but it's worth broadcasting it more loudly: we're having another Sunlight Labs open house! It's been about a year since the last time we did this. We had a great time with you all back then, and are looking forward to doing it again.
So! Please mark your calendars: we'll be opening our doors on Tuesday, October 25 at 6pm. Expect drinks, games, technology chit-chat and more than a little Halloween-themed nonsense.
If you think you can make it, do us a favor and RSVP here. We're looking forward to seeing you there!
Announcing Superfastmatch
- Written by
- Tom Lee
- Date
- 09/13/2011 10:42 a.m.
Today I'm pleased to announce that the Superfastmatch project is open-source and ready for use. I’m excited to be posting this—I’ve been waiting to do so for a while! I think SFM is really, really cool—and I think you’ll agree once I tell you why. But first, a little bit of backstory.
We first became aware of the technology behind SFM when Churnalism launched. Created by the Media Standards Trust, Churnalism is an ingenious effort to detect when UK journalists copy-and-paste press releases into their published stories. It’s a great project, but we were even more excited by the technology behind it. Finding overlap between documents in huge corpora is not as simple a problem as you might think--it's tempting to assume that diff will manage the job, but in truth that tool is unsuitable for most types of documents.
The basic algorithmic challenge is the same one faced by those working on systems to detect academic plagiarism--a rich and evolving field in its own right. But surprisingly little of that technology is freely available.
Sunlight reached out to MST and was ultimately able to provide a grant that allowed them to open-source their code. Even better: they've been improving it. A mostly-Python implementation that needed hefty hardware is now a compiled solution that runs blazingly fast on commodity hardware (we’ve also successfully run it on vanilla EC2 instances--see the README for details).
Each instance of the system is an HTTP server. Users load documents by POSTing their text to a RESTful interface. As each document is processed, it’s normalized and split into substrings, which are hashed into tokens. After you’ve loaded your documents, you run an association task, which compares each document's collection of tokens against those of every other document. Where there's overlap, contiguous chunks of text are assembled, and you can begin to inspect the parts that might be borrowed from one another. (The actual mechanics of the system are considerably more complex than this explanation, but the preceding should give you a rough idea of how things work.)
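The normalize, split, hash, and compare steps can be sketched in a few lines of Python. To be clear, this is a toy illustration of the general idea and not Superfastmatch's code: the five-word window, the CRC32 hash, and all of the function names are assumptions chosen for brevity.

```python
import re
import zlib

WINDOW = 5  # words per substring; an arbitrary choice for this sketch


def normalize(text):
    """Lowercase the text and reduce it to a list of alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokens(text, window=WINDOW):
    """Hash every run of `window` consecutive words into an integer token."""
    words = normalize(text)
    return {
        zlib.crc32(" ".join(words[i:i + window]).encode("utf-8"))
        for i in range(len(words) - window + 1)
    }


def overlap(doc_a, doc_b, window=WINDOW):
    """Return the set of tokens that appear in both documents."""
    return tokens(doc_a, window) & tokens(doc_b, window)


release = "The quick brown fox jumps over the lazy dog near the river bank."
story = "Yesterday the quick brown fox jumps over a fence."
shared = overlap(release, story)
print(len(shared), "shared tokens")
```

A real system would also need to map shared tokens back to their positions so that contiguous matching passages can be stitched together, which this sketch leaves out.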
There's a demo at scripts/gutenberg.sh that loads the Bible, the Koran and ten classic novels from Project Gutenberg into the system, then finds every bit of overlap between them (it takes about 45 seconds from start to finish on my three-year-old laptop).
Sunlight's particular interest is in pairing this technology with data from our Open States Project in order to detect when legislation is migrating between statehouses or from interest groups and into law. But we hope and expect that SFM's uses will extend well beyond our mission--the applications of this technology seem sure to surprise us.
The project remains under very active development. We expect a bugfix related to very large datasets to be merged into the main branch in a week or two, for instance. But Sunlight and MST are both anxious to see developers begin to acquaint themselves with Superfastmatch. And of course we're also hopeful that others might be inspired to contribute back to it. Providing the system's output as JSON, for example, is a long-planned feature that would be easy to implement and of considerable value.
For now, though, please have a look at the project repo and start thinking about what SFM might make possible for you. You don't need to look for a needle in a haystack anymore--you just need a few good haystacks.
Data Visualization Fellowship
- Written by
- Tom Lee
- Date
- 08/09/2011 10:17 a.m.
We've got a new job listing up, and I hope you'll have a look. If you do, you'll see that we're doing something new. This position came about because we decided that we wanted to create more and better data visualizations -- they're interesting, people like them, and they're a great opportunity to experiment with new technologies.
But as we started thinking through how to staff this position, we realized we didn't really want someone who was an expert in d3, or processing.js, or any other presentation technology. Don't get me wrong: finding someone with those skills for this position would be great. But we already have a bunch of talented front-end developers and designers. I think we can present answers in beautiful and compelling ways; what I could really use are better questions.
So, like I said, we're looking for something a little different. The listing says "quantitative social scientist," but you could easily substitute the "data scientist" buzzword that the tech industry seems to be embracing. Whatever you call it, what we're looking for boils down to this: we need someone with the ability to understand the questions that can reasonably be asked of our data; someone who knows the questions that people have asked of the data in the past; and who is able to find some decent answers of her own. At Sunlight, those questions are likely to be about the U.S. government and the entities that try to influence it. Once you've got an interesting answer, we'll throw all the Javascript and CSS at it that you could ever want.
So please have a look, and if you know folks who you think would be a good fit, pass the link along to them. And if you yourself are thinking about applying, please don't be scared off by the specific requirements -- they describe what we think an ideal candidate would be, but we know that we're likely to find some surprises. This fellowship is a bit of an experiment for us, but I'm excited about the possibilities it represents.