Adobe is Bad for Open Government


So next week, Adobe's having a conference here to tell Federal employees why they ought to be using "Adobe PDF, and Adobe® Flash® technology" to make government more open. They've spent what seems to be millions of dollars wrapping buses in DC with Adobe marketing materials, all designed to tell us how necessary Adobe products are to Obama's Open Government Initiative. They've even got a beautiful website set up to tout the government's use of Flash and PDF, and the conference itself promises to talk about how Government should use ubiquitous and secure technologies to make government more open and interactive.

Here at the Sunlight Foundation, we spend a lot of time with Adobe's products-- mainly trying to reverse the damage that these technologies create when government discloses information. The PDF file format, for instance, isn't particularly easy to parse. As ubiquitous as PDF files are, they're often non-parsable by software, unfindable by search engines, and unreliable when text is extracted.
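To make the "unreliable when text is extracted" point concrete, here is a minimal sketch in Python. It does not read a real PDF; it assumes that text pulled out of a PDF table arrives as space-aligned lines, which is a common but not universal outcome. The rows, the column widths and every function name are invented for illustration.

```python
import csv
import io

# Hypothetical rows, invented for illustration: (recipient, state, amount).
ROWS = [
    ("Harbor Dredging Project", "WA", "1,200,000"),
    ("Rural Clinic Grant", "", "350,000"),  # the state cell is blank
]

def as_extracted_text(rows):
    """Mimic text pulled from a PDF table: columns survive only as runs of spaces."""
    return "\n".join(
        f"{name:<28}{state:<6}{amount:>12}" for name, state, amount in rows
    )

def naive_parse(text):
    """Split each line on whitespace, the way a quick scraper might."""
    return [line.split() for line in text.splitlines()]

def as_csv(rows):
    """Publish the same rows as CSV instead."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def parse_csv(text):
    """Read the CSV back into tuples."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]

# The whitespace split breaks multi-word names apart and silently drops the
# blank cell, so the two rows come back with different field counts.
extracted = naive_parse(as_extracted_text(ROWS))

# The CSV round trip returns exactly the rows that were published.
roundtrip = parse_csv(as_csv(ROWS))
```

A real parser would do better than a bare whitespace split, but it would have to guess at column boundaries for each document layout, and that guessing is the overhead described above.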

Take, for instance, H.R. 3200-- otherwise known as "America's Affordable Health Choices Act of 2009", a 1017 page healthcare bill from Congress. Because it is primarily published in PDF, we've got to build a special parser for that bill in order to represent it programmatically. Or Carl Malamud's IRS filings for 527 (stealth PAC) organizations: gigabytes of PDF files, all released by government. Government releasing data in PDF tends to be catastrophic for Open Government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it. When a government agency publishes its data and documents as PDFs, it makes us Open Government advocates and developers cringe, tear our hair out, and swear a little (just a little). Most earmark requests by members of Congress are published as PDF files of scanned letters, leading the Sunlight Foundation and others to write custom parsers for each letter.

Yet, for some reason, Adobe feels they're essential to the new administration's mission of transparent and open government. I, on the other hand, feel like picketing the event they're having next week to sell their wares (hey hey! ho ho! your-binary-low-parsable-formats-for-government-data has got to go!) because in fact, they're quite the opposite. Here at Sunlight we want the government to STOP publishing bills and data in PDFs and Flash and start publishing them in open, machine readable formats like XML and XSLT. What's most frustrating is that Government seems to transform documents that are in XML into PDF to release them to the public, thinking that that's a good thing for citizens. Government: We can turn XML into PDFs. We can't turn PDFs into XML.
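The one-way nature of that conversion can be sketched in a few lines of Python. The bill fragment below is hypothetical: the bill number, the element names and the helper functions are invented for this example and are not the schema Congress actually uses. The rendering step produces plain text as a stand-in for a print layout such as PDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill fragment; element names are invented for this sketch.
BILL_XML = """
<bill number="H.R. 0000">
  <section id="1">
    <heading>Short title</heading>
    <text>This Act may be cited as the Example Act.</text>
  </section>
  <section id="2">
    <heading>Definitions</heading>
    <text>In this Act, the term "agency" has its usual meaning.</text>
  </section>
</bill>
"""

def parse_sections(xml_source):
    """Read every <section> into a plain dictionary."""
    root = ET.fromstring(xml_source.strip())
    return [
        {
            "id": section.get("id"),
            "heading": section.findtext("heading"),
            "text": section.findtext("text"),
        }
        for section in root.iter("section")
    ]

def render_for_print(sections):
    """Flatten the structure into page-style text, a stand-in for a PDF layout."""
    lines = []
    for section in sections:
        lines.append(f"SEC. {section['id']}. {section['heading'].upper()}")
        lines.append(f"    {section['text']}")
    return "\n".join(lines)

sections = parse_sections(BILL_XML)
printable = render_for_print(sections)
```

Going from `sections` to `printable` is a loop. Going the other way means guessing which lines are headings and where each section ends, because the markup that said so is gone from the rendered text.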

Flash isn't off the hook either. Government has spent lots of time and money developing Flash tools to allow citizens to view charts and graphs online, and while we're happy the government is interested in allowing citizens to do this, Government's primary method of disclosure should not be these visualizations, but rather publishing the APIs and datasets that allow citizens to make their own. Only after those things are completed to the fullest extent possible should government be working on its own visualizations. While Adobe may say in their open government whitepaper:

"Since the advent of the web, an entire infrastructure has evolved to enable public access to information. Such technologies include HTML, Adobe PDF, and Adobe® Flash® technology."

This is nonsense. The fact is, sticking to open, standards-based technologies like HTML, XML, JSON and others is far more important and useful in getting your information out to the public than the proprietary formats of Adobe. Here's a hint-- if the data format has an ® by its name, it probably isn't great for transparency or open data.
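As a rough illustration of what an open format buys the public, the Python sketch below takes a small JSON dataset and turns it into a citizen-made chart using only the standard library. The dataset, its field names and the helper functions are all hypothetical; a real agency feed would define its own fields.

```python
import json
from collections import defaultdict

# Hypothetical dataset, invented for illustration.
DATASET_JSON = """
[
  {"agency": "Agency A", "program": "Roads", "amount": 120},
  {"agency": "Agency A", "program": "Bridges", "amount": 80},
  {"agency": "Agency B", "program": "Clinics", "amount": 50}
]
"""

def totals_by_agency(records):
    """Sum the amount field for each agency."""
    totals = defaultdict(int)
    for record in records:
        totals[record["agency"]] += record["amount"]
    return dict(totals)

def text_chart(totals, unit=10):
    """Draw one '#' per `unit` of spending, largest total first."""
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(
        f"{name:<10}{'#' * (amount // unit)} {amount}" for name, amount in ordered
    )

totals = totals_by_agency(json.loads(DATASET_JSON))
chart = text_chart(totals)
print(chart)
```

Nothing here depends on a particular vendor's viewer: anyone with the data can group it, chart it or combine it with another dataset in whatever way answers their own question.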

So don't get me wrong-- I appreciate just like the next guy that I can download a nice PDF file of an IRS form, print it out, and send it in. I think that members of Congress publishing their "Dear Colleague" letters with accuracy is great and important, and I think the pie charts on the IT dashboard are really neat. But when it comes down to it, these technologies aren't helping to fully open our government. They have their place, but in terms of transparency and openness, I'm afraid they do more harm than good. Relying on them only yields frustration from the people who use the data government publishes the most, and they should be considered a bell or a whistle on top of the foundation that an agency should lay to be fully transparent: putting data online, obeying the 8 principles of Open Data to the fullest extent.

Update (3:10pm): At the strong urging of our Policy Director, I'll add this caveat: any time Government decides to release data to the public, we're glad that government has taken a step forward. But when it comes to data and large documents like bills, government should strongly consider open, machine readable, parsable alternatives to the PDF file format. There are plenty, and we're happy to help you find them.

Update (3:20pm): PJ Doland has the right answer. PDF by itself is insufficient. So is Flash. But what makes PDF in particular bad is that more often than not, you can turn XML into a human readable PDF. But you can't turn PDF into a machine readable XML/JSON/whatever file.

Discussion

  1. Colin Dean 10/28/2009 1:41 p.m. (permalink)

    Amen! PDF may be available everywhere to everything for free these days, but it still requires extra software not always installed on computers and isn't easily readable by mobile phones. Supplying the XML and an XSLT to make it human readable would be a much better option, require less bandwidth, and would use a much more open standard--XML--than PDF.

  2. Ben Welsh 10/28/2009 1:57 p.m. (permalink)

    I agree. And if you'll forgive the plug, I actually blogged a local version of this rant earlier in the week regarding how our public health agency in Los Angeles publishes swine flu vaccination locations.

    http://www.palewire.com/posts/2009/10/24/la-county-public-health-should-just-say-no-pdfs/

  3. Adrian Holovaty 10/28/2009 2:32 p.m. (permalink)

    Thanks for writing this, Clay. For Adobe to promote itself as useful to the open-data movement is absurd. If I had a dollar for each hour I've spent trying to finagle raw data out of PDFs, I could afford Adobe Photoshop.

  4. Hugh 10/28/2009 2:42 p.m. (permalink)

    "Government releasing data in PDF tends to be catastrophic for Open Government advocates, journalists and our readers because of the amount of overhead it takes to get data out of it."

    the only thing more catastrophic than govts releasing docs as pdf is govts not releasing them

  5. Ryan 10/28/2009 2:48 p.m. (permalink)

    Great column - I've been thinking this ever since those signs came up

  6. Hugh 10/28/2009 2:51 p.m. (permalink)

    "Here at Sunlight we want the government to STOP publishing bills, documents, and data in PDFs and Flash and start publish them in open, machine readable formats like XML." Whoa. Really? You want them to stop publishing? Careful what you wish for. I know what you mean of course but with the way this request is stated you just shot over most people's heads. We need to be more specific and ask for a common document format that is XML-based like OpenDocument or equiv. Many important documents like contracts, bills and letters are not well-structured data as such.

  7. Clay Johnson 10/28/2009 2:57 p.m. (permalink)

    I think you missed the last half of the sentence, which is "to start publishing them in open, machine readable formats like XML"

  8. Hugh 10/28/2009 3:03 p.m. (permalink)

    "....open, machine readable formats like XML" More specifically, please?

  9. len 10/28/2009 3:04 p.m. (permalink)

    We lost this fight with Adobe and Steve Zilles at a NIST conference in the 90s when Adobe was trying very hard to kill SGML. This isn't new news. It is a well-known but unadmitted problem in the Beltway. This is also one of the side effects of the web pioneers' zealotry with respect to HTML. For some reason, I just can't care anymore, so good luck.

  10. Hugh 10/28/2009 3:10 p.m. (permalink)

    I love talking about XML, but I think we need to stop talking about XML. Ever noticed that glazed look it elicits? We're taking on a heavily entrenched, familiar for-profit proprietary turn-key corporate offering. We need to ask for a specific, real-world alternative, not a technology.

  11. PJ Doland 10/28/2009 3:12 p.m. (permalink)

    The problem isn't with the PDF file format. PDF is a near-ideal format for long documents that are most likely to be printed and read by humans.

    What you should be promoting is cross-media publishing workflows, so the document can be accessed in multiple formats that are all generated from the same canonical source material.

    It shouldn't be a choice between PDF, XML, JSON, or HTML. It should be all of the above.

  12. Sam Caldwell 10/28/2009 3:16 p.m. (permalink)

    We need XML-based standards for all government information. Publishing the standards (XML schema) and allowing web services to distribute the data will make FOIA more affordable for taxpayers, as a single data source of public information could be accessed by multiple requesting entities without the need for human intervention.

  13. Peter Krenesky 10/28/2009 3:27 p.m. (permalink)

    @PJ Doland

    check the screenshot. The XML document is the canonical source material. They translated it without providing the original XML file, and that is the problem.

  14. PJ Doland 10/28/2009 3:34 p.m. (permalink)

    @Peter Krenesky I saw that before posting. I actually don't think we're in disagreement. What I was responding to was more the statement in the post that "Here at Sunlight we want the government to STOP publishing bills, and data in PDFs and Flash and start publish them in open, machine readable formats like XML and XSLT." What I'm saying is that there's nothing wrong with providing formats that are optimized for human (and not machine) readability, provided there is also a machine-readable option. The problem isn't with Adobe. They support some really good XML integration for cross-media publishing from within InDesign. The problem is with the document production workflows that are currently in use.

  15. Kevin Merritt 10/28/2009 3:35 p.m. (permalink)

    I echo PJ Doland's comment. That's why every dataset on the Socrata social data platform can be downloaded in XML, JSON, PDF, XLS or CSV format. Oh, and you can get a REST API for any dataset too. Pick your flavor and move on with the real business you're trying to get done.

  16. Peter Krenesky 10/28/2009 3:41 p.m. (permalink)

    Interesting timing of the adobe event. Government Open Source Conference (GOSCON) is the following day, same venue. Picketing sounds like a good idea: "Come back tomorrow to learn about real open technologies!"

  17. Peter Krenesky 10/28/2009 3:47 p.m. (permalink)

    @PJ Doland

    Oh, your original comment makes more sense put that way. I took the human readable format for granted, assuming that easily parsed data would be put into such a format by a 3rd party anyway, e.g. opencongress.org.

  18. Michael Friis 10/28/2009 6:43 p.m. (permalink)

    You guys have it easy. Here in Denmark Parliament publishes many ancillary documents as PNGs: http://www.ft.dk/dokumenter/tingdok.aspx?/samling/20091/spoergsmaal/s40/svar/651876/741801/index.htm

    (Cue Monty Python and The Four Yorkshiremen: http://www.youtube.com/watch?v=13JK5kChbRw)

  19. Tom MacWright 10/28/2009 6:56 p.m. (permalink)

    Amen, sir. I spent several days parsing a gigantic PDF just seeing how badly the format can obscure data. I think there are reasons, other than evil, it has persisted, though.

    1. HTML does not communicate 'consistent presentation', because of IE, browser widths, weird print output, and other reasons.
    2. The ability to do good forms in other formats - some would say the XForms project, has been mired in crazy standards-committee deadlock for years and obviously hasn't gained much acceptance. At the same time, Adobe just made forms work around their products, and obviously people who wanted forms appreciated this.
    3. In some cases, PDF is actually used to prevent tampering. Silly, of course, because anyone worth their weight can mess with one (in Inkscape, Acrobat, etc.), but, still, it's a common perception that PDFs can be far more read-only than any other format, and that's a big deal for big organizations.

  20. Tom MacWright 10/28/2009 6:57 p.m. (permalink)

    Apparently a subset of markdown syntax is supported:

    1. HTML does not communicate 'consistent presentation', because of IE, browser widths, weird print output, and other reasons.

    2. The ability to do good forms in other formats - some would say the XForms project, has been mired in crazy standards-committee deadlock for years and obviously hasn't gained much acceptance. At the same time, Adobe just made forms work around their products, and obviously people who wanted forms appreciated this.

    3. In some cases, PDF is actually used to prevent tampering. Silly, of course, because anyone worth their weight can mess with one (in Inkscape, Acrobat, etc.), but, still, it's a common perception that PDFs can be far more read-only than any other format, and that's a big deal for big organizations.

  21. Ian Yorston 10/28/2009 7:52 p.m. (permalink)

    Two great calls here:

    1. "We need to ask for a specific, real-world alternative, not a technology."

    2. "[a] document [that] can be accessed in multiple formats that are all generated from the same canonical source material."

    So what we have is a Server-Side problem and a Client-Side problem.

    Which, in this web-centric world, really means that we have a Client-Side problem because that's what Users see and that's what Managers & Politicians understand.

    Google Mail seems to be one of the best Clients around at the moment because it tries very hard, via Google Docs, to convert any attachment into a form that the user can view & re-purpose without opening any additional software.

    I think we need to focus on the client.

  22. James Salsman 10/28/2009 8:58 p.m. (permalink)

    Why even bother with XML? Why not just HTML?

  23. Jeff Sonstein 10/28/2009 9:02 p.m. (permalink)

    XML-derived data-formats are "specific, real-world alternative(s)" to this stuff. This need not be either a client-side or a server-side problem. This data can be published in a well-structured form, and thus can be both analyzable (by software) and readable in browsers by humans. This is not "rocket science".

    This is really just a "we want to sell software that uses our proprietary file formats" problem.

    jeffs

  24. CoCreatr 10/28/2009 9:38 p.m. (permalink)

    "Let my dataset change your mindset", said TED speaker Hans Rosling, one more advocate for open government data.

  25. Kevin Curry 10/28/2009 9:53 p.m. (permalink)

    If only for every "Printer Friendly" link there were also a "Machine Friendly" link. I empathize and your focus on Adobe is not without due cause. But it's much more than Adobe or any one company. It's more than documents and presentation media. We have to undo decades of application-centric world view that did not recognize all software applications as input-output machines. Data went in and were never intended to come out, whether by design or by circumstance. UML modeling tools are the albatrosses around my neck because they are used by thousands of enterprise architects in government. Visio is masterful at reducing useful data about enormously complex systems to nearly worthless pictures. Others export XML but co-mingle their application's data with subject matter data. Scraping malformed HTML tables that come up 15 records at a time and are obfuscated behind server-side scripts isn't a walk on the beach either. That's all changing now. Now we're all about decoupling data from proprietary and domain-specific applications and transforming it into well-formed, human AND machine readable text that can be used for any application or purpose (and by any device). In fact, the UNIX command line had it right decades ago; it was pipes. Here's a great example of how it should be: http://thomas.loc.gov/home/gpoxmlc111/ Click on any XML file in your browser and it will come up in a very human-friendly format (thanks to external stylesheets). Clay, tell us again about the story of the database made from scribblings on cocktail napkins.

  26. Preston Austin 10/28/2009 10:07 p.m. (permalink)

    First priority - offer unoptimized crappy looking baseline delivery that is commonly readable by human and quasi-readable by machines.

    Keep that in all cases, and offer optimizations on top of it. Machine readable or semantic markup is an optimization - not a baseline need. What is baseline? It's not hard to find decent candidates: importantProse.TXT formattedCrap.HTML+CSS moreFormattedCrap.RTF importantImage.JPG tabularData.CSV wellStructuredData.XML audio.MP3 video.MP4

    Give me any or all of the above, and anyone can .docx, .swf, .pdf, .msft, .appl, .adbe, and .whozit and crazy complicated government .DTD it till the cows come home and beyond on top of that. Frankly, fancy XML has been more of a barrier than a panacea for publishing in all projects I've been involved in (about 100). I agree with the multi-output single source ideal - but I'll take just the text today while that reality sorts itself out.

    Computers don't bore easily, converters are abundant, we'll be able to impute a lot of structure into legacy material, disks are cheap...let's use them. Today - everything can read this comment.

  27. Cyber 10/28/2009 10:44 p.m. (permalink)

    What is this "custom parser" foolishness? My one PDF reader can read most PDFs. Have you ever wondered how Google manages to make PDFs searchable? Turns out, they use open source software to parse them. Then they index the extracted text, and map the index to the original document. Not all that hard. In fact, why don't you just wait for Google to parse and index the documents for you?

    http://code.google.com/p/ocropus/ http://www.foolabs.com/xpdf/download.html

  28. Bill Conniff 10/28/2009 11:10 p.m. (permalink)

    I vote for xml. I never liked the government adopting Acrobat for similar reasons as the ones you point out. It was also a form of government subsidy to Adobe by generating a huge market for their pdf writer.

  29. thinman 10/28/2009 11:35 p.m. (permalink)

    I worked on an OpenGov/eGov't gig where there was a big push to go paperless. Adobe tools were key in the workflow and final output. As a subcontractor, I was impressed with how we started with an end-product: something as useful, if not more so, than the paper - and could code the system to make PDFs as open, extractable, XML, and indexable. Just because an agency and/or its customers/data consumers are ignorant of the breadth and depth of a technology doesn't mean it's a failure of that technology. PDF and Flash as document delivery formats are only limited by the knowledge and imaginations of those employing them. For instance, any Flash doc can have multiple print, XML, and other formats. The production of which can be as automated as any other format. Singling out Adobe technologies seems intended to raise eyebrows, garner sympathy from like-minded folks, and raise the hackles of fanboys - all great for buzz, but a far less compelling oGov argument - but it doesn't seem like a very useful, cogent, or insightful statement in the context of open government.

  30. Clay Johnson 10/29/2009 10:38 a.m. (permalink)

    Thinman, I'd love to see an example of the technology you describe.

    Primarily my complaint isn't with the technology specifically but rather the way it is marketed. I say PDF has a place-- in publishing small documents, things where print accuracy is needed-- but in publishing datasets, Government often publishes them online in PDF form, and says their job is over. If they, instead, published them in raw text files, or heck, even .docx files, it'd be easier on the Sunlight Foundation.

    But Adobe is spending millions of dollars advertising here. Read their whitepaper. It's full of FUD, doesn't mention anything about data standards, and talks about how they should actually encrypt and watermark data to ensure its authenticity.

    That's just plain un-american.

  31. Tom MacWright 10/29/2009 12:39 p.m. (permalink)

    "What is this "custom parser" foolishness?"

    Haha, indeed. Here are a few problems:

    Yes, you can open documents with your PDF reader and read them yourself, but equating this to extracting data from those documents is foolishness. Tables in PDF documents are not tables - they are tab separated... some number of spaces separated... sometimes just positioned... sometimes images. Regardless of what they are, they are never tables, and never parsable like CSV, TSV, XML, or even Excel.

    And... why don't we wait for Google to index the documents for us? For one thing, we aren't looking for indexing (just like, in point #1, we aren't looking to read the documents), we're trying to get data. Hugely different. And... many PDF documents we're parsing are not public and are too large to try to import into Google Docs, etc.

  32. thinman 10/29/2009 1:07 p.m. (permalink)

    Clay,

    I agree, the marketing often over-simplifies. Marketing is supposed to simplify, but it often belies the wonderfully enabling complexity of underlying technologies.

    Having served in the public sector as a servant and a contractor for my entire career, I'm often frustrated with the tendency to treat document management solutions and information delivery and distribution systems as simply another print button.

    Starting with the objective of machine-readable, open data formats should be essential. All too often, it's an afterthought. "All's we need to do is put it online. Bake in the format. Lock it up, don't let users mess with my layout. Print to PDF and upload. Done."

    The company I subcontracted for may still have a whitepaper of one of the apps I designed. I'll poke around and see if I can find it. It's an Adobe AIR app that integrates with ColdFusion and LiveCycle to enable online/offline participation in government process flows. Entirely paper-based processes converted to packages of inter-dependent PDF packages.

    We looked at a lot of potential solutions, too. Acrobat, ColdFusion, Flex, AIR, and LiveCycle gave us the ability to facilitate any amount of intended openness. Unfortunately, the nascent oGov movement is rarely even considered when designing systems. But PDFs designed from the outset to be machine-readable XML are awesome. Lots of metadata, and a level of intra-document granularity that is remarkable. Down to the field level: what's it for? what category of info does it fall into? what's the character count of the stuff in the field, and what's the content that's actually in it? All in XML. Tasty. Unfortunately, my unbridled enthusiasm often exceeds budgets and sometimes the myopia of the client. Getting folks jazzed over the ability to make entire repositories of documents searchable, catalog-able, etc., down to the line item content level is a hard sell when all they often wanna do is eliminate paper waste and post some stuff online. Print to PDF. Bleh.

    Dude, what if a civic engagement enterprise wanted to facilitate an unprecedented level of citizen enlightenment and participation by aggregating data across agency systems and services into actionable information? Oh wait. That's the point, isn't it? Somebody ought to tell the marketers to begin touting this half as much as the magic print to PDF button.

    Thanks for your thoughtful response and your tireless work. You're much appreciated.

    • M

  33. The Doctor 10/29/2009 4:01 p.m. (permalink)

    I'd settle for being able to fill out some of the forms they make available as .pdf files and then save them to print out again later because the first three or four copies always seem to go missing after being submitted.

    /SF-8[0-9]?/, I'm looking at you.

  34. bill shelton 10/29/2009 9:30 p.m. (permalink)

    I agree PDF and Flash are totally inappropriate for data exchange. Personally, I'm not too crazy about XML either, having participated in some sausage making XBRL efforts and have written enough XSLT and XPATH that my eyes are almost shot - "'tis but a scratch". I now prefer the light-weight simplicity of JSON.

    With that said, I think we must agree that the data will eventually be used by a human to make important decisions that affect their life. Open Government is for "The People", right? All that data needs to be visualized and presented somehow to make sense. If I could interact with an agency's data much like a Yahoo! Finance (http://finance.yahoo.com/echarts?s=GOOG), that would be ideal.

    If Adobe is encouraging agencies to use PDF and Flash as "the solution" to President Obama's initiative, they're missing the mark and I don't think it will get very far. I could be wrong but, the white paper they've published surely seems to suggest a strategy towards a PDF/Flash solution and I wonder if it's just part of something larger they have hopes for ... ?

    User-centric development is critical; that is, we need to deliver data quickly and unobtrusively. For Open Government to be successful, we're going to need both clean data and a good way for people to interact with it.

  35. Mike Brunt 10/30/2009 12:01 a.m. (permalink)

    As is always the case, PDF and Flash have gone out and prospered because they are good technologies, not because they are autocratic and without alternatives. We are shackled to the browser, which after all of these years does not have a universally usable standard we can rely on. Being bellicose about Adobe is meaningless, just in the same way the 60's was wasted by idealists with no realistic alternative to capitalism. Let us focus efforts, if there be efforts, on actually ratifying browser standards. Until then, PDF and Flash paradigms are reliable to the people reading them, the people.

  36. chris arkenberg 10/30/2009 2:32 a.m. (permalink)

    I worked at Adobe in the Acrobat group for a number of years managing a QE team. As in most companies, there are a lot of well-meaning folks there doing interesting things but stifled by resource constraints and the needs of the primary enterprise customers.

    Setting aside Adobe's unfortunate and, frankly, dangerous strategy to get the US Gov yet more dependent on Acrobat, the main philosophical problem is that PDF is really an image format more than anything else. It's designed to present mostly textual content as digital paper. It was never architected to consider that machines might want to crawl the actual data in the document. Furthermore, over the years Adobe has added many additional document affordances such as signing, font embedding, forms, and others that require a hefty chunk of code to parse & render. Between these two factors you get a mature, legacy-filled, mostly-opaque platform that does a great job of ensuring everyone sees exactly the same document but has next to no consideration for the data within the document.

    I made these arguments to Adobe both internally and after I left but massive business units like acrobat move very slowly and cautiously. There are tremendous opportunities to bring PDF into the world of happy friendly structured data but the realities of reporting to shareholders and massive enterprise customers like the IRS limit the evolution considerably. And honestly, the platform makes a ton of money so there's not a lot of incentive to take any big format risks. Acrobat's consistency pays for innovation in many other areas of Adobe.

    FWIW, my thoughts on keeping PDF relevant: http://www.urbeingrecorded.com/news/2009/01/12/how-to-keep-pdf-relevant-flash-and-semantics/

  37. Kirk Keller 10/30/2009 1 p.m. (permalink)

    I'd just like to say that I agree with PJ Doland and Kevin Merritt. I work for a state agency. Many of our products available for public consumption are in PDF because the process by which the data is gathered assumes a traditional print product. Until that changes, PDF will be quite attractive as a final web product.

    Also consider that government has, in this case, essentially two audiences: People who aren't particularly savvy in technology. Without a good human readable product, government is inaccessible. These folks want something that requires little technical expertise on their part and renders something that they can print and look just like a traditional printed document. I believe it would be hard to argue that PDF does not accomplish that well.

    The other audience is those who need machine readable data, either for reasons of accessibility or data mining.

    I think Ian Yorston lays out two good goals that would sure help me in terms of serving these two audiences.

    I'd also be keen to follow any place where this discussion may continue in terms of providing a solution to the items Ian suggested.

  38. Dave Watts 10/30/2009 1:19 p.m. (permalink)

    As a guy who works heavily with Adobe technologies, including Flash, Acrobat, LiveCycle and ColdFusion, I was a bit surprised to find myself agreeing with about half of your argument. And I agree with your vision of the ideal ultimate outcome of open government. Unfortunately, half an argument does not an argument make.

    First, the choice for many government automation projects is whether they'll be automated at all. A huge amount of government work is driven by existing paper-based workflows. To just say "this should be XML" without providing a path to get from paper to this XML wonderland simply isn't going to happen. Government agencies often have neither the budget nor the expertise to make that happen.

    With Acrobat and LiveCycle, Adobe provides an alternative that lets these agencies work their way from a paper-based process to an automated one in a realistic manner. Organizations can take their paper forms and other documents, scan them into PDFs, and get started.

    Keep in mind that, for instance, anybody can take a paper letter, put it on a scanner, and use Acrobat to create an OCR'd PDF from it with no custom programming, etc. That's very compelling compared to a process requiring a programmer to build something - anything - to get data in a more open format, and PDFs - even with images - are easily searchable as long as they also contain corresponding text.

    Second, as far as Flash goes, HTML even with AJAX does not provide equivalent functionality. It's close, and getting closer all the time, but it's sure not there yet. Also, AJAX solutions are browser-dependent in a way that Flash solutions are not - practically speaking, I can run Flash Player in almost any browser on any computer, and my Flash application will work the same way. Adobe has been working very closely with handset manufacturers and mobile OS developers to get the latest Flash Player 10 out, and the expectations are pretty high that it'll be available in 1H 2010 on almost all new handsets! I think it's kind of amusing that you provided a screenshot from the iPhone - one of the least open platforms, where you can't install Flash Player simply because Apple sees it as competition; where you can't install other browsers AT ALL - in an argument for standards.

    ... continued in next comment ...

  39. Dave Watts 10/30/2009 1:20 p.m. (permalink)

    ... continued from previous comment ...

    Finally, and most importantly, you state

    ... while we're happy the government is interested in allowing citizens to do this, Government's primary method of disclosure should not be these visualizations, but rather publishing the APIs and datasets that allow citizens to make their own. Only after those things are completed to the fullest extent possible should government be working on its own visualizations.

    As a programmer myself, I sympathize with this argument. But this is wrong, wrong, wrong! Government should be serving regular citizens first, THEN providing APIs. Otherwise, we regular citizens who may not be programmers are beholden to unelected, unaccountable third parties to get access to the data that we own. While you guys at the Sunlight project are great and trustworthy and all that, you simply can't be expected to provide end-user interfaces to everything, and if we the citizens depend on third parties for data access, we'll get to see what those third parties want us to see, and the biases and agendas of those parties will be implicit in what we see. And, in fact, if government actually followed this advice, a lot more data simply wouldn't see the light of day.

    Dave Watts, CTO, Fig Leaf Software

  40. Joe "Floid" Kanowitz 10/30/2009 1:47 p.m. (permalink)

    I hope to find time to post further observations later, but if I don't - it's worth noting that the National Archives and Records Administration (http://www.archives.gov/) has published some guidelines concerning the use of PDF (and other formats) to make it... not-completely-machine-unreadable - or at least possible to process at all someday, with enough effort.

    It's important to recognize that many aspects of government and governance are still centered around the printed page. Legislation actually produces some of the most structured and layout-independent data (and if they do have a simple XML DTD but only publish in PDF, shame on them), but try representing a court filing with a combination of text and scanned exhibits - which may need to render identically on every screen and on paper in open court - and you can see why PDF became the least-worst thing going.

    Unfortunately, throwing even the best-structured XML (or SGML) at the problem doesn't help unless authoring and workflow software is available that "normal humans" can actually use.

    Related to this, I read Chris Arkenberg's post and link above with some chagrin - concerning structured data, our state's courts made an initial attempt at online forms using a turn-of-the-millennium Adobe product that "properly" separated forms and rendering (the sole 'viewer' for which was an ActiveX control). That system left users floating in structured data (.XFDs), but it made it convenient to save only the .XFDs, not the accompanying form - if I recall, the software was intended for 'enterprise' use and this was a rights-management 'feature'. The result might have been convenient for data-miners, but on a small scale it's no fun if you can't remember what <field> through <field32> were:

    <AMTBOX1>0</AMTBOX1>
    <AMTBOX2>0</AMTBOX2>
    <AMTBOX3>-1</AMTBOX3>
    <CLMOTHER>0</CLMOTHER>
    

    Notably, that system left non-technical users stranded after a brief transition period when the 'original system' was hosted alongside a new batch of freshly-authored PDFs. [Rather than conducting a batch conversion and creating PDFs that could accept the XFDs - perhaps that would've required a larger payment to Adobe.] At least PDF got rid of the ActiveX dependency and the single files saved by normal humans will preserve both the data and the layout.
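
    A rough sketch of the problem with the field dump above, using only Python's standard library. The wrapper element `form`, the helper names, and the labels in `FIELD_LABELS` are hypothetical; in the system described, the labels lived only in the form layout that users could not save.

```python
import xml.etree.ElementTree as ET

# Field values as saved without the accompanying form (from the dump above).
SAVED_FIELDS = """
<AMTBOX1>0</AMTBOX1>
<AMTBOX2>0</AMTBOX2>
<AMTBOX3>-1</AMTBOX3>
<CLMOTHER>0</CLMOTHER>
"""

# Hypothetical labels: stand-ins for what the missing form would have supplied.
FIELD_LABELS = {
    "AMTBOX1": "Amount claimed (box 1)",
    "CLMOTHER": "Other claim",
}

def read_fields(fragment):
    """Parse a rootless run of elements by wrapping it in a dummy root."""
    root = ET.fromstring(f"<form>{fragment}</form>")
    return {child.tag: (child.text or "") for child in root}

def describe(fields, labels):
    """Pair each value with its label, flagging fields whose meaning is lost."""
    return [(labels.get(name, f"{name} (meaning unknown)"), value)
            for name, value in fields.items()]

fields = read_fields(SAVED_FIELDS)
for label, value in describe(fields, FIELD_LABELS):
    print(f"{label}: {value}")
```

    The values parse without trouble; it is the unlabeled fields that stay opaque, which is the small-scale pain described above.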

    (to be continued, after all!)

  41. Joe "Floid" Kanowitz 10/30/2009 1:48 p.m. (permalink)

    (...continued from previous comment)

    'Clouding up' a system sounds suspiciously like the original implementation described above - the layout data lived in 'the cloud,' and when that cloud burst, users were screwed. I'm being harsh; Arkenberg is obviously more concerned with adding collaboration in 'the cloud,' providing the option to stuff fields through a layout-independent interface - though why on earth depend on Flash for that? - and making them easier to scrape. However, the subtleties of specifying what should and shouldn't live 'in the cloud' can elude otherwise sane policymakers and even developers; note that there are serious proposals in the world to provide emergency communications using Twitter.

    Rather than layout-independent formats, what the world is missing is a credible alternative specification for keeping layout and structured data separate but encapsulated - and 'nailing down' the layout enough that it should reproduce as well as a physical piece of paper for evidentiary purposes. ODT is not exactly built for position-perfect (let alone pagination-perfect) layout. You can throw XML sauce at the problem but you still have to solve all the same layout issues, by which point you might have reimplemented PDF (if slightly less cruftily).

    It's not easy, but until society does go paperless - and everyone learns to make references by anchors or byte counts rather than page numbers, and to diff text or compare hashes rather than holding two pages up to the light - the layout aspect is going to remain extremely important, and few parties other than Adobe can be bothered to tackle the problem. [At this point - unless you're Adobe, how much money is really left in paper and page definition languages? Enough to fund the development of a more agile alternative? I suspect it'd take government intervention (or demand on the scale of a government) to get enough of the right people interested in dead-trees again.]
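
    A minimal sketch of what "diff text or compare hashes" could look like, using only Python's standard library. The two filing texts are made-up examples and the helper names are hypothetical.

```python
import difflib
import hashlib

def fingerprint(text):
    """SHA-256 digest of the UTF-8 bytes: equal digests mean identical text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_lines(old, new):
    """Added and removed lines from a unified diff of two texts."""
    diff = list(difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""))
    # The first two entries are the '---' / '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]

# Made-up example texts standing in for two versions of a filing.
original = "Section 1. Findings.\nSection 2. Definitions.\n"
amended = "Section 1. Findings.\nSection 2. Definitions and scope.\n"

print(fingerprint(original) == fingerprint(original))  # identical text, identical digest
print(fingerprint(original) == fingerprint(amended))   # any edit changes the digest
for line in changed_lines(original, amended):
    print(line)
```

    Note that this compares text only; it says nothing about whether two renderings paginate identically, which is the layout problem discussed above.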

  42. Dave Watts 10/30/2009 2:35 p.m. (permalink)

    One thing I overlooked that's worth pointing out is that, as far as forms are concerned, the "PDF" format is, actually, XML - XFA, to be specific.

    http://en.wikipedia.org/wiki/XFA

    When an XFA form is filled out, the data itself is a separate XML document. Adobe LiveCycle uses XFA extensively. XFA forms can easily be rendered as HTML.

    Dave Watts, CTO, Fig Leaf Software

  43. Steve Holden 10/30/2009 6:37 p.m. (permalink)

    Adobe's approach reminds me of a Bill Gates talk from many years ago in which he talked about "our open Windows interface". Sadly, the marketing dollars behind this misinformation will mean that people take it seriously. I'll be fighting the good fight at GOSCON next week!

  44. Joy Fulton 10/31/2009 9:58 a.m. (permalink)

    I agree with some of the conclusions made here, and add that the web metrics of the pages and files visited on websites on any given month support that conclusion.

    For the federal government website that I work on, monthly webpage visits far outweigh the number of files downloaded monthly.

    This holds true for any non-HTML document, whether that document be pdf, ppt, doc or xls. Why that statistic exists is a matter of conjecture, but it seems to indicate that citizens prefer HTML over any other format. The speed at which pages vs. files load, SEO issues, and the usability of web pages over files may all be factors.

    Of course these observations are mine only and do not reflect those of my employer.

  45. DjacK Height 10/31/2009 3:52 p.m. (permalink)

    I agree with Dave Watts. This article is stupidly and needlessly biased against proprietary technologies such as Flash and PDF. I would guess that the writer is probably some techie-liberal who has some grand vision of open source socialism taking over the world or something, and as such will take any and every opportunity to attack that ever so evil "proprietary web". What a bunch of bogus crap! Adobe, Flash and the PDF are not the problem, and this exact same call for open options could be made without your silly claims of picketing Adobe. The fact that Adobe is putting so much time and energy into open government is absolutely a good thing; there is nothing wrong with them promoting their products, and the bitter truth (at least for you) is that these Adobe products make perfect sense for the NORMAL consumer. The vast majority of people using this data couldn't give a rat's ass about it being machine readable, and probably want it in a format that they are used to and already have installed on their machine, like PDF. In terms of the accessibility problems with Flash, I can tell you it would not take much code to be able to activate the usual keyboard shortcuts for modifying text like HTML in a browser, and even to make that text searchable for search engines.

    So again, what you have here is a legitimate call for a machine-readable format, so that techie geeks and open organizations can parse and use the data, being hijacked by open source bias. Instead of beating on your favorite whipping boys (Adobe, Flash, PDF), you should instead be praising the great things they are doing while pushing for more options, such as XML-format releases of the documents. The simple fact is that Adobe's technologies can make this content come alive and be interactive and interesting for the common man, while your open standards look like crap. Your begrudging this shouldn't lead you to nay-say the cool Adobe technologies, but rather to work with them to get what you want as well.

  46. Ryan Riley 11/01/2009 9:37 a.m. (permalink)

    What about swapping the 'P' in "PDF" for an 'R' for RDF? The Linked Data and Semantic Web movements are even better suited to data interchange than raw HTML and JSON. Given the White House just switched to Drupal, which has an RDF module, I think it would be an ironic twist on Adobe's marketing and an easy sell.

  47. Jeffrey 11/01/2009 12:39 p.m. (permalink)

    First, the choice for many government automation projects is whether they'll be automated at all. A huge amount of government work is driven by existing paper-based workflows. To just say "this should be XML" without providing a path to get from paper to this XML wonderland simply isn't going to happen.

    You're talking like the government is some sort of uncontrollable force of nature rather than an entity that works for us and that we pay for.

  48. Sigivald 11/02/2009 2:15 p.m. (permalink)

    If they're scanning letters into PDF now, the alternative is not "magically making them not scan things", but "getting a jpeg of the scan".

    Jeffrey: Correct. He is. Because in practice that's exactly how it works. Stomping your foot and demanding XML won't work.

    Someone has to scan and OCR it, or manually convert whatever the producer makes electronically (when it's available) into "proper XML".

    XML of some format nobody has yet decided on, with tools that don't exist (and haven't been approved for government use or paid for, etc.), that you can't just "save as" or "export to" in any software anyone uses anywhere.

    I can't imagine why it isn't being done - other than because there's very little demand, very little benefit (no, notional openness and transparency from letting search engines that aren't Google* look at content a little easier is "very little benefit" for people off the reservation), and immense costs and hassle.

    * I say this precisely because Google search seems to not only parse but translate to HTML pretty well. If that's not true of all PDFs, maybe you should rephrase the demand to be that they use "better PDF", not "dump PDF because (r)!".

  49. Stv 11/02/2009 2:54 p.m. (permalink)

    Maybe we should be taking a slightly different approach. Let's applaud Adobe for working with the government in putting out documents that are massively human readable (PDF - I have nothing good to say about the flash-as-open-data idea). They've created this huge body of documentation that's great for offline use by humans.

    But! Because they've created this single point-of-entry for all this data, they have the opportunity to be the biggest heroes of open data ever by simply editing their OWN format to allow it to be machine-readable. I've no idea what the "code" behind a PDF is, but, presumably, there are markers to ID images, headers, paragraphs, etc. If Adobe all of a sudden released a PDF-interpreter API which made PDFs as useful to machines as they are for humans, we'd all be thrilled. So rather than just bash Adobe, maybe let's say:

    Great first step - you've convinced gov't to make all their docs human readable. Now take step 2. We'll help you! We'll use your API. We'll give you free press for doing this good thing. and so on.

  50. John.B 11/02/2009 3:38 p.m. (permalink)

    Flash should be banned until Adobe can make it perform on any number of platforms. And by "perform", I mean less than 100% CPU utilization and no locking up the web browser (or a tab inside the web browser for you Chrome fans).

  51. Brian Duffy 11/02/2009 4:23 p.m. (permalink)

    The problem isn't PDF, it's the lawmakers providing the documents. The government could easily embed readily-parsable XML tags in PDFs, but they don't.

    PDF is an ISO standard, Adobe makes the specs fully available for free, and there are a myriad of ways to embed data within PDF for easy parsing. In fact, several state archives, the US Courts and other organizations are doing just that, right now.

  52. Sean Foushee 11/02/2009 5:29 p.m. (permalink)

    I'm not sure I see the disconnect here. If you can easily parse XML into a PDF, then why do you even care about the reverse? Just have two feeds available, the human usable, ISO standard PDF generated by the XML, and the machine-readable feed based solely on the XML.

    PDF isn't going anywhere in the near future, and as more devices, such as e-book readers, gobble up that format you won't be able to sell many content providers on the idea that everything must be open XML or even JSON (talk about a bore fest for your clients when you try to describe that format).

    The only area in which I can see your point is if you have a ton of already generated PDFs that need to be converted back to XML, but again this seems more like an issue to handle with the client to ensure they understand what format you need content provided in.

  53. xoa 11/02/2009 7:13 p.m. (permalink)

    Here's a hint-- if the data format has an ® by its name, it probably isn't great for transparency or open data.

    While I think the rest of your thoughts have merit, this one part I must absolutely disagree with. You are seriously confusing IP law here, specifically with regard to trademarks. Trademark law is absolutely 100% valuable for society, and in fact the only way to have a truly open standard is with that little ®. After all, with an open standard, anyone can implement it, and anyone can modify it in any way they wish. How then is somebody to know that a given implementation actually does, in fact, follow the standard and hasn't been embraced and extended or made subtly incompatible or whatever? The answer is that only implementations that conform may use whatever the agreed mark is. Anyone can implement a new open source OS, using significant existing code, if they want to. But they can't then just go call it "Linux" or name it "Red Hat whatever" even though those products are open, because they are protected by trademark, which protects against deceptive uses.

    So a minor nitpick, but important I think. Technically, a closed standard doesn't really need trademark protection, because no one can make a distorted alternative legally anyway. In contrast, for an open standard, the name and accumulated community reputation are the only things that differentiate.

  54. Hamranhansenhansen 11/02/2009 7:57 p.m. (permalink)

    HTML is the publishing language of the World Wide Web. If you publish something and you don't provide HTML, that is a failure, pure and simple. It's the one universal format we have.

    XML is a great way to share data, but if you are sharing documents, use HTML. Programmers know what to do with XML and that's great, but it does not beat HTML and a Web browser for the typical user. Obviously, building an HTML document from XML data and sharing both is a great solution.

    PDF is not a way to share documents, it's a way to share PRINTOUTS. Very, very different. If you give me a PDF you did not give me the document, you gave me a PRINTOUT of it. I have no use for that. Most people under 40 have no use for that.

    Flash has already been obsoleted by HTML5 and the technical problems that surround browser plug-ins of all kinds, including their lack of portability to most of the platforms the Web runs on, their crashing of the browser, their security implications, their impact on usability and accessibility, and their inability to use 3D and video decoding hardware, which makes them wear out a device's battery. And it takes only about 10% of the work and code to make a video player in HTML5 as it does to make it in Flash 10, and the HTML5 version will be much easier to maintain, is more accessible, runs in the device's hardware, and has many other advantages.

  55. Dave Watts 11/03/2009 3:27 p.m. (permalink)

    You're talking like the government is some sort of uncontrollable force of nature rather than an entity that works for us and that we pay for.

    No, rather that the government has limited funds and primary missions that it must accomplish. Publishing data in easily-consumable formats is very low on the list of priorities for every government agency I've worked with. Publishing data at all is peripheral to what most government agencies do.

    But! Because they've created this single point-of-entry for all this data, they have the opportunity to be the biggest heroes of open data ever by simply editing their OWN format to allow it to be machine-readable.

    This would require Adobe to be able to go back in time and change the existing format. The PDF file format does support the use of tagging, which makes deconstruction easier, but if you didn't tag your content when you created the PDF, Adobe can't magically make it tagged afterward.

    Flash has already been obsoleted by HTML5

    Hey look everyone, someone's posted a message FROM THE FUTURE! Time travel seems to be a recurring theme in this post.

    The plain fact is, HTML 5 support varies significantly from one browser to the next, and many end-users still run IE 6! Of course, Flash content runs fine for those IE 6 users, because everyone upgrades Flash Player if for no other reason than to view YouTube.

    Dave Watts, CTO, Fig Leaf Software

  56. James D. McCartney 11/05/2009 9:43 a.m. (permalink)

    Almost all of the comments overlook a fundamental premise of the government position. Not only does the information have to be easily available to the citizenry (not just the techies), but it has to be something that the average person can look at and have confidence that it is accurate and complete (not redacted). If the government did as you asked and provided the information in some easily malleable format, you would be the first ones clamoring that the government was hiding something and that the information had been somehow altered. The fact is, with it in PDF, you have a reasonable level of assurance that it is the original document.

    By the way, for information that is intended for reuse (e.g. NOAA weather data), the government does take steps to make it easy to use by third parties. Your reprinting of random correspondence does not really rise to that level and I'd rather see my tax dollars going to solving the nation's problems than making your job more convenient.

    James D. McCartney co-author 'If You Are Me, then Who Am I? The Personal and Business Reality of Identity Theft'

  57. Bitshifter 11/05/2009 12:07 p.m. (permalink)

    Hi All, Found this article through GCN at http://gcn.com/articles/2009/11/03/sunlight-adobe-tussle.aspx# and was very confused by the following statement; maybe someone can clarify:

    "[Bobby] Caudill pointed out that it is possible to load the documents used to create a PDF directly into the PDF file. An XML document could be incorporated in such a way, for instance. So all an end user would need to do is extract the XML document from the PDF and then parse away as usual."

    I guess my question is: since Adobe has pointed out that PDF is a standard controlled by ISO, and the capability exists to embed the raw XML data into the PDF document, is there a way to do this without buying Adobe's $300 (minimum) software? I'm currently working on using Apache FOP to generate PDFs for the non-profits I serve without any need to purchase anything (yay for open source and open standards), and I see this as a powerful capability that could lead to them being more self-sufficient (pretty reports + useful data = one stop shopping.) Yet I'm not familiar enough with the PDF format to know how I'd embed the raw data directly into the document. I mean, obviously once you fork over $300 I'm sure it's easy as pie; the trouble is I'm trying to save our agency and taxpayers money, not buy more COTS software and set a precedent for our nonprofits to do the same :) Does someone have info on how to do this without buying Adobe? Or is Adobe just working on the assumption that making government transparency efforts dependent on their proprietary product would be a great thing?

    Please note I'm not trying to start a big thing here; I don't care who looks good, I just want to give maximum power to those I serve using solutions that require no money & need some technical help. Also first time I've visited your site; very cool - keep up the good work!!

  58. anonymous 11/05/2009 8:45 p.m. (permalink)

    adobe security sucks. most exploits and malware this year can be directly tied to repeated vulnerabilities in adobe flash and reader software.. a PDF viewer company should not need a Security Incident Response team, which they have if u google it

    consider security FIRST, as it is most important. research adobe and you will find how embarrassing their security has been this year... 4 zero days i think

    the only thing that this would OPEN government and internet users up to is more vulnerabilities due to horrific programming practices at adobe

  59. Jon Aro 11/07/2009 1:39 a.m. (permalink)

    I do agree that flash is 100% crap. And I note that it has the (r) next to it. PDF however (which you will note does not) is not so bad. While not as useful as plain text, it is still a better way to go than flash (or even worse MS-Word .doc files). This is because the format is sufficiently openly documented such that there are other programs that can read and write the format, including Free Software.

  60. anonymous 11/07/2009 9:29 a.m. (permalink)

    computerworld.com/s/article/9139181/Hackers_exploit_this_year_s_fourth_PDF_zero_day

    enigmasoftware.com/gumblar-trojan-resurfaces-and-exploits-adobe-vulnerabilities/

    ^^^ examples of adobe's poor security and the results of it

    there is no excuse for releasing PATCHES that contain vulnerabilities of their own

    this software should have one purpose: view PDF files. NEVER add new features, only fix bugs

    to sum everything up.... despite adobe vulns they kept releasing the insecure version on the website (rather than giving out the fixed version they created rofl)

    softpedia.com/news/Adobe-Criticized-for-Shipping-Insecure-Reader-Version-117401.shtml

  61. Alex 11/09/2009 3:42 a.m. (permalink)

    I agree with pretty much all of this post's arguments. I would add that PDFs are useful for some, and the argument should not be framed as against PDF, but simply as pro-XML as an option for 'power users'.

    I have seen much anger against Adobe formats come from frustrated pragmatic (rather than ideological) users of those formats and applications. There are lightweight PDF alternatives (far more widespread in Linux than Windows/OSX communities), and regarding Flash containers, HTML5 should make a big impact.

  62. Joe Carmel 11/11/2009 12:01 p.m. (permalink)

    Maybe you are unaware that PDF can be parsed with developer tools such as CAM::PDF. http://search.cpan.org/~CDOLAN/CAM-PDF/ I’ve used this module to provide human readable URLs to sections of federal legislation in PDF. For example: http://legislink.org/us10?PDF-111-HR-12-IH-11 is a link to section 11 of HR 12 (introduced version) in the 111th Congress.

    This is accomplished by first parsing the PDF file with the CAM::PDF module and then redirecting the user to the correct location in the PDF file at the government’s site using PDF Open Parameters (http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf#page=5&zoom=100,0,660).
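
    The redirect step can be sketched as a small URL builder. This is an illustration only, not the legislink.org code: the function name and the example bill URL are hypothetical, and whether a given viewer honors the `page`, `nameddest` and `zoom` parameters depends on the viewer.

```python
def pdf_link(pdf_url, page=None, nameddest=None, zoom=None):
    """Append PDF Open Parameters to a URL as a '#' fragment.

    page: 1-based page number; nameddest: a named destination in the PDF;
    zoom: a scale percentage, or a (scale, left, top) tuple.
    """
    params = []
    if nameddest is not None:
        params.append(f"nameddest={nameddest}")
    if page is not None:
        params.append(f"page={page}")
    if zoom is not None:
        values = zoom if isinstance(zoom, (tuple, list)) else (zoom,)
        params.append("zoom=" + ",".join(str(v) for v in values))
    base = pdf_url.split("#", 1)[0]  # drop any existing fragment
    return base + ("#" + "&".join(params) if params else "")

# Hypothetical bill URL; the parameters mirror the Adobe link quoted above.
print(pdf_link("http://example.gov/bills/hr12ih.pdf", page=5, zoom=(100, 0, 660)))
```

    The hard part remains the first step, finding which page or destination a section lives on; that is what the parsing is for.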

  63. Ken Geis 11/16/2009 4:36 p.m. (permalink)

    I'm fine with government documents being published in PDF, as long as it's PDF/A or PDF version 1.4.

    Flash is a good idea for visualizing public data, as long as the data is available as XML, RDF, or JSON.

  64. Carmel Apple 12/09/2009 12:25 p.m. (permalink)

    "Here's a hint-- if the data format has an ® by its name, it probably isn't great for transparency or open data."

    Got a good chuckle from that piece of irony...who gave them the (R) in the first place?

    The government should be in the business of publishing documents in as many different formats as possible to account for the many different needs of the public who consume that data. RESTful APIs would greatly ease the problems involved. Want a PDF of that data? Run an XSLT transform to convert XML into the PDF. Not in XML format? Run a JSON to XML converter first. The tools exist to do all of this automatically.
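
    As one example of such a tool, a JSON to XML converter can be sketched with Python's standard library alone. The element names `record` and `item` are arbitrary choices, the sample record is illustrative, and the sketch assumes every JSON key is already a valid XML element name.

```python
import json
import xml.etree.ElementTree as ET

def to_element(tag, value):
    """Recursively convert a decoded JSON value into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(to_element(key, child))
    elif isinstance(value, list):
        for item in value:
            elem.append(to_element("item", item))
    elif value is not None:
        elem.text = str(value)
    return elem

def json_to_xml(json_text, root_tag="record"):
    """Serialize a JSON document as an XML string under a single root."""
    return ET.tostring(to_element(root_tag, json.loads(json_text)), encoding="unicode")

# Illustrative record; the "formats" list is made up for the example.
sample = '{"bill": "H.R. 3200", "pages": 1017, "formats": ["pdf", "xml"]}'
print(json_to_xml(sample))
```

    The resulting XML could then be fed to whatever XSLT pipeline produces the PDF.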


