Continuous Data Analysis
Every watchdog organization and project builds its own tools to correct addresses, standardize company names, and the like, and every month, as the government releases new data, these filters have to be tweaked and updated. Information such as corporate names, subsidiaries, and addresses needs to be standardized in a way that is useful to multiple organizations, so that data sets can be combined more easily to tell larger-picture stories.
Sunlight Labs is already at work on a solution to exactly that problem, called Continuous Data Analysis (CDA). CDA provides a comprehensive approach that combines ongoing, computer-driven examination and tagging of the information in raw government records with an open architecture for processing that data. CDA is based on three distinct data operation concepts:
Extraction
A vast amount of government information is stored in unstructured electronic formats. To make this data usable, elements must be extracted manually or programmatically, and both methods are expensive. CDA seeks to assist in data extraction by providing a common toolkit and result schema built to extract a small set of specialized types from unstructured data. Read more about our protocol proposal.
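To make the idea of a common result schema concrete, here is a minimal sketch in Python. It is not the CDA toolkit or the proposed protocol: the `Extraction` record, the `PATTERNS` table, and the `extract` function are hypothetical names chosen for illustration, and the two regular expressions (dollar amounts and US ZIP codes) are simplified assumptions about what a "specialized type" might look like.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One extracted element in a shared, hypothetical result schema."""
    kind: str   # type label, e.g. "money" or "zip"
    text: str   # matched text exactly as it appears in the source
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


# Simplified, illustrative patterns; real records would need stricter rules.
PATTERNS = {
    "money": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract(raw: str) -> list[Extraction]:
    """Run every pattern over the raw text and return matches in document order."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(raw):
            found.append(Extraction(kind, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda e: e.start)


if __name__ == "__main__":
    record = "Acme Corp, 123 Main St, Springfield, IL 62704, was awarded $1,250,000.00."
    for item in extract(record):
        print(item.kind, item.text)
```

Because every extractor returns the same record shape, a consumer could mix results from different tools without knowing which one produced them, which is the point of sharing a schema.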
Normalization
Even when obtained from the government in a structured format, data is often incorrect, misspelled and inconsistent. Addresses may be formatted differently. Nicknames may be used instead of full names. Corporation names may be spelled differently. CDA includes facilities for normalizing names across multiple data sets to make them easier to integrate and analyze. Read more about our protocol proposal and how accuracy is maintained.
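As a rough illustration of name normalization, the sketch below reduces different spellings of a corporation or person to a single comparison key. It is only an assumption about how such a facility might work, not CDA's actual method: the `SUFFIXES` and `NICKNAMES` tables and both function names are hypothetical, and a real system would need far larger lookup tables and some form of fuzzy matching.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd"}
NICKNAMES = {"bill": "william", "bob": "robert", "jim": "james"}


def _words(name: str) -> list[str]:
    """Lowercase the name, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", name.lower()).split()


def normalize_company(name: str) -> str:
    """Strip trailing corporate suffixes so spelling variants share one key."""
    words = _words(name)
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def normalize_person(name: str) -> str:
    """Replace a leading nickname with its full form, if it is in the table."""
    words = _words(name)
    if words:
        words[0] = NICKNAMES.get(words[0], words[0])
    return " ".join(words)


if __name__ == "__main__":
    print(normalize_company("Acme Widget Co., Inc."))
    print(normalize_company("ACME WIDGET CORPORATION"))
    print(normalize_person("Bill Smith"))
```

Records from separate data sets can then be joined on the normalized key while the original spelling is kept alongside it for display and auditing.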
Provenance
As data is extracted and normalized, provenance metadata is attached to the results. This allows consumers to see where each piece of metadata came from and when it was added. CDA also provides a method to notify consumers when source data has been updated. Read our proposal.
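One way to picture provenance metadata and update notification is sketched below. Everything here is a hypothetical illustration rather than the proposal itself: the `Provenance` and `Annotated` records, the `annotate` helper, and the `SourceWatcher` class are invented names, and the in-process callback registry merely stands in for whatever notification mechanism the proposal defines.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Provenance:
    source: str         # hypothetical identifier of the raw record
    operation: str      # e.g. "extraction" or "normalization"
    added_at: datetime  # when this step was applied (UTC)


@dataclass
class Annotated:
    value: str
    provenance: list[Provenance] = field(default_factory=list)


def annotate(result: Annotated, source: str, operation: str) -> Annotated:
    """Append a provenance entry recording where and when a step was applied."""
    result.provenance.append(Provenance(source, operation, datetime.now(timezone.utc)))
    return result


class SourceWatcher:
    """Minimal in-process registry that calls subscribers when a source changes."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = {}

    def subscribe(self, source: str, callback: Callable[[str], None]) -> None:
        self._subscribers.setdefault(source, []).append(callback)

    def source_updated(self, source: str) -> None:
        for callback in self._subscribers.get(source, []):
            callback(source)


if __name__ == "__main__":
    item = annotate(Annotated("acme widget"), "contracts-2009.csv", "normalization")
    watcher = SourceWatcher()
    watcher.subscribe("contracts-2009.csv", lambda s: print("re-check data from", s))
    watcher.source_updated("contracts-2009.csv")
    print(item.provenance[0].operation)
```

Keeping provenance as an append-only list means a value that is first extracted and then normalized carries the history of both steps.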