Result Protocol

The example below is the output of a "data chewer" that finds names of members of Congress within text. The normalized name is returned along with an accuracy rating for the match and the "type" of the found element.


<?xml version="1.0" ?>
<entities name="CongresspersonFinder" version="1.0">
	<entity normalized="Senator John Cornyn" type="senator">
		<attribute name="id">fakeopenID463</attribute>
		<matchtext endpos="4992" score="1.0" startpos="4973">
			Senator John Cornyn
		</matchtext>
		<matchtext endpos="5074" score="0.6" startpos="5064">
			Mr. Cornyn
		</matchtext>
	</entity>
	<entity normalized="Senator Edward Kennedy" type="senator">
		<attribute name="id">
			fakeopenID487
		</attribute>
		<matchtext endpos="870" score="1.0" startpos="845">
			Senator Edward M. Kennedy
		</matchtext>
		<matchtext endpos="5292" score="1.0" startpos="5267">
			Senator Edward M. Kennedy
		</matchtext>
		<matchtext endpos="5632" score="0.6" startpos="5621">
			Mr. Kennedy
		</matchtext>
	</entity>
</entities>

Extraction and Normalization

Each chewer locates specific elements within the data. Each element found is normalized, and the raw matches are grouped under their normalized value. Each raw match includes the start and end position of the element within the text. The score indicates how accurately the matched text corresponds to the normalized value. See Match Scoring below for more information about score values.
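As a rough illustration of how a consumer might read this structure, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The function name parse_results and the dictionary keys are our own choices for illustration and are not part of the protocol. The sample is a trimmed copy of the example output above.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the example result document shown earlier.
SAMPLE = """<?xml version="1.0" ?>
<entities name="CongresspersonFinder" version="1.0">
  <entity normalized="Senator John Cornyn" type="senator">
    <attribute name="id">fakeopenID463</attribute>
    <matchtext endpos="4992" score="1.0" startpos="4973">
      Senator John Cornyn
    </matchtext>
    <matchtext endpos="5074" score="0.6" startpos="5064">
      Mr. Cornyn
    </matchtext>
  </entity>
</entities>"""


def parse_results(xml_text):
    """Return one dict per entity, with its raw matches grouped under it.

    The key names used here are hypothetical, chosen only for this sketch.
    """
    root = ET.fromstring(xml_text)
    results = []
    for entity in root.findall("entity"):
        matches = []
        for match in entity.findall("matchtext"):
            matches.append({
                # The example pads the matched text with whitespace.
                "text": (match.text or "").strip(),
                "start": int(match.get("startpos")),
                "end": int(match.get("endpos")),
                "score": float(match.get("score")),
            })
        results.append({
            "normalized": entity.get("normalized"),
            "type": entity.get("type"),
            "matches": matches,
        })
    return results


results = parse_results(SAMPLE)
for entity in results:
    for match in entity["matches"]:
        print(entity["normalized"], match["text"], match["score"])
```

Grouping the matches under the entity mirrors the nesting of the XML, so a consumer can compare all scores for one normalized value in a single place.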

Attributes

Additional chewer-specific attributes may be attached to the results with the attribute element. In this example an ID attribute is added which corresponds to the Sunlight API ID of the congressperson.

Zero or more attribute elements may be included. If more than one attribute has the same name, the values may be treated as a list or array.
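One way a consumer could apply the list rule is to collect every attribute value into a list keyed by name. This is only a sketch. The helper name collect_attributes is our own, and the "alias" attribute is hypothetical, invented here to show a repeated name; it does not appear in the example output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_attributes(entity):
    """Map each attribute name to the list of values given for it."""
    values = defaultdict(list)
    for attr in entity.findall("attribute"):
        values[attr.get("name")].append((attr.text or "").strip())
    return dict(values)


# "alias" is a made-up attribute name used only to show repetition.
entity = ET.fromstring(
    '<entity normalized="Senator Edward Kennedy" type="senator">'
    '<attribute name="id">fakeopenID487</attribute>'
    '<attribute name="alias">Ted Kennedy</attribute>'
    '<attribute name="alias">Edward M. Kennedy</attribute>'
    '</entity>'
)
attrs = collect_attributes(entity)
print(attrs)
```

Always returning a list, even for a single value, saves the caller from checking whether a name was repeated.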

Match Scoring

In the above example, the match scores are arbitrary. Is there a good, quantifiable way to indicate match accuracy between disparate data types? We're not smart enough to figure that out, so we propose the following arbitrary scoring scheme.

0.0 - 0.19
No Match: There is no confidence that the found value is the same as the normalized value. Ignore matches that are given this score.
0.2 - 0.39
Unlikely Match: There is a very slight chance of a match; unlikely matches should generally be ignored. They should still bear at least some resemblance to the data being matched, but the threshold is very low. For example, the name Edward K. could possibly refer to Sen. Kennedy, so a senator finder written to take even poor matches into account might return such a match.
0.4 - 0.59
Possible Match: There is some possibility of a match, particularly if this entity is found elsewhere in the document. For example, the last name Kennedy may match in a senator finder, but before assuming a match it should be checked whether Ted Kennedy is mentioned elsewhere with a higher score, as the name may refer to another Kennedy, or perhaps the Kennedy Center.
0.6 - 0.79
Likely Match: Should be considered, perhaps in light of other context or pending human review. For example, a senator finder may consider Mr. Kennedy a likely match as it clearly refers to a person.
0.8 - 0.99
Nigh Exact Match: A near-exact match, likely to be incorrect only when two entities have very similar or identical names. For example, Edward Kennedy is likely to refer to the Senator, although there is a slight chance it refers to another individual, so it should not be considered a perfect match.
1.0
Exact Match: An exact match, with no real doubt that it is correct rather than a false positive. For example, Senator Edward M. Kennedy leaves little doubt that it refers to the Massachusetts Senator.
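The scheme can be sketched in Python as a lookup from score to category. This assumes scores fall on the 0.0 to 1.0 scale used in the example output, with bands starting at 0.2, 0.4, 0.6 and 0.8. The names score_label and accepted_scores are our own. The acceptance policy (keep likely-or-better matches, and keep possible matches only when the same entity also has a likely-or-better match elsewhere) is one reading of the guidance above, not a rule of the protocol.

```python
# Lower bound of each band, highest first. Assumes a 0.0 to 1.0 scale.
BANDS = [
    (1.0, "Exact match"),
    (0.8, "Nigh exact match"),
    (0.6, "Likely match"),
    (0.4, "Possible match"),
    (0.2, "Unlikely match"),
    (0.0, "No match"),
]


def score_label(score):
    """Return the category name for a match score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    for lower, label in BANDS:
        if score >= lower:
            return label


def accepted_scores(scores):
    """Filter the scores of one entity's matches (a hypothetical policy).

    Likely-or-better scores are kept. Possible scores are kept only if
    the entity also has at least one likely-or-better score.
    """
    corroborated = any(s >= 0.6 for s in scores)
    return [s for s in scores if s >= 0.6 or (s >= 0.4 and corroborated)]


print(score_label(0.6))
print(accepted_scores([1.0, 0.6, 0.45, 0.1]))
```

A consumer that wants human review for likely matches could raise the 0.6 cutoff rather than change the band table.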

Scoring, however, is a somewhat subjective problem, in many ways subject to the whims of the implementer of each particular chewer. If all chewers were written by a single individual, this might not be a problem, as one could learn what a score of "0.75" meant. We feel that much of the value in separating the "data chewers" from the remainder of the system is the possibility for anybody to provide one, and because of this it is important that match scores can be interpreted in a meaningful way across chewers.