cluelessresearch.com

political methodology, brazilian politics, etc.

Approximate matching in R

One frequent problem when working with data from multiple sources is how to match names that, at best, are approximately equal. Typical examples include matching country names from multiple data sets and matching political candidate names from electoral and legislative sources.

For country names this is not a big deal, since the datasets contain at most two hundred or so names, but even then the task can be boring and error-prone. For electoral candidates the problem escalates quickly. In Brazilian politics, for example, one has to match the 513 elected legislators in the lower chamber to the 5000+ candidates. What to do then? Hire undergraduate assistants? Outsource to India?

A better (and cheaper) way is to let the computer do the grunt work for you. Again, R comes to the rescue, this time with the agrep function (authored by David Meyer, based on C code by Jarkko Hietaniemi, with modifications by Kurt Hornik).
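To give a feel for what agrep does on its own, here is a minimal sketch with made-up country names (the vector `target` and the misspelled pattern are illustrative assumptions, not data from this post). agrep returns the indices of the elements that approximately contain the pattern, with `max.distance` bounding how far off a match may be.

```r
# Toy data (hypothetical): country names as they might appear in one source.
target <- c("United States", "Brazil", "Cote d'Ivoire")

# Look up a misspelled name from a second source.
# max.distance = 0.1 is the documented default: a fraction of the
# pattern length, rounded up to a whole number of edits.
hits <- agrep("Brasil", target, max.distance = 0.1, ignore.case = TRUE)

# Names in 'target' that approximately match the pattern.
matched <- target[hits]
print(matched)
```

Note that agrep looks for the pattern as an approximate substring, so very short patterns tend to match almost anything.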

By itself, agrep doesn’t help much, but help is only a few useful little wrappers away. One function, agrep.match [link], does the following: a) sets names to lower case and collapses multiple white spaces; b) given these transformations, matches exactly; c) matches whatever is still unmatched approximately, with a decreasing threshold of “approximateness” (is that even a word? it is like so in the R help file); d) returns the indexes of matched and unmatched names and the corresponding thresholds used.
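The four steps above can be sketched roughly as follows. This is not the actual agrep.match code: the function name `approx_match_sketch`, its arguments, the default distances, and the rule of accepting only unambiguous hits are all assumptions made for illustration. In particular, it reads the "decreasing threshold" as trying strict matches first and loosening `max.distance` step by step.

```r
# Hypothetical sketch, NOT the author's agrep.match.
# x: names to look up; y: names to match against;
# distances: max.distance values to try, from strict to loose (assumed).
approx_match_sketch <- function(x, y, distances = c(0.05, 0.1, 0.2)) {
  # a) lower-case the names and collapse repeated white space
  clean <- function(s) trimws(gsub("\\s+", " ", tolower(s)))
  cx <- clean(x)
  cy <- clean(y)

  # b) exact matches on the cleaned names (threshold recorded as 0)
  idx <- match(cx, cy)
  used <- ifelse(is.na(idx), NA_real_, 0)

  # c) approximate matches for what is left, loosening the threshold
  for (d in distances) {
    for (i in which(is.na(idx))) {
      hits <- agrep(cx[i], cy, max.distance = d)
      hits <- setdiff(hits, idx)   # skip names in y that are already taken
      if (length(hits) == 1L) {    # accept only unambiguous hits (assumption)
        idx[i] <- hits
        used[i] <- d
      }
    }
  }

  # d) index into y for each x (NA = unmatched) and the threshold used
  data.frame(x = x, y = y[idx], index = idx, max.distance = used,
             stringsAsFactors = FALSE)
}

# Made-up names, for illustration only.
elected    <- c("Jose  da Silva", "MARIA SOUZA", "Joao Pereira")
candidates <- c("maria souza", "joao pereyra", "jose da sylva", "ana lima")

res <- approx_match_sketch(elected, candidates)
print(res)
```

Rows with NA in `index` are the names that still need a human (or an undergraduate) to look at them.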

See an example of usage in this PDF document.

Note that both links are to my current and constantly updated development version. It might break or already be broken. No warranties are made, yada yada yada. The development is in its early stages, but seems to work. If anyone has comments, suggestions or bug reports, drop me an eline (e.leoni AT gmail DOT com).
