cluelessresearch.com

political methodology, brazilian politics, etc.

Archive for March, 2007

Approximate matching in R

One frequent problem when working with data from multiple sources is how to match names that, at best, are approximately equal. Typical examples include matching country names from multiple data sets and matching political candidate names from electoral and legislative sources.

For country names, it is not a big deal, since the datasets at most consist of two hundred or so names, but even then the task can be boring and prone to error. For electoral candidates the problem escalates quickly. In Brazilian politics, for example, one has to match the 513 elected legislators in the lower chamber to the 5000+ candidates. What to do then? Hire monkeys undergraduate assistants? Outsource to India?

A better (and cheaper) way is to use the computer to do the grunt work for you. Again, R comes to the rescue, this time with the agrep function (authored by David Meyer, based on C code by Jarkko Hietaniemi, with modifications by Kurt Hornik).

Agrep by itself doesn’t help much, but help is only a few useful little wrappers away. One function, agrep.match [link], does the following: a) sets names to lower case and kills multiple white spaces; b) given these transformations, matches exactly; c) with what hasn’t matched, matches approximately with a decreasing threshold of “approximateness” (is that even a word? it is like so in the R help file); d) returns the indexes with matched and unmatched names and the corresponding thresholds used.
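For readers who want to see the shape of steps a) to d), here is a minimal sketch in base R. It is not the linked agrep.match: the names normalize and approx_match, the threshold values, and the rule of accepting only unambiguous hits are all my own assumptions, and the thresholds are tried from strict to loose.

```r
# Hypothetical sketch of steps a) to d); NOT the author's agrep.match.

# a) lower case, collapse runs of white space, trim the ends
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:space:]]+", " ", x)
  gsub("^ | $", "", x)
}

approx_match <- function(x, table, thresholds = c(0.05, 0.1, 0.2)) {
  xn <- normalize(x)
  tn <- normalize(table)

  # b) exact matching on the normalized names
  idx  <- match(xn, tn)
  used <- ifelse(is.na(idx), NA_real_, 0)   # 0 marks an exact match

  # c) approximate matching on the leftovers, one threshold at a time
  #    (assumed order: strictest first, then progressively looser)
  for (th in thresholds) {
    for (i in which(is.na(idx))) {
      hit <- agrep(xn[i], tn, max.distance = th, fixed = TRUE)
      if (length(hit) == 1) {               # keep only unambiguous hits
        idx[i]  <- hit
        used[i] <- th
      }
    }
  }

  # d) index into `table` (NA if unmatched) and the threshold that was used
  data.frame(name = x, match = idx, threshold = used,
             stringsAsFactors = FALSE)
}

# Toy example with made-up inputs
res <- approx_match(c("Brazil", "United  States", "Argentna"),
                    c("argentina", "brazil", "united states", "chile"))
print(res)
```

Keep in mind that agrep looks for an approximate match of the pattern anywhere inside each candidate string, so short names can hit longer ones; anything matched at a loose threshold deserves a look by eye.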

See example of usage in this pdf document.

Note that both links are to my current and constantly updated development version. It might break or already be broken. No warranties are made, yada yada yada. The development is in its early stages, but seems to work. If anyone has comments, suggestions or bug reports, drop me a line (e.leoni AT gmail DOT com).


New maps

Jos complains about my 1995 technique for creating the animations (animated gifs) and about the lack of interactivity. Perhaps this flash(y) version will be of greater appeal to him.

The whole 1982-2006 period is posted. If you pay attention, the map changes slightly to reflect changes in the distribution of seats across the nation. E.g. the creation of Tocantins following the 1988 Constitution (1990 map), or the increase in the number of deputies elected from São Paulo in 1994.


Spatial distribution of Parties in Brazil

I’ve been collecting Brazilian electoral data for my dissertation for some time, and have always wondered about how to display the somewhat massive data available in an efficient manner. Take the best case scenario: 27 districts (states) x 4 elections (since 1994) x 7 largest parties=756 data points. This is a lot of numbers to look at in a table! Imagine using aggregate data at the municipality level: 5000 x 4 x 7! No, you do the math…

Maps, of course, are one way to display the data. The major problem then is that the widely varying population density in Brazil would produce a misleading map of the voting distribution across the country. That is the reason I am considering using cartograms, as discussed in the last post (link). Displayed below is a whole set of cartograms displaying data for the Câmara dos Deputados for the past four elections. The idea now is that areas in the cartogram should be proportional to the number of seats assigned to each district (which in Brazil are the states). Given the high degree of malapportionment, the cartogram looks somewhat different from the one based on population or vote totals we presented previously. The time dimension is presented as a movie, so it is easy to follow the spatial distribution of seats for each party throughout the recent elections.

animation.gif

Now I only have to figure out how to put this in paper format…


Cartogram for the 2006 election, 2nd round

The Brazilian electoral court (TSE - Tribunal Superior Eleitoral) has finally posted the 2006 election results in a format suitable for researchers. This past week I got the data in shape for analysis in my dissertation and decided it was a good time to do some charts. As usual, the plots were done in R, this time using the maptools package.

The second round was a landslide in favour of the incumbent, Lula da Silva, from the PT (Workers’ Party). He got around 61% of the votes, while Geraldo Alckmin got 39%.

Perhaps more interesting is the spatial distribution of the votes. The individual units in the map are what the Brazilian Geographic and Statistical Institute (IBGE) calls “mesoregions”, but the original data is by municipality and electoral zone.

Original projection

It is noticeable how the Northeast is overwhelmingly red, indicating Lula won there by extremely wide margins. On the other hand, margins were much thinner in the south, in the center-west and in São Paulo.

I’ve always been dissatisfied with maps like this, since they overrepresent areas such as the west of the country, where the population (and therefore vote) density is much lower than on the coast. Ditto for country areas versus the big cities. Yet, the geographical representation allows us to grasp the overall pattern and correlate it with facts that we know. For example, the northeast is much poorer than the south, so we immediately recognize that Lula did worse in richer areas.

Cartograms are a way to “correct” the overrepresentation of low density areas. By correction, of course, we mean distorting, but that is the whole point of the procedure. For voting and other social science data, geographic distance is just as arbitrary. Gastner and Newman invented one method to produce cartograms that seems to work very well in practice (paper here). The original software was written in C, but there is a Java version by Frank Hardisty which I used, since it uses shapefiles as input and output. Click here to take a look at maps for the 2004 US election.

Cartogram

Most of the Brazilian west is dramatically shrunk, while the big cities (particularly São Paulo) are blown up several times over. In fact, I find the cartogram particularly helpful in showing the votes in the big metropolitan areas, and comparing them to areas in the countryside. Although interesting, I wonder if the cartogram is too distorted to be useful, and would be interested in hearing other opinions.


Hairy Toad

I wonder if it is a common practice in other countries for the big press to call the president names like “sapo barbudo” (hairy toad). Or is it just a sign of prejudice against an uneducated president born in the poor northeast?

sapo barbudo


Conditioning

Brian Mulloy, one of the founders of Swivel, wrote a nice comment on my post explaining how Swivel is in fact able to condition on data categories. For example, if you want to do a graph highlighting a particular category, or even using only data from a particular category, you are able to. The process has to start in the dataset view.

Dmitry Dimov calory and Dmitry Dimov costs

See his comment for the full explanation.

I guess I have to spend more time on it, but it still doesn’t seem to be able to do what I want. I downloaded my own data in csv format and created a couple of figures using the ggplot package in R. I don’t expect Swivel to have the same flexibility, since its objectives are very different from those of academic statistical software. However, I don’t see why in the not so distant future something like this would not be possible in a web application.

color by party

color by party, one plot per state

The code:

Read more


Stata’s outreg equivalent for R

[Update 4/9/2008 -- A comment below suggests using the R package memisc, function mtable. It seems to be a useful package overall (much more stuff than just tables), and the tables it generates look pretty good. I didn't know about it, thanks for pointing it out.]

A nice feature of Stata is the large number of ado files helping in the creation of tables of coefficient estimates that you can cut and paste into your (yuck!) Word document, or, much more elegantly, produce LaTeX code for your table that you can include in your LaTeX document. R has a somewhat similar feature with the package xtable, but it currently lacks the ability to produce a single table in which the columns have the results from different models or specifications.

Some time ago Ajay Narottam Shah published some code at the R-help list to do just that. I took just part of it and tweaked it a bit. You give it a matrix of coefficients and a matrix of standard errors and it produces the LaTeX code.

Here it is (link):

latex.table <- function(coef.mat, se.mat, digits=3, table.command=TRUE) {
  ## coef.mat, se.mat: one row per term, one column per model
  nc <- ncol(coef.mat)
  ## round to the requested number of digits (was hard-coded to 3)
  coef.mat <- round(coef.mat, digits)
  se.mat <- round(se.mat, digits)
  text.now <- NULL
  if (table.command) {
    text.now <- c(text.now,"\\begin{table}\n")
    text.now <- c(text.now,"\\centering\n")
  }
  ## one centered column per model, plus one for the term names
  text.now <- c(text.now,"\\begin{tabular}[R]{",rep("c",nc+1),"}\n")
  text.now <- c(text.now,"\\hline\n")
  ## header row with the model names
  for (j in 1:ncol(coef.mat)) {
    text.now <- c(text.now," & ", colnames(coef.mat)[j])
  }
  text.now <- c(text.now,"\\\\\n\\hline\n")
  for (i in 1:nrow(coef.mat)) {
    ## print coef estimates (empty cell when the term is not in the model)
    text.now <- c(text.now,rownames(coef.mat)[i])
    for (j in 1:ncol(coef.mat)) {
      if (is.na(coef.mat[i,j])) {
        text.now <- c(text.now," & ")
      } else {
        text.now <- c(text.now," & ", coef.mat[i, j])
      }
    }
    text.now <- c(text.now,"\\\\\n")
    ## print SEs in parentheses on the line below the estimates
    for (j in 1:ncol(coef.mat)) {
      if (is.na(se.mat[i,j])) {
        text.now <- c(text.now," & ")
      } else {
        text.now <- c(text.now," & ", sprintf("(%s)", se.mat[i,j]))
      }
    }
    text.now <- c(text.now,"\\\\[1mm]\n")
  }
  text.now <- c(text.now,"\\\\\n")
  text.now <- c(text.now,"\\hline")
  text.now <- c(text.now,"\n")
  text.now <- c(text.now,"\\end{tabular}\n")
  if (table.command) text.now <- c(text.now,"\\end{table}\n")
  ## return the whole table as a single string
  paste(text.now,collapse="")
}

So, for

tmp.estimates
            cluster jack-knife   lmer lmerMcmc edvreg
iquality     -0.503     -0.503 -0.414   -0.413 -0.454
iqualityrep   0.030      0.030  0.019    0.019  0.019
gdppc         0.022      0.022  0.024    0.023  0.047

and

tmp.se
  cluster jack-knife   lmer lmerMcmc edvreg
1  0.1230     0.1623 0.1593    0.213 0.1905
2  0.0056     0.0065 0.0092    0.012 0.0092
3  0.0149     0.0280 0.0327    0.042 0.0634

latex.table(tmp.estimates,tmp.se,table.command=TRUE)

"\\begin{table}\n\\centering\n\\begin{tabular}[R]{cccccc}\n\\hline\n & cluster & jack-knife & lmer & lmerMcmc & edvreg\\\\\n\\hline\niquality & -0.503 & -0.503 & -0.414 & -0.413 & -0.454\\\\\n & (0.123) & (0.162) & (0.159) & (0.213) & (0.19)\\\\[1mm]\niqualityrep & 0.03 & 0.03 & 0.019 & 0.019 & 0.019\\\\\n & (0.006) & (0.006) & (0.009) & (0.012) & (0.009)\\\\[1mm]\ngdppc & 0.022 & 0.022 & 0.024 & 0.023 & 0.047\\\\\n & (0.015) & (0.028) & (0.033) & (0.042) & (0.063)\\\\[1mm]\n\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n"

which you can dump to a file as:


cat(latex.table(tmp.estimates,tmp.se,table.command=TRUE),file="table.tex")

producing

table

Nifty!
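If you are wondering where the two input matrices come from, here is one way to assemble them from fitted models. This is only a sketch under my own assumptions: the built-in mtcars data and the two lm specifications are purely illustrative, and collect is a made-up helper name. Terms that a model does not include come out as NA, which is the case latex.table turns into an empty cell.

```r
# Two nested specifications on the built-in mtcars data (illustration only)
fits <- list(
  short = lm(mpg ~ wt, data = mtcars),
  long  = lm(mpg ~ wt + hp, data = mtcars)
)

# Union of all term names, so every model gets a full column
terms.all <- unique(unlist(lapply(fits, function(f) names(coef(f)))))

# Hypothetical helper: pull one column of each model's coefficient table
collect <- function(column) {
  out <- sapply(fits, function(f) {
    est <- summary(f)$coefficients[, column]
    est[terms.all]            # terms missing from this model become NA
  })
  rownames(out) <- terms.all  # restore names lost for the NA entries
  out
}

coef.mat <- collect("Estimate")    # terms in rows, models in columns
se.mat   <- collect("Std. Error")

# With latex.table() from this post loaded, the table is then:
# cat(latex.table(coef.mat, se.mat), file = "table.tex")
```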


R in OS X - Making quartz device work from terminal or Emacs

Buggy, but it works! I was getting really tired of X11 in OS X…

You need the Apple developer tools installed (they come on the Tiger DVD, or can be downloaded from Apple Developer Connection).

install.packages("CarbonEL", repos="http://rforge.net/", type="source")
library(CarbonEL)

and then

quartz()

opens a graphic window in OS X.


Nike+

After losing my dear last-gen iPod shuffle on the plane, I “had” to buy another iPod. I decided to get a nano, and on a whim got the Nike+ sensor, since I had decided to start running again. In theory you should buy special Nike sneakers that have a place in the sole to put the sensor in.

Since I usually don’t have $100 lying around, I spent $30 on a pair of New Balance shoes I found on sale, and got this totally geeky thing that includes a velcro pouch that you can attach to your shoe laces.

So, there I went, first on the treadmill in the gym. After some calibration, and tying the shoes a little tighter, everything was fine. Very precise instrument, for the price. Once I got back to Rochester for the weekend, I decided to go out for that nice run in the snow (no gym membership there) and, to my surprise, the sensor stopped working… After some experimentation, I am pretty sure the thing doesn’t like the cold. (Who can blame it, really?) With the actual Nike sneakers it wouldn’t be much of a problem, since it should be fairly warm inside your shoes… then again, $100 for a pair of shoes…

This pretty much made it impossible for me to complete my goal set at the Nike+ website, 60 miles in a month. I didn’t actually run 60 miles, more like 45, but only 6 runs were recorded, as displayed in this incredibly junky chart:

Nike Plus runs

Which brings us to my second point in this post. I am not a fan of bar charts in general, but this one takes the cake. Note how the bars start at -1!!! Amazing, you don’t even start and you have already run a mile… take that as a morale booster!

In any case, it does allow one to pull the data and display it in all its (text) glory in the sidebar that you should be able to see on the right. I used the WordPress plugin Nike+ stats, in case you are wondering.
