cluelessresearch.com

political methodology, brazilian politics, etc.

Regression plots

I am writing a paper with a coauthor promoting the use of graphs instead of tables in political science. We did some research on the current use of tables and graphs and found out that a substantial proportion of the tables is devoted to the display of regression results.

So we thought that creating graphs to display regression tables was essential to our task. That is, to turn this:

into a nice graph.

We are currently revising the paper for (re)submission, and are still undecided on how to display such graphs. Here are some of the several revisions, from one of the first (back in November)

This one from later in the same month

And the two I am currently looking at, with the boxes around the plots taken out. This one is minimalist:

And the next one has the x-axis repeated in each plot:

My coauthor thinks the boxes are necessary; I cite Tufte over and over and say they aren’t. I think we will have to arrange an intercontinental boxing match to settle the issue.


Stata and accents (diacriticals)

Twice already I’ve emailed Stata support to complain about character encoding issues in Stata. In a nutshell, the problem is that if a dataset has diacriticals and was created on a Windows machine, the characters will be mangled on a Mac, and vice versa. I assume there are similar problems on Linux. This post is fairly long; keep reading if you have similar problems.
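
To see the kind of mangling involved, here is a small Python sketch. It assumes the Windows side wrote the text as Windows-1252 and the Mac side reads it as Mac Roman. The post does not say which encodings are actually in play, so treat both as illustrative guesses, and the sample string is made up.

```python
# Assumption: text saved as Windows-1252, then read as Mac Roman.
original = "São Paulo"

raw_bytes = original.encode("cp1252")    # bytes a Windows machine might write
mangled = raw_bytes.decode("mac_roman")  # the same bytes as a Mac might show them
print(mangled)

# Every byte maps to some Mac Roman character, so the damage can be undone
# as long as the mangled text has not been edited or re-saved.
repaired = mangled.encode("mac_roman").decode("cp1252")
print(repaired == original)
```

The plain ASCII letters survive because both encodings agree on them; only the accented characters, which live above byte 127, come out wrong.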



The move

So, I decided to spend some time back in my home country for a variety of reasons, and between the move and just watching the scenery, I haven’t been able to find time to post. I had to buy a new computer to take with me and left behind my desktop PC, which lately has not been much more than a glorified jukebox. Moving the iTunes library from the PC to my shiny new Mac was not very painful, but the songs with diacriticals in the file names (ç˜^ç, etc) didn’t load correctly on the Mac. So I wrote this evil, evil shell script, which goes down the directory structure renaming all your files and folders, substituting diacriticals with regular ASCII letters.

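#!/bin/sh
# Visit children before their parent directories (-depth), so a renamed
# directory never invalidates paths that are still waiting to be processed.
find . -depth ! -name . | while IFS= read -r S1
do
    DIR=`dirname "$S1"`
    # Strip diacriticals from the last path component only.
    BASE=`basename "$S1" | unaccent`
    S2="$DIR/$BASE"
    if [ "$S1" != "$S2" ]
    then
        echo "renaming $S1 to $S2"
        ##mv "$S1" "$S2"
    fi
done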

If you want the actual renaming to be done, uncomment the line with the command mv "$S1" "$S2". You will also need this rather useful Perl script to remove the diacriticals (it should be called unaccent and be on your path). It comes from http://ahinea.com/en/tech/perl-unicode-struggle.html

#!/usr/bin/perl -w
require Encode;
 use Unicode::Normalize;

while (<>) {  # read the input line by line; each line is in $_

   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment:
   #$_ = Encode::decode( 'iso-8859-1', $_ );  

   s/\xe4/ae/g;  ##  treat characters ä ñ ö ü ÿ
   s/\xf1/ny/g;  ##  this was wrong in previous version of this doc
   s/\xf6/oe/g;
   s/\xfc/ue/g;
   s/\xff/yu/g;

   $_ = NFD( $_ );   ##  decompose (Unicode Normalization Form D)
   s/\pM//g;         ##  strip combining characters

   # additional normalizations:

   s/\x{00df}/ss/g;  ##  German eszett “ß” -> “ss”
   s/\x{00c6}/AE/g;  ##  Æ
   s/\x{00e6}/ae/g;  ##  æ
   s/\x{0132}/IJ/g;  ##  IJ
   s/\x{0133}/ij/g;  ##  ij
   s/\x{0152}/Oe/g;  ##  Œ
   s/\x{0153}/oe/g;  ##  œ

   tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
   tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
   tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊʼnŋØøſ
   tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/;                   # ÞŦþŧ

   s/[^\0-\x7f]//g;  ##  clear everything that is not ASCII; optional

   print $_;
 }

My external hard disk with my music is formatted in NTFS, so I needed a way to get write access to it if I wanted to keep my files there. I managed to do it flawlessly following these instructions. MacFUSE (the tool doing the grunt work) appears to be CPU intensive, but for the relatively small music files it worked fine. If you have questions, I would be glad to write up more carefully what I did. Leave a comment (you have to be registered, but anyone can register) or email me at e dot leoni at gmail dot com.


Word vs. LaTeX struggles? Try Markdown!

Here is a very common situation. You are writing a paper or report and are addicted to all the goodies (fonts, bibliography, and most of all, math) LaTeX offers to the “chosen ones”. However, your coauthor or coworker is not inclined to spend the inordinate amount of time required to learn LaTeX. What do you do?

If your coauthor is willing to use something other than Word, you could convince him/her to try LyX. LyX is a “what you see is what you mean” editor using a LaTeX backend. My experience with it is limited, but I know people who are very happy with it. If you think you want to learn LaTeX but are afraid to take the plunge, this is a very good option.

Not everyone is happy to try new software, however. I think Markdown might provide a useful way to cooperate.

The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. While Markdown’s syntax has been influenced by several existing text-to-HTML filters, the single biggest source of inspiration for Markdown’s syntax is the format of plain text email.

How does it look? You can see for yourself here. The syntax is very natural. Headings can be marked as:

A First Level Header
====================

A Second Level Header
---------------------

or as

# First Level Header

## Second Level Header

*This* is how you put emphasis. An introduction to markdown can be found here.

Once your text is ready for publication, you can convert Markdown texts into LaTeX or .rtf (among other choices) using the extremely cool Pandoc. Using Pandoc, you can write math in LaTeX style (like so: $y_{ic}=\gamma_{00}+\gamma_{10}x_{ic}+\gamma_{01}z_c+\gamma_{11}x_{ic} \cdot z_c+v_{ic}$).

Pandoc allows other LaTeX code in the text, such as tables and equations, so you can minimize the going back and forth between Markdown and LaTeX.

Would this work in a coauthoring environment? Would people freak out at the mere thought of using a markup tool? I will let you know when I try this out myself.


Prostitution: the oldest profession in the world?

In the United States, prostitution is legal in only two states: Nevada and (gasp!) Rhode Island. It is commonly referred to as the “oldest profession in the world”, but not much evidence is presented to support that claim. Finally, it seems that a group of Yale researchers has found some evidence by studying Capuchin monkeys.

The researchers (M. Keith Chen, an economist, and psychologists Venkat Lakshminarayanan and Laurie R. Santos) were interested in finding out whether species other than humans also act in a market-like fashion and/or exhibit biases similar to those observed among us. They ended up observing behavior that might support the “oldest profession” contention, I think. You see, like more than a couple of Homo sapiens I know, Capuchin monkeys are exclusively focused on food and sex. The researchers intended to use their food-seeking behavior to study their actions when currency and trade are introduced into their environment.

The research was published in the June 2006 issue of the Journal of Political Economy (link to the paper), and has been discussed in the Freakonomics blog.

It turns out that the, ahem, other interests of the Capuchin monkeys showed up in their research as well. Stephen J. Dubner, the other freakonomic, tells the story during a keynote at the AIIM Expo:

The topper to Dubner’s stories to the AIIM audience involved an
incident in which one of the capuchins threw a tray of washers that
ended up spilling into the general population area. The monkeys, as
expected, fought for the coins and, except for one, were easily bribed
with the opportunity to purchase food in order for researchers to get
the washers back.

‘Out of the corner of his eye, Chen saw that one monkey gave a coin to
another (instead of rushing to exchange it for treats.) He thinks, am
I witnessing the first instance of monkey altruism? No. He was
actually witnessing something he said he really wished he hadn’t
seen,’ said Dubner.

After a brief grooming ritual, the monkeys who exchanged the coin
started to have sex. Immediately after the incident, the paid monkey
went over to Chen to get food in exchange for returning the coin,
Dubner said.

Well, it may not prove that prostitution actually is the oldest profession. But, in this market at least, it emerged pretty damn soon!

(via Slashdot)


Election 2002 data (Brazil)

The TSE, for unknown reasons, pulled the Access files for the 2002 election. I haven’t looked at them in a while, but here are the two that I have:

estcand2002

and

VotoMun_DadosCand_2002.mdb.zip

These are big files. Let me know if you find out that just one of the two is needed or if you have problems downloading: eleoni at gmail dot com


Approximate matching in R

One frequent problem when working with data from multiple sources is how to match names that, at best, are approximately equal. Typical examples include matching country names from multiple data sets and matching political candidate names from electoral and legislative sources.

For country names, it is not a big deal, since the datasets consist of two hundred or so names at most, but even then the task can be boring and prone to error. For electoral candidates the problem escalates quickly. In Brazilian politics, for example, one has to match the 513 elected legislators in the lower chamber to the 5000+ candidates. What to do then? Hire monkeys, er, undergraduate assistants? Outsource to India?

A better (and cheaper) way is to use the computer to do the grunt work for you. Again, R comes to the rescue, this time with the agrep function (authored by David Meyer, based on C code by Jarkko Hietaniemi, with modifications by Kurt Hornik).

Agrep by itself doesn’t help much, but help is only a few useful little wrappers away. One function, agrep.match [link], does the following: a) sets names to lower case and kills multiple white spaces; b) given these transformations, matches exactly; c) with what hasn’t matched, matches approximately with a decreasing threshold of “approximateness” (is that even a word? it is like so in the R help file); d) returns the indexes with matched and unmatched names and the corresponding thresholds used.
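
As a rough illustration of steps a) through d), here is a minimal Python sketch. It is not the agrep.match wrapper: it swaps R's agrep for the similarity ratio from Python's difflib, and the function names (normalize, approx_match), the cutoff values and the sample names are all made up for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Step a: lower-case a name and collapse runs of white space."""
    return re.sub(r"\s+", " ", name.strip().lower())


def similarity(a, b):
    """Similarity ratio between two strings, from 0 (nothing shared) to 1."""
    return SequenceMatcher(None, a, b).ratio()


def approx_match(x, y, cutoffs=(0.95, 0.90, 0.85, 0.80)):
    """Match each name in x to at most one name in y.

    Returns (matches, unmatched). matches maps an index in x to a pair
    (index in y, cutoff used), where a cutoff of 1.0 marks an exact match
    after normalization. unmatched lists the indexes in x left over.
    """
    nx = [normalize(s) for s in x]
    ny = [normalize(s) for s in y]
    matches = {}
    free_y = set(range(len(ny)))  # indexes in y not yet claimed

    # Step b: exact matches on the normalized names.
    for i, s in enumerate(nx):
        for j in sorted(free_y):
            if s == ny[j]:
                matches[i] = (j, 1.0)
                free_y.remove(j)
                break

    # Step c: approximate matches, loosening the cutoff on each pass.
    for cutoff in cutoffs:
        for i, s in enumerate(nx):
            if i in matches or not free_y:
                continue
            best = max(sorted(free_y), key=lambda k: similarity(s, ny[k]))
            if similarity(s, ny[best]) >= cutoff:
                matches[i] = (best, cutoff)
                free_y.remove(best)

    # Step d: report what matched, at which cutoff, and what did not.
    unmatched = [i for i in range(len(nx)) if i not in matches]
    return matches, unmatched


elected = ["Luiz Inacio  Lula da Silva", "GERALDO ALCKMIN", "Jose Serra"]
candidates = ["geraldo alckmin", "Luis Inacio Lula da Silva", "Heloisa Helena"]
matches, unmatched = approx_match(elected, candidates)
print(matches)
print(unmatched)
```

Recording the cutoff along with each match is the useful part: anything matched exactly can be trusted, while matches made at the loosest cutoffs are the ones worth checking by hand.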

See example of usage in this pdf document.

Note that both links are to my current and constantly updated development version. It might break or already be broken. No warranties are made, yada yada yada. The development is in its early stages, but seems to work. If anyone has comments, suggestions or bug reports, drop me a line (e.leoni AT gmail DOT com).


New maps

Jos complains about my 1995 technique for creating the animations (animated gifs), and about the lack of interactivity. Perhaps this flash(y) version will be of greater appeal to him.

The whole 1982-2006 period is posted. If you pay attention, the map changes slightly to reflect changes in the distribution of seats across the nation. E.g. the creation of Tocantins following the 1988 Constitution (1990 map), or the increase in the number of deputies elected from São Paulo in 1994.


Spatial distribution of Parties in Brazil

I’ve been collecting Brazilian electoral data for my dissertation for some time, and have always wondered about how to display the somewhat massive data available in an efficient manner. Take the best case scenario: 27 districts (states) x 4 elections (since 1994) x 7 largest parties=756 data points. This is a lot of numbers to look at in a table! Imagine using aggregate data at the municipality level: 5000 x 4 x 7! No, you do the math…

Maps, of course, are one way to display the data. The major problem then is that the widely varying population density in Brazil would produce a misleading map of the voting distribution across the country. That is the reason I am considering using cartograms, as discussed in the last post (link). Displayed below is a whole set of cartograms displaying data for the Câmara dos Deputados for the past four elections. The idea now is that areas in the cartogram should be proportional to the number of seats assigned to each district (which in Brazil are the states). Given the high degree of malapportionment, the cartogram looks somewhat different from the one based on population or vote totals we presented previously. The time dimension is presented as a movie, so it is easy to follow the spatial distribution of seats for each party throughout the recent elections.

animation.gif

Now I only have to figure out how to put this in paper format…


Cartogram for the 2006 election, 2nd round

The Brazilian electoral court (TSE - Tribunal Superior Eleitoral) has finally posted the 2006 election results in a format suitable for researchers. This past week I got the data in shape for analysis in my dissertation and decided it was a good time to do some charts. As usual, the plots were done in R, this time using the maptools package.

The second round was a landslide in favour of the incumbent, Lula da Silva, from the PT (Workers’ Party). He got around 61% of the votes, while Geraldo Alckmin got 39%.

Perhaps more interesting is the spatial distribution of the votes. The individual units in the map are what the Brazilian Geographic and Statistical Institute (IBGE) calls “mesoregions”, but the original data is by municipality and electoral zone.

Original projection

It is noticeable how the Northeast is overwhelmingly red, indicating Lula won there by extremely wide margins. On the other hand, margins were much thinner in the south, in the center-west and in São Paulo.

I’ve always been dissatisfied with maps like this, since they overrepresent areas such as the west of the country, where the population (and therefore vote) density is much lower than on the coast. Ditto for rural areas versus the big cities. Yet the geographical representation allows us to grasp the overall pattern and correlate it with facts that we know. For example, the northeast is much poorer than the south, so we immediately recognize that Lula did worse in richer areas.

Cartograms are a way to “correct” the overrepresentation of low density areas. By correction, of course, we mean distorting, but that is the whole point of the procedure. For voting and other social science data, geographic distance is just as arbitrary. Gastner and Newman invented one method to produce cartograms that seems to work very well in practice (paper here). The original software was written in C, but there is a Java version by Frank Hardisty which I used, since it uses shapefiles as input and output. Click here to take a look at maps for the 2004 US election.

Cartogram

Most of the Brazilian west is dramatically shrunk, while the big cities (particularly São Paulo) are blown up several times over. In fact, I find the cartogram particularly helpful in showing the votes in the big metropolitan areas and comparing them to areas in the countryside. Although it is interesting, I wonder if the cartogram is too distorted to be useful, and would be interested in hearing other opinions.

