Archive for the 'Computing' Category
Geocoding
geopy is a geocoding toolbox for Python. See the website for installation instructions. It uses third-party geocoders (such as Google Maps) so you can add geographic coordinates to the addresses in your application. I cooked up my first Python script to use it. You give it a CSV file with addresses and it returns a CSV file with the addresses plus latitude and longitude. It might be useful to someone out there.
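import csv

from geopy import geocoders

g = geocoders.Google('your-google-maps-api-here')

# Binary mode for the output file, as the Python 2 csv module expects.
writer = csv.writer(open("out.csv", "wb"))
writer.writerow(("endereco", "cidade", "estado", "pais", "latitude", "longitude"))
reader = csv.reader(open("endereco.csv"))
for row in reader:
    # Join address, city, state and country into a single query string.
    now = row[0] + "," + row[1] + "," + row[2] + "," + row[3]
    try:
        place, (lat, lng) = g.geocode(now)
    except Exception:
        # Lookup failed: record missing values instead of stopping.
        place, (lat, lng) = "NA", ("NA", "NA")
    writer.writerow((row[0], row[1], row[2], row[3], lat, lng))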
QuickR
There is a new (I think) website for learning R that looks pretty decent: Quick-R (http://www.statmethods.net/). It was created by Robert Kabacoff, whom I had the pleasure to meet several months ago. We discussed R briefly at that time and he was just getting into it. Apparently he has been busy! The intended audience is users of SAS/SPSS/Stata transitioning to R. If that fits your bill, go ahead and take a look. If it doesn’t and you are already an experienced user, I am sure there are more than a couple of people you can point the website to.
Another Stata rant
So, I am using my macbook and suddenly it becomes really hot, and the fan starts at full speed. Perhaps I am encoding some music or video? Or am I doing some fancy statistical analysis?

Not really. It is Stata waiting for me to press a key! WTF!! It has been like this for as many versions as I can remember.
Regression plots
I am writing a paper with a coauthor promoting the use of graphs instead of tables in political science. We did some research on the current use of tables and graphs and found out that a substantial proportion of the tables is devoted to the display of regression results.
So we thought that creating graphs to display regression tables was essential to our task. That is, to turn this:

into a nice graph.
We are currently revising the paper for (re)submission, and are still undecided on how to display such graphs. Here are some of the several revisions, starting with one of the first (back in November):

This one is from later in the same month:

And the two I am currently looking at, which take out the boxes around the plots. This one is minimalist:

And the next one has the x-axis repeated in each plot:

My coauthor thinks the boxes are necessary; I cite Tufte over and over and say they aren’t. I think we will have to arrange an intercontinental boxing match to settle the issue.
Stata and accents (diacriticals)
Twice already I’ve emailed Stata support to complain about character encoding issues in Stata. In a nutshell, the problem is that if a dataset has diacriticals and was created on a Windows machine, the characters will be mangled on a Mac, and vice versa. I assume there are similar problems on Linux. This post is fairly long; keep reading if you have similar problems.
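To see what this kind of mangling looks like, here is a minimal Python sketch. It assumes the Windows side stores text as cp1252 and the Mac side reads the same bytes as Mac Roman; the post does not say which encodings Stata actually uses, so treat both codec names and the sample string as assumptions chosen for illustration.

```python
# Assumption: Windows writes cp1252, the Mac reads Mac Roman.
# The sample string is made up for illustration.
original = "São Paulo, Ceará"

# Bytes as a Windows program would write them to disk.
raw_bytes = original.encode("cp1252")

# The same bytes read with the Mac's legacy encoding: accents get mangled.
mangled = raw_bytes.decode("mac_roman")

# Reverse the wrong decoding, then apply the right one.
repaired = mangled.encode("mac_roman").decode("cp1252")

print(mangled)
print(repaired)
```

The unaccented letters survive because both encodings agree on plain ASCII; only the bytes above 127, where the accented characters live, are read differently.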
Approximate matching in R
One frequent problem when working with data from multiple sources is how to match names that, at best, are approximately equal. Typical examples include matching country names from multiple data sets and matching political candidate names from electoral and legislative sources.
For country names, it is not a big deal, since the datasets at most consist of two hundred or so names, but even then the task can be boring and prone to error. For electoral candidates the problem escalates quickly. In Brazilian politics, for example, one has to match the 513 elected legislators in the lower chamber to the 5000+ candidates. What to do then? Hire monkeys undergraduate assistants? Outsource to India?
A better (and cheaper) way is to use the computer to do the grunt work for you. Again, R comes to the rescue, this time with the agrep function (authored by David Meyer, based on C code by Jarkko Hietaniemi, with modifications by Kurt Hornik).
Agrep by itself doesn’t help much, but help is only some useful little wrappers away. One function, agrep.match [link], does the following: a) sets names to lower case and kills multiple white spaces; b) given these transformations, matches exactly; c) with what hasn’t matched, matches approximately with a decreasing threshold of “aproximateness” (is that even a word? it is like so in the R help file); d) returns the indexes with matched and unmatched names and the corresponding thresholds used.
See example of usage in this pdf document.
Note that both links are to my current and constantly updated development version. It might break or already be broken. No warranties are made, yada yada yada. The development is in its early stages, but it seems to work. If anyone has comments, suggestions or bug reports, drop me a line (e.leoni AT gmail DOT com).
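For readers who want to see the logic of steps a) to d) without R, here is a rough Python sketch of the same idea. It is not the agrep.match code: it swaps in difflib's similarity ratio for agrep's edit distance, and the function names, the cutoff values and the sample names are all assumptions made for illustration.

```python
import difflib
import re


def normalize(name):
    """Step a: lower-case the name and collapse runs of white space."""
    return re.sub(r"\s+", " ", name.strip().lower())


def approx_match(names, targets, cutoffs=(0.9, 0.8, 0.7)):
    """Match names to targets, exactly first and then approximately.

    Returns (matches, unmatched). matches maps an index in names to a
    pair (index in targets, cutoff used), with None as the cutoff for
    exact matches. unmatched lists the indexes that never matched.
    """
    norm_targets = [normalize(t) for t in targets]
    matches = {}
    unmatched = list(range(len(names)))

    # Step b: exact matching on the normalized names.
    for i in list(unmatched):
        key = normalize(names[i])
        if key in norm_targets:
            matches[i] = (norm_targets.index(key), None)
            unmatched.remove(i)

    # Step c: approximate matching, loosening the cutoff on each pass.
    for cutoff in cutoffs:
        for i in list(unmatched):
            close = difflib.get_close_matches(
                normalize(names[i]), norm_targets, n=1, cutoff=cutoff
            )
            if close:
                matches[i] = (norm_targets.index(close[0]), cutoff)
                unmatched.remove(i)

    # Step d: report what matched, with which cutoff, and what did not.
    return matches, unmatched


# Made-up example names, including a typo and a name with no counterpart.
names = ["Brazil", "United  States", "Untied Kingdom", "Atlantis"]
targets = ["united states", "United Kingdom", "brazil", "France"]
matches, unmatched = approx_match(names, targets)
print(matches)
print(unmatched)
```

Running the strictest cutoff first means that a name is always paired using the tightest tolerance that works, and the recorded cutoff tells you which matches deserve a manual check.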
Stata’s outreg equivalent for R
[Update 4/9/2008 -- A comment below suggests using the R package memisc, function mtable. It seems to be a useful package overall (much more stuff than just tables), and the tables it generates look pretty good. I didn't know about it, thanks for pointing it out.]
A nice feature of Stata is the large number of ado files that help create tables of coefficient estimates that you can cut and paste into your (yuck!) Word document or, much more elegantly, produce LaTeX code for a table that you can include in your LaTeX document. R has a somewhat similar feature with the package xtable, but it currently lacks the ability to produce a single table in which the columns hold the results from different models or specifications.
Some time ago Ajay Narottam Shah posted some code to the R-help list to do just that. I took just part of it and tweaked it a bit. You give it a matrix of coefficients and a matrix of standard errors and it produces the LaTeX code.
Here it is (link):
latex.table <- function(coef.mat, se.mat, digits = 3, table.command = TRUE) {
  nc <- ncol(coef.mat)
  ## round to the requested number of digits
  coef.mat <- round(coef.mat, digits)
  se.mat <- round(se.mat, digits)
  text.now <- NULL
  if (table.command) {
    text.now <- c(text.now, "\\begin{table}\n")
    text.now <- c(text.now, "\\centering\n")
  }
  text.now <- c(text.now, "\\begin{tabular}[R]{", rep("c", nc + 1), "}\n")
  text.now <- c(text.now, "\\hline\n")
  ## header row with the column (model) names
  for (j in 1:ncol(coef.mat)) {
    text.now <- c(text.now, " & ", colnames(coef.mat)[j])
  }
  text.now <- c(text.now, "\\\\\n\\hline\n")
  for (i in 1:nrow(coef.mat)) {
    ## print coef estimates
    text.now <- c(text.now, rownames(coef.mat)[i])
    for (j in 1:ncol(coef.mat)) {
      if (is.na(coef.mat[i, j])) {
        text.now <- c(text.now, " & ")
      } else {
        text.now <- c(text.now, " & ", coef.mat[i, j])
      }
    }
    text.now <- c(text.now, "\\\\\n")
    ## print SEs
    for (j in 1:ncol(coef.mat)) {
      if (is.na(se.mat[i, j])) {
        text.now <- c(text.now, " & ")
      } else {
        text.now <- c(text.now, " & ", sprintf("(%s)", se.mat[i, j]))
      }
    }
    text.now <- c(text.now, "\\\\[1mm]\n")
  }
  text.now <- c(text.now, "\\\\\n")
  text.now <- c(text.now, "\\hline")
  text.now <- c(text.now, "\n")
  text.now <- c(text.now, "\\end{tabular}\n")
  if (table.command) text.now <- c(text.now, "\\end{table}\n")
  paste(text.now, collapse = "")
}
So, for
tmp.estimates
            cluster jack-knife   lmer lmerMcmc edvreg
iquality     -0.503     -0.503 -0.414   -0.413 -0.454
iqualityrep   0.030      0.030  0.019    0.019  0.019
gdppc         0.022      0.022  0.024    0.023  0.047
and
tmp.se
  cluster jack-knife   lmer lmerMcmc edvreg
1  0.1230     0.1623 0.1593    0.213 0.1905
2  0.0056     0.0065 0.0092    0.012 0.0092
3  0.0149     0.0280 0.0327    0.042 0.0634
latex.table(tmp.estimates,tmp.se,table.command=TRUE)
"\\begin{table}\n\\centering\n\\begin{tabular}[R]{cccccc}\n\\hline\n & cluster & jack-knife & lmer & lmerMcmc & edvreg\\\\\n\\hline\niquality & -0.503 & -0.503 & -0.414 & -0.413 & -0.454\\\\\n & (0.123) & (0.162) & (0.159) & (0.213) & (0.19)\\\\[1mm]\niqualityrep & 0.03 & 0.03 & 0.019 & 0.019 & 0.019\\\\\n & (0.006) & (0.006) & (0.009) & (0.012) & (0.009)\\\\[1mm]\ngdppc & 0.022 & 0.022 & 0.024 & 0.023 & 0.047\\\\\n & (0.015) & (0.028) & (0.033) & (0.042) & (0.063)\\\\[1mm]\n\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n"
which you can dump to a file as:
cat(latex.table(tmp.estimates,tmp.se,table.command=TRUE),file="table.tex")
producing
Nifty!
R in OS X - Making quartz device work from terminal or Emacs
Buggy, but it works! I was getting really tired of X11 in OS X…
You need the Apple developer tools installed (they come on the Tiger DVD, or can be downloaded from Apple Developer Connection).
install.packages("CarbonEL",,'http://rforge.net/',type='source')
library(CarbonEL)
and then
quartz()
opens a graphic window in OS X.
Data Visualization for the Masses
There are a few websites/startup companies trying to fly the idea of being a repository and visualization engine for data that anyone can upload. Swivel, for example, made some splash in the past few months as the “YouTube for data”.
I experimented with Swivel and a few others. The main problem with all of them is the lack of tools allowing conditioning, at least in an easy way. Conditioning is important for constructing small multiple plots, or even for plotting groups in different colors in a scatter plot. For example, take the ideal point data I uploaded:
In order to plot different parties in different colors I would have (as far as I know) to upload a different dataset for each party! It goes without saying that this is unnecessarily burdensome.
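Conditioning itself is a small operation, which is what makes its absence frustrating. As a minimal Python sketch, assuming records of the form (party, first dimension, second dimension): the party labels, the numbers and the color names below are all invented for illustration and are not the ideal point data from this post.

```python
from collections import defaultdict

# Hypothetical records: (party, dimension 1, dimension 2).
# All values are made up; they are not the uploaded ideal point data.
records = [
    ("A", -0.8, 0.1),
    ("B", 0.4, -0.2),
    ("A", -0.6, 0.3),
    ("C", 0.9, 0.5),
    ("B", 0.5, 0.0),
]


def condition_on(rows, key_index=0):
    """Split rows into groups keyed by the conditioning variable."""
    groups = defaultdict(list)
    for row in rows:
        # Keep every field except the one used as the grouping key.
        groups[row[key_index]].append(row[:key_index] + row[key_index + 1:])
    return dict(groups)


by_party = condition_on(records)

# One color per group, cycling through a small palette.
palette = ["red", "blue", "green"]
colors = {
    party: palette[i % len(palette)]
    for i, party in enumerate(sorted(by_party))
}

for party in sorted(by_party):
    print(party, colors[party], by_party[party])
```

Each group could then be handed to a plotting call with its own color, or drawn in its own panel for a small multiples display, all from a single uploaded dataset.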
Many Eyes is also very impressive, with more advanced visualization plots. It is Java-based, and unfortunately does not play so nicely with Firefox on the Mac. I also had problems with the ideal points dataset there. It doesn’t allow one to create a scatter plot with only two variables (!!!), requesting a third to be displayed as the size of the symbols.
The focus on bar charts on both platforms is also annoying… dotplots and boxplots would be nice.
There is another site, data360, but I didn’t have much luck with it. It is more “professional”, allowing one to pull data directly from the web automatically. But its focus on time series data makes it simply unusable for the example I tried.
Of course, all three are just beginning and we might be hearing a lot about them soon. And the graphs they make are not half bad, especially once you consider Rich Chart Live. This one is ugly! And I mean the kind of ugly that makes you miss Excel:

Statistics and Programming
Last month at Machine Learning there was a discussion about the creation of a machine learning department at Carnegie Mellon University. The discussion on the post was fairly interesting, in particular this pearl by John Langford:
I regard ‘rogramming as the missing member of reading, ‘riting, and ‘rithmetic, and I’ve found a statistical understanding of the world genuinely valuable.
I couldn’t agree more. As social scientists, we need much better training in programming if we are to partake in and benefit from the rapid increase in available computing power. The question is, how can we properly train graduate students in the social sciences, who tend to start the program with very little knowledge of or interest in programming?
It seems to me that the classes and books available badly miss the mark. They either assume a lot of programming experience (e.g., S Programming by Venables and Ripley and Programming with Data by John Chambers) or focus too much on the statistics side of things without a proper discussion of basic programming concepts.
I am certainly not alone in this assessment. Back in November 2006, Jan de Leeuw made the following suggestion on the StatCompute mailing list:
This is the title of a series of free computer programming
textbooks, started by Alan Downey. There are versions
for C++, Java, Logo, and Python.
…
Since the LaTeX for these books is also freely available,
it may not be too hard (and possibly quite useful) to
make an R/S version. What seems to be common practice is to edit
the original LaTeX and then add your name to the author list.
But should (most) social scientists learn how to program? This was the topic of an extended discussion at PolMeth. The main argument against it was that we should leave programming to the “pros”, i.e. the programmers at Stata/SAS/R. We perhaps shouldn’t trust computer code written by us plain social scientists. My own take is that a large chunk of data analysis is simply indistinguishable from programming. The problem is that it is currently mostly done in a non-reproducible, error-prone and downright ugly way. Even the most basic understanding of flow control, loops and data structures will be a quantum leap for most of the current statistical practice, at least in political science.
Therefore, something should be done to cover the gap. Luckily this is not only a problem for us lowly social scientists. Allen Downey, the author of “How to Think Like a Computer Scientist”, is currently finishing up a book on “Physical Modeling in MATLAB” which appears to have the same basic idea, although focused on a “real” science and a different language. It is a “free book” covered by the GNU Free Documentation License, so it is possible to adapt it (or even combine it with part of the “How to Think…” series) in order to make it more applicable to social scientists.

