cluelessresearch.com

political methodology, brazilian politics, etc.

Archive for the 'Statistics' Category

QuickR

There is a new (I think) website for learning R that looks pretty decent: Quick-R (http://www.statmethods.net/). It was created by Robert Kabacoff, whom I had the pleasure to meet several months ago. We discussed R briefly at that time and he was just getting into it. Apparently he has been busy! The intended audience is users of SAS/SPSS/Stata transitioning to R. If that fits your bill, go ahead and take a look. If it doesn’t and you are already an experienced user, I am sure there are more than a couple of people you can point the website to.


Breaking news: A college degree in Brazil is worth 4 fm radios more than high school!

ABEP, the Brazilian association of market research firms, has just approved the new "Brazil criterion", or CCEB. The CCEB is a standardized way to measure survey respondents’ consumption power by asking about the consumption items they own and the education of the head of the household. They argue this is better than just asking for an individual’s household income, particularly in countries with high inflation or black market economies.

The new CCEB was designed by regressing household income on a set of items and then trimming the list using qualitative criteria. For example, computers are excluded given the rapid increase in computer ownership currently taking place. We don’t want items that are subject to large changes in a short period of time, since consumption power itself does not, in general, move very fast.

My first qualm with it is (you guessed it) methodological in nature. You see, the regression they estimated as the basis of the index regresses the log of household income on the set of items with no interactions whatsoever! This can’t possibly be the "best" regression they could find! And I am pretty sure it wasn’t. The underlying objective is to give interviewers in the field a way to categorize the "class" of the respondent as a filter (for quotas), and they probably think interviewers in Brazil know how to add but not how to multiply. ABEP criticizes the Mexican index, which uses a classification tree (and so neatly allows interactions), as too prone to interviewer error. I would like to see the study showing this.

So, what does the CCEB look like? Based on the regression they created a point system. Thus, if you have one color TV you get one point and if you have four you get four points. In addition, you get extra points for the education of the head of the household.
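To make the arithmetic concrete, here is a minimal sketch of an additive point system of this kind. The point values below are made up for illustration (chosen only so that the headline arithmetic works out); they are not ABEP's actual table, and `cceb.score` is a hypothetical helper.

```r
# Illustrative point values only: assumptions, not ABEP's published table.
item.points <- c(color.tv = 1, radio = 1)        # points per unit owned
educ.points <- c(high.school = 4, college = 8)   # points for head of household

# Additive score: sum of (points per item * count) plus education points.
# There are no interaction terms, which is exactly the complaint above.
cceb.score <- function(items, education) {
  sum(item.points[names(items)] * items) + educ.points[[education]]
}

hs        <- cceb.score(c(color.tv = 1, radio = 0), "high.school")
college   <- cceb.score(c(color.tv = 1, radio = 0), "college")
hs.radios <- cceb.score(c(color.tv = 1, radio = 4), "high.school")

college - hs          # gap between college and high school, in points
hs.radios == college  # four extra radios close that gap
```

With any purely additive score, a household can trade one item for another at a fixed rate, which is why the radios make up for the degree.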

Now excuse me, for I have to go to the store to buy some cheap radios…


Another Stata rant

So, I am using my MacBook and suddenly it becomes really hot, and the fan starts running at full speed. Perhaps I am encoding some music or video? Or am I doing some fancy statistical analysis?

Not really. It is Stata waiting for me to press a key! WTF!! It has been like this for as many versions as I can remember.


Stata and accents (diacriticals)

Twice already I’ve emailed Stata support to complain about character encoding issues in Stata. In a nutshell, the problem is that if a dataset has diacriticals and was created on a Windows machine, the characters will be mangled on a Mac, and vice versa. I assume there are similar problems on Linux. This post is fairly long; keep reading if you have similar problems.

Read more


Conditioning

Brian Mulloy, one of the founders of Swivel, wrote a nice comment on my post explaining how Swivel is in fact able to condition on data categories. For example, if you want to do a graph highlighting a particular category, or even using only data from a particular category, you are able to. The process has to start in the dataset view.

Dmitry Dimov calory and Dmitry Dimov costs

See his comment for the full explanation.

I guess I have to spend more time on it, but it still doesn’t seem to be able to do what I want. I downloaded my own data in CSV format and created a couple of figures using the ggplot package in R. I don’t expect Swivel to have the same flexibility, since its objectives are very different from those of academic statistical software. However, I don’t see why, in the not so distant future, something like this would not be possible in a web application.

color by party

color by party, one plot per state

The code:

Read more
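The original code sits behind the link above. As a rough sketch of the kind of conditioning described here, assuming the ggplot2 package (the successor to ggplot) and a made-up data frame `ideal` whose column names (`dim1`, `dim2`, `party`, `state`) and simulated values are purely illustrative:

```r
library(ggplot2)

# Simulated stand-in for the ideal point data; values are random noise.
set.seed(1)
ideal <- data.frame(
  dim1  = rnorm(90),
  dim2  = rnorm(90),
  party = rep(c("PT", "PSDB", "PMDB"), times = 30),
  state = rep(c("SP", "RJ", "MG"), each = 30)
)

# Condition on party through the colour aesthetic.
p.color <- ggplot(ideal, aes(x = dim1, y = dim2, colour = party)) +
  geom_point() +
  labs(x = "1st Dimension", y = "2nd Dimension")

# Same plot, one panel per state (small multiples).
p.facet <- p.color + facet_wrap(~ state)
```

Calling `print(p.color)` or `print(p.facet)` draws the figures. The point is that conditioning is a one-line change to the plot specification, with no need to split the dataset by hand.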


Stata’s outreg equivalent for R

[Update 4/9/2008 -- A comment below suggests using the R package memisc, function mtable. It seems to be a useful package overall (much more stuff than just tables), and the tables it generates look pretty good. I didn't know about it, thanks for pointing it out.]

A nice feature of Stata is the large number of ado files that help create tables of coefficient estimates, which you can cut and paste into your (yuck!) Word document or, much more elegantly, turn into LaTeX code for a table that you can include in your LaTeX document. R has a somewhat similar feature in the package xtable, but it currently lacks the ability to produce a single table in which the columns hold the results from different models or specifications.

Some time ago Ajay Narottam Shah posted some code to the R-help list to do just that. I took just part of it and tweaked it a bit. You give it a matrix of coefficients and a matrix of standard errors and it produces the LaTeX code.

Here it is (link):

latex.table <- function(coef.mat, se.mat, digits = 3, table.command = TRUE) {
  nc <- ncol(coef.mat)
  ## round both matrices to the requested number of digits
  coef.mat <- round(coef.mat, digits)
  se.mat <- round(se.mat, digits)
  text.now <- NULL
  if (table.command) {
    text.now <- c(text.now, "\\begin{table}\n")
    text.now <- c(text.now, "\\centering\n")
  }
  ## one centered column per model, plus one for the row labels
  text.now <- c(text.now, "\\begin{tabular}[R]{", rep("c", nc + 1), "}\n")
  text.now <- c(text.now, "\\hline\n")
  ## header row: model names
  for (j in seq_len(nc)) {
    text.now <- c(text.now, " & ", colnames(coef.mat)[j])
  }
  text.now <- c(text.now, "\\\\\n\\hline\n")
  for (i in seq_len(nrow(coef.mat))) {
    ## print coef estimates (blank cell if NA)
    text.now <- c(text.now, rownames(coef.mat)[i])
    for (j in seq_len(nc)) {
      if (is.na(coef.mat[i, j])) {
        text.now <- c(text.now, " & ")
      } else {
        text.now <- c(text.now, " & ", coef.mat[i, j])
      }
    }
    text.now <- c(text.now, "\\\\\n")
    ## print SEs in parentheses on the next row (blank cell if NA)
    for (j in seq_len(nc)) {
      if (is.na(se.mat[i, j])) {
        text.now <- c(text.now, " & ")
      } else {
        text.now <- c(text.now, " & ", sprintf("(%s)", se.mat[i, j]))
      }
    }
    text.now <- c(text.now, "\\\\[1mm]\n")
  }
  text.now <- c(text.now, "\\\\\n")
  text.now <- c(text.now, "\\hline")
  text.now <- c(text.now, "\n")
  text.now <- c(text.now, "\\end{tabular}\n")
  if (table.command) text.now <- c(text.now, "\\end{table}\n")
  ## return the whole table as a single string
  paste(text.now, collapse = "")
}

So, for

tmp.estimates
            cluster jack-knife   lmer lmerMcmc edvreg
iquality     -0.503     -0.503 -0.414   -0.413 -0.454
iqualityrep   0.030      0.030  0.019    0.019  0.019
gdppc         0.022      0.022  0.024    0.023  0.047

and

tmp.se
  cluster jack-knife   lmer lmerMcmc edvreg
1  0.1230     0.1623 0.1593    0.213 0.1905
2  0.0056     0.0065 0.0092    0.012 0.0092
3  0.0149     0.0280 0.0327    0.042 0.0634

latex.table(tmp.estimates,tmp.se,table.command=TRUE)

"\\begin{table}\n\\centering\n\\begin{tabular}[R]{cccccc}\n\\hline\n & cluster & jack-knife & lmer & lmerMcmc & edvreg\\\\\n\\hline\niquality & -0.503 & -0.503 & -0.414 & -0.413 & -0.454\\\\\n & (0.123) & (0.162) & (0.159) & (0.213) & (0.19)\\\\[1mm]\niqualityrep & 0.03 & 0.03 & 0.019 & 0.019 & 0.019\\\\\n & (0.006) & (0.006) & (0.009) & (0.012) & (0.009)\\\\[1mm]\ngdppc & 0.022 & 0.022 & 0.024 & 0.023 & 0.047\\\\\n & (0.015) & (0.028) & (0.033) & (0.042) & (0.063)\\\\[1mm]\n\\\\\n\\hline\n\\end{tabular}\n\\end{table}\n"

which you can dump to a file as:


cat(latex.table(tmp.estimates,tmp.se,table.command=TRUE),file="table.tex")

producing

table

Nifty!
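If you start from fitted models rather than ready-made matrices, the two inputs can be assembled along these lines. This is only a sketch: the models are arbitrary `lm` fits on the built-in `mtcars` data, and `collect` is a hypothetical helper, not part of the code posted to R-help.

```r
# Hypothetical example: two nested linear models on the built-in mtcars data.
fits <- list(m1 = lm(mpg ~ wt, data = mtcars),
             m2 = lm(mpg ~ wt + hp, data = mtcars))

# Union of coefficient names across models, in order of first appearance.
terms.all <- unique(unlist(lapply(fits, function(f) names(coef(f)))))

# Build a terms-by-models matrix, leaving NA where a model lacks a term.
collect <- function(extract) {
  out <- matrix(NA_real_, nrow = length(terms.all), ncol = length(fits),
                dimnames = list(terms.all, names(fits)))
  for (m in names(fits)) {
    v <- extract(fits[[m]])
    out[names(v), m] <- v
  }
  out
}

coef.mat <- collect(coef)
se.mat   <- collect(function(f) sqrt(diag(vcov(f))))
```

The resulting `coef.mat` and `se.mat` have one column per model and can be passed straight to `latex.table()`; terms missing from a model come out as empty cells because the function skips `NA` entries.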


R in OS X - Making quartz device work from terminal or Emacs

Buggy, but it works! I was getting really tired of X11 in OS X…

You need the Apple developer tools installed (they come on the Tiger DVD, or can be downloaded from Apple Developer Connection).

install.packages("CarbonEL",,'http://rforge.net/',type='source')
library(CarbonEL)

and then

quartz()

opens a graphic window in OS X.


Data Visualization for the Masses

There are a few websites/startup companies trying to fly the idea of being a repository and visualization engine for data that anyone can upload. Swivel, for example, made some splash in the past few months as the “YouTube for data”.

I experimented with Swivel and a few others. The main problem with all of them is the lack of tools allowing conditioning, at least in an easy way. Conditioning is important for constructing small multiple plots, or even for plotting groups in different colors in a scatter plot. For example, take the ideal point data I uploaded:

2nd Dimension by 1st Dimension

In order to plot different parties in different colors I would have (as far as I know) to upload a different dataset for each party! It goes without saying that this is unnecessarily burdensome.

Many Eyes is also very impressive, with more advanced visualization plots. It is Java-based, and unfortunately does not play so nicely with Firefox on the Mac. I also had problems with the ideal points dataset there. It doesn’t allow one to create a scatter plot with only two variables (!!!), requiring a third to be displayed as the size of the symbols.




The focus on bar charts on both platforms is also annoying… dotplots and boxplots would be nice.

There is another site, data360, but I didn’t have much luck with it. It is more “professional”, allowing one to pull data directly from the web automatically. But its focus on time series data makes it simply unusable for the example I tried.

Of course, all three are just beginning and we might be hearing a lot about them soon. And the graphs they make are not half bad. Not until you consider rich chart live, that is. This one is ugly! And I mean, it makes you miss Excel kind of ugly:


Statistics and Programming

Last month at Machine Learning there was a discussion about the creation of a machine learning department at Carnegie Mellon University. The discussion on the post was fairly interesting, in particular this pearl by John Langford:

I regard ‘rogramming as the missing member of reading, ‘riting, and ‘rithmetic, and I’ve found a statistical understanding of the world genuinely valuable.

I couldn’t agree more. As social scientists, we need much better training in programming if we are to partake in and benefit from the rapid increase in available computing power. The question is, how can we properly train graduate students in the social sciences, who tend to start the program with very little knowledge of or interest in programming?

It seems to me that the classes and books available badly miss the mark. They either assume a lot of programming experience (e.g., S Programming by Venables and Ripley and Programming with Data by John Chambers) or focus too much on the statistics side of things without a proper discussion of basic programming concepts.

I am certainly not alone in this assessment. Back in November 2006, Jan de Leeuw made the following suggestion on the StatCompute mailing list:

This is the title of a series of free computer programming
textbooks, started by Alan Downey. There are versions
for C++, Java, Logo, and Python.

Since the LaTeX for these books is also freely available,
it may not be too hard (and possibly quite useful) to
make an R/S version. What seems to be common practice is to edit
the original LaTeX and then add your name to the author list.

But should (most) social scientists learn how to program? This was the topic of an extended discussion at PolMeth. The main argument against it was that we should leave programming to the “pros”, i.e. the programmers at Stata/SAS/R. We perhaps shouldn’t trust computer code written by us plain social scientists. My own take is that a large chunk of data analysis is simply indistinguishable from programming. The problem is that it is currently mostly done in a non-reproducible, error-prone and downright ugly way. Even the most basic understanding of flow control, loops and data structures would be a quantum leap for most current statistical practice, at least in political science.

Therefore, something should be done to close the gap. Luckily this is not only a problem for us lowly social scientists. Allen Downey, the author of “How to Think Like a Computer Scientist”, is currently finishing up a book on “Physical Modeling in MATLAB” which appears to have the same basic idea, although focused on a “real” science and a different language. It is a “free book” covered by the GNU Free Documentation License, so it is possible to adapt it (or even combine it with part of the “How to Think…” series) in order to make it more applicable to social scientists.


Bivariate Probit with Endogenous Dummy

The basic model is as follows:

[tex] y_{1i}^{*}=\alpha z_i+\eta_{1i} [/tex]

[tex] y_{2i}^{*}=\beta x_i+\gamma y_{1i}+\eta_{2i} [/tex]

Where [tex]y_{1i}^{*}[/tex] and [tex] y_{2i}^{*} [/tex] are latent variables. Let [tex]1(m)[/tex] be an indicator function which equals one if [tex] m>0 [/tex] and zero otherwise. Let [tex]y_{1i}=1(y_{1i}^{*})[/tex] and [tex]y_{2i}=1(y_{2i}^{*})[/tex].
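To fix ideas, the data generating process can be simulated in a few lines of R. This is just a sketch: the parameter values, the sample size and the single covariates `z` and `x` are arbitrary assumptions, and the errors are drawn as bivariate standard normal with correlation `rho`.

```r
set.seed(42)
n     <- 5000
# Hypothetical parameter values, chosen only for illustration.
alpha <- 1.0; beta <- 0.5; gamma <- 0.8; rho <- 0.5

z <- rnorm(n)
x <- rnorm(n)

# Standard normal errors with correlation rho.
eta1 <- rnorm(n)
eta2 <- rho * eta1 + sqrt(1 - rho^2) * rnorm(n)

# First equation: latent variable and observed dummy.
y1.star <- alpha * z + eta1
y1      <- as.integer(y1.star > 0)

# Second equation: the observed dummy y1 (not y1.star) enters.
y2.star <- beta * x + gamma * y1 + eta2
y2      <- as.integer(y2.star > 0)

table(y1, y2)
```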

Note that it is [tex] y_{1i}[/tex] and not [tex] y_{1i}^{*} [/tex] which enters the equation for [tex] y_{2i}[/tex]. It turns out that one can estimate this model using a bivariate probit and ignoring the simultaneity (Greene 1998). I.e. it can be estimated using maximum likelihood by using [tex] y_{1i}[/tex] (not [tex] \hat y_{1i}^{*}[/tex], not even [tex] 1(\hat y_{1i}^{*})[/tex]) as an independent variable in the equation for [tex] y_{2i}^{*}[/tex].

A bivariate probit assumes that [tex] \eta_{\cdot i}[/tex] follow a bivariate normal distribution with mean zero, variance one and covariance [tex]\rho[/tex]. It is estimated by full information maximum likelihood (e.g., biprobit in Stata).
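For reference, the log-likelihood being maximized can be written compactly as follows. This is the standard bivariate probit form applied to the model above; the sign variables [tex]q_{1i}[/tex] and [tex]q_{2i}[/tex] are notation introduced here, not part of the original setup.

```latex
% q_{1i} = 2 y_{1i} - 1 and q_{2i} = 2 y_{2i} - 1 recode the outcomes to -1/+1.
% \Phi_2(a, b; r) is the standard bivariate normal CDF with correlation r.
\log L(\alpha, \beta, \gamma, \rho)
  = \sum_{i} \log \Phi_2\!\left(
      q_{1i}\,\alpha z_i,\;
      q_{2i}\,(\beta x_i + \gamma y_{1i});\;
      q_{1i} q_{2i}\,\rho
    \right)
```

Note that the observed [tex]y_{1i}[/tex] appears directly in the second argument, exactly as it would if it were an ordinary exogenous regressor.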

Please note that this is not a general finding. It just turns out that in this particular model one can safely ignore the simultaneity and straightforwardly estimate the bivariate model.

References

Greene, W. (1998). “Gender Economics Courses in Liberal Arts Colleges: Further Results.” Journal of Economic Education, Fall 1998.

Neal, D. (1997). “The Effects of Catholic Secondary Schooling on Educational Achievement.” Journal of Labor Economics.

Greene, W. (2001). Econometric Analysis.

Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press.

