Statistics and Programming
Last month at Machine Learning there was a discussion about the creation of a machine learning department at Carnegie Mellon University. The discussion of the post was fairly interesting, in particular this pearl by John Langford:
I regard ‘rogramming as the missing member of reading, ‘riting, and ‘rithmetic, and I’ve found a statistical understanding of the world genuinely valuable.
I couldn’t agree more. As social scientists, we need much better training in programming if we are to take part in, and benefit from, the rapid increase in available computing power. The question is how we can properly train graduate students in the social sciences, who tend to start the program with very little knowledge of, or interest in, programming.
It seems to me that the classes and books available badly miss the mark. They either assume a lot of programming experience (e.g., S Programming by Venables and Ripley, and Programming with Data by John Chambers) or focus too much on the statistics side of things without a proper discussion of basic programming concepts.
I am certainly not alone in this assessment. Back in November 2006, Jan de Leeuw made the following suggestion on the StatCompute mailing list:
This is the title of a series of free computer programming
textbooks, started by Allen Downey. There are versions
for C++, Java, Logo, and Python.
…
Since the LaTeX for these books is also freely available,
it may not be too hard (and possibly quite useful) to
make an R/S version. What seems to be common practice is to edit
the original LaTeX and then add your name to the author list.
But should (most) social scientists learn how to program? This was the topic of an extended discussion at PolMeth. The main argument against it was that we should leave programming to the “pros”, i.e., the programmers at Stata/SAS/R, and that we perhaps shouldn’t trust computer code written by us plain social scientists. My own take is that a large chunk of data analysis is simply indistinguishable from programming. The problem is that it is currently mostly done in a non-reproducible, error-prone and downright ugly way. Even the most basic understanding of flow control, loops and data structures would be a quantum leap for most current statistical practice, at least in political science.
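To make “flow control, loops and data structures” concrete, here is a minimal sketch in Python, one of the languages the textbook series mentioned above covers. Everything in it is an assumption made for illustration: the records are made up, and `group_means` and the 65-point cutoff are hypothetical, not taken from any package or dataset. It only shows how a dictionary, a loop and an if/else can replace a group summary done by hand.

```python
from collections import defaultdict

# Hypothetical records: (country, turnout in percent). Made-up values.
records = [
    ("A", 61.0),
    ("B", 74.5),
    ("A", 65.0),
    ("B", 70.5),
    ("C", 58.0),
]


def group_means(rows):
    """Return the mean value for each group in a list of (group, value) pairs."""
    totals = defaultdict(float)  # data structure: running sum per group
    counts = defaultdict(int)    # data structure: number of rows per group
    for group, value in rows:    # loop: one pass over the data
        totals[group] += value
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


means = group_means(records)
for group in sorted(means):
    # Flow control: label each group against an arbitrary, illustrative cutoff.
    if means[group] >= 65.0:
        label = "high"
    else:
        label = "low"
    print(f"{group}: mean turnout {means[group]:.1f} ({label})")
```

Because the whole analysis lives in one short script, adding a fourth country or changing the cutoff means editing a line and rerunning it, which is the reproducibility the paragraph above is asking for.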
Therefore, something should be done to close the gap. Luckily, this is not only a problem for us lowly social scientists. Allen Downey, the author of “How to Think Like a Computer Scientist”, is currently finishing up a book, “Physical Modeling in MATLAB”, which appears to have the same basic idea, although it focuses on a “real” science and a different language. It is a “free book” covered by the GNU Free Documentation License, so it is possible to adapt it (or even combine it with parts of the “How to Think…” series) to make it more applicable to social scientists.