cluelessresearch.com

political methodology, brazilian politics, etc.

Stata and accents (diacriticals)

Twice now I’ve emailed Stata support to complain about character encoding issues in Stata. In a nutshell, the problem is that if a dataset has diacriticals and was created on a Windows machine, the characters will be mangled on a Mac, and vice versa. I assume there are similar problems on Linux. This post is fairly long; keep reading if you have similar problems.

The first unhelpful answer I received was as follows.

Are the fonts the same in each instance of Stata? You can try changing the
font in each window by clicking the menu box (small white box) to the left of
the window title.

If changing the font doesn’t work, please send me an example of your data and
I’ll see if I can find a charset/font that works for the data in Windows.

(June 2005)

Sure, now the fonts have to be the same in Windows and OS X. But wait… Stata supplies the fonts itself! I am sure that’s not it, but even if it were, it would be Stata’s fault!

Again in April 2006:

Unfortunately Stata only supports ASCII fonts. If you use a Unicode font,
then it will not show up correctly in Stata. Since Stata files are binary you
should not have a problem transfering them between operating systems. You
might want to make sure you are using the same font on both our Windows and Mac
machines.

(bangs head against the wall)

I then explained the problem to them in more detail. You see, there are different character encodings, which are mappings between characters and bytes. The problem is that there is no unified standard (at least until Unicode is more widely used). In particular, Windows and Mac systems use slightly different encodings. They agree on the characters covered by ASCII (A-Z, a-z, 0-9 and a few dozen more), but characters with tildes, cedillas and the like get screwed up when saved on one system and viewed on another. Very annoying.
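To make the mismatch concrete, here is a minimal Python sketch (Python is my choice for illustration; it is not part of the Stata workflow). It assumes the two encodings involved are Latin-1 on the Windows side and Mac Roman on the Mac side, and the helper name `reinterpret` is made up for this example. It writes a word as bytes in one encoding and reads the same bytes back in the other.

```python
def reinterpret(text, written_as, read_as):
    """Encode text in one encoding, then decode the same bytes in another.

    This mimics saving a label on one system and viewing it on another.
    """
    return text.encode(written_as).decode(read_as)


word = "coração"

# The accented letters are stored as different bytes in each encoding.
print(word.encode("latin-1"))
print(word.encode("mac_roman"))

# Bytes written as Latin-1 but read as Mac Roman come out mangled.
print(reinterpret(word, "latin-1", "mac_roman"))

# Pure ASCII text is unaffected, which is why the problem is easy to miss.
print(reinterpret("Brasil", "latin-1", "mac_roman"))
```

Only the characters outside ASCII change, so a dataset with English labels would never show the problem.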

The worst part is that Stata uses a binary format, which is very hard to manipulate outside Stata. In the many times they changed the format in an evil Microsoft-like fashion, requiring one to upgrade, they could at least have addressed that. Come on, even Word does not have problems like that! Stata’s unhelpful help desk promised to communicate my problems to the “developers”. I am sure I’d better sit down while waiting for their response. It will likely come in an upgrade with a $100+ (student price) tag. And oh, I won’t be a student anymore. Well, Stata 9 was probably the last version of Stata I will ever buy…

Anyway, in Stata 9 there is a way around this that has been working fine for me. To recap, the objective is to translate the encoding of a dataset together with all its labels, variable names, etc. Thus, simply exporting the data to a text file (e.g. comma separated values) and converting that with other tools won’t work, because you lose all the labels. The solution is to save the dataset as XML, a general purpose markup language that is text based and keeps all the information. Stata 9 can save the data as XML, which is the key ingredient in the solution I came up with.

On a Mac, converting a dataset from Windows (Latin 1) to Mac (Mac Roman … Damn Stata, the encoding in OS X is Unicode by default!!!) is really simple, because OS X provides the tool we need. Let’s say you have a dataset “datawin.dta” that has Windows encoding.

*datawin has windows encoding
use datawin, clear
*save it as xml
xmlsave dataxmlWin, doctype(dta)
*use OS X tool to translate
system iconv -f latin1 -t macroman dataxmlWin.xml > dataxmlMac.xml
*important, clear the data, labels, etc
*from memory
*(i.e. xmluse data,clear is not enough)
clear
*load the translated data
xmluse dataxmlMac.xml, doctype(dta)
*Voilà! You can now save the data in stata's binary format again.
save datamac, replace

Mac users should graciously convert it back to Latin1 for windows folks. They already have enough problems on their hands thanks to Microsoft.
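For the trip back, or on a machine that has no iconv, the translation step can be done with any tool that re-encodes text. Below is a minimal Python sketch of that one step, under the same assumption as above that the two encodings are Latin-1 and Mac Roman. The function name `transcode_file` and the file names in the usage note are hypothetical; the Stata side (xmlsave before, clear and xmluse after) stays exactly as in the do-file above.

```python
from pathlib import Path


def transcode_file(src, dst, from_enc, to_enc):
    """Re-encode a text file from one encoding to another.

    The file is handled as raw bytes, so line endings are left untouched.
    A character with no equivalent in the target encoding raises
    UnicodeEncodeError instead of being silently replaced.
    """
    text = Path(src).read_bytes().decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc))
```

For example, `transcode_file("dataxmlMac.xml", "dataxmlWin.xml", "mac_roman", "latin-1")` would prepare a Mac-encoded XML file for Windows users, and swapping the last two arguments gives the Windows-to-Mac direction that the iconv line handles.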

What I’ve learned:

1) open document formats are great. What saved me was the embrace, however partial, of open standards by Stata Corp. Closed document standards screw people over every day in the Microsoft world.

2) OS X is great, not the least because it comes with lots of unix goodies.

3) more about character encodings than I’d ever thought I would need. And I’ve barely scratched the surface: just think about Chinese!

ps. Incidentally, although R has all the tools to translate encodings and the like, it has no automated way that I know of to handle them. I think encoding should be an attribute of every object, or R should just enforce Unicode across the board.

1 Comment so far

  1. mjm October 3rd, 2007 4:17 pm

    Thanks for this — I just discovered that a stupid data file I acquired in SPSS was encoded right, but not tagged right; and spss doesn’t export to anything helpful. This little trick definitely helps and now this table is in mysql where it belongs (properly utf8-ified even if various DBD layers don’t tag it). Honestly, for most of our data plain utf8 csv or sql would be a heck of a lot more useful than crappy binary formats that people treat as standards.

    I also have transfered files with ntfs3g/fuse btw with no problem… but that’s after the hassle of building both from source, so I know it’s the same libiconv.
