Archive for May, 2007
Stata and accents (diacriticals)
Twice already I’ve emailed Stata support to complain about character encoding issues in Stata. In a nutshell, the problem is that if a dataset has diacriticals and was created on a Windows machine, the characters will be mangled on a Mac, and vice versa. I assume there are similar problems on Linux. This post is fairly long; keep reading if you have similar problems.
The move
So, I decided to spend some time back in my home country for a variety of reasons, and between the move and just watching the scenery, I haven’t been able to find time to post.
I had to buy a new computer to take with me and left behind my desktop PC, which lately has not been much more than a glorified jukebox. Moving the iTunes library from the PC to my shiny new Mac was not very painful, but the songs with diacriticals in the file names (ç˜^ç, etc.) didn’t load correctly on the Mac. So I wrote this evil, evil shell script, which walks down the directory structure renaming all your files and folders, substituting diacriticals with regular ASCII letters.
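#!/bin/sh
# Walk the tree depth-first so that children are renamed before their
# parent directories; "! -name ." skips the starting directory itself.
find . -depth ! -name . | while IFS= read -r S1
do
    # Transliterate only the last path component; each parent directory
    # gets its own turn later in the traversal.
    DIR=$(dirname "$S1")
    BASE=$(basename "$S1" | unaccent)
    S2="$DIR/$BASE"
    if [ "$S1" != "$S2" ]
    then
        echo "renaming $S1 to $S2"
        ##mv "$S1" "$S2"
    fi
done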
If you want the actual renaming to be done, uncomment the line with the command mv "$S1" "$S2". You will also need this rather useful Perl script to remove the diacriticals; it should be named unaccent, be executable, and be on your path (from http://ahinea.com/en/tech/perl-unicode-struggle.html):
#!/usr/bin/perl -w
require Encode;
use Unicode::Normalize;
#$str = ;
#for ( $str ) { # the variable we work on
while (<>) { # the variable we work on
## convert to Unicode first
## if your data comes in Latin-1, then uncomment
## (for UTF-8 input, such as Mac OS X file names, decode with 'UTF-8' instead):
#$_ = Encode::decode( 'iso-8859-1', $_ );
s/\xe4/ae/g; ## treat characters ä ñ ö ü ÿ
s/\xf1/ny/g; ## this was wrong in previous version of this doc
s/\xf6/oe/g;
s/\xfc/ue/g;
s/\xff/yu/g;
$_ = NFD( $_ ); ## decompose (Unicode Normalization Form D)
s/\pM//g; ## strip combining characters
# additional normalizations:
s/\x{00df}/ss/g; ## German eszett “ß” -> “ss”
s/\x{00c6}/AE/g; ## Æ
s/\x{00e6}/ae/g; ## æ
s/\x{0132}/IJ/g; ## Ĳ
s/\x{0133}/ij/g; ## ĳ
s/\x{0152}/Oe/g; ## Œ
s/\x{0153}/oe/g; ## œ
tr/\x{00d0}\x{0110}\x{00f0}\x{0111}\x{0126}\x{0127}/DDddHh/; # ÐĐðđĦħ
tr/\x{0131}\x{0138}\x{013f}\x{0141}\x{0140}\x{0142}/ikLLll/; # ıĸĿŁŀł
tr/\x{014a}\x{0149}\x{014b}\x{00d8}\x{00f8}\x{017f}/NnnOos/; # ŊŉŋØøſ
tr/\x{00de}\x{0166}\x{00fe}\x{0167}/TTtt/; # ÞŦþŧ
s/[^\0-\x7f]//g; ## clear everything else (anything outside ASCII); optional
print $_;
}
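One thing to watch for before uncommenting the mv line in the shell script: two different names can collapse to the same ASCII name (say, a folder that already holds both an accented and an unaccented spelling of the same title), and a plain mv would silently overwrite one of them. Here is a minimal sketch of a guard against that. The helper name safe_rename is my own invention for illustration, not part of the original script, and the existence check followed by mv is not atomic, so treat it as a convenience rather than a guarantee.

```shell
#!/bin/sh
# safe_rename SRC DST: hypothetical helper, not part of the original script.
# Renames SRC to DST only if nothing exists at DST yet; otherwise it reports
# the collision on stderr and returns non-zero without touching either file.
safe_rename() {
    src=$1
    dst=$2
    # -e misses dangling symlinks, so test -L as well.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "skipping $src: $dst already exists" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}
```

To use it, define the function near the top of the renaming script and call safe_rename "$S1" "$S2" where the mv line is. Anything it skips is printed to stderr, so you can sort out the leftovers by hand afterwards.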
My external hard disk with my music is formatted as NTFS, so I needed a way to get write access to it if I wanted to keep my files there. I managed to do it flawlessly following these instructions. MacFUSE (the tool doing the grunt work) appears to be CPU intensive, but for the relatively small music files it worked fine. If you have questions, I would be glad to write up more carefully what I did. Leave a comment (you have to be registered, but anyone can register) or email me at e dot leoni at gmail dot com.