Crushing UTF-8 into ASCII

Sometimes the information you lose by stripping accents really doesn’t matter, especially when Perl won’t recognise ñ as the lower case of Ñ unless you jump through all sorts of locale hoops, even though it’s in Latin-1 and should be easy. This means I can’t just uc() the input to group all the case variations, because uc('peña') gives ‘PEñA’. Accurate case-aware parsers reading my output then treat my PEñA as PEÑA (which is what it should have been in the first place). So if everything goes to PENA, that’s fine for this case. This method needs nothing beyond the core modules shipped with Perl 5.8.

It might not be the best method, but it does seem to work on my very large international input file, where I wanted Peña/PEÑA to end up as PENA and not PEñA.

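use Encode qw(decode_utf8);
use Unicode::Normalize;

while (<>) {                    # read the input line by line
    $_ = NFD(decode_utf8($_));  # decode UTF-8, then split accented letters into base letter + combining mark
    s/\pM//g;                   # drop the combining marks (the accents)
    s/[^\0-\x7F]//g;            # drop anything that is still outside ASCII
    print;
}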
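To check the grouping idea end to end, here is a minimal sketch that wraps the same three steps in a function and then applies uc() to the result. The helper name crush_to_ascii and the sample strings are my own illustrations, not part of the original script, and the input is assumed to be UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Unicode::Normalize qw(NFD);

# crush_to_ascii is a hypothetical helper name used only for this sketch.
sub crush_to_ascii {
    my ($bytes) = @_;
    my $text = NFD(decode_utf8($bytes));  # UTF-8 bytes -> characters, decomposed
    $text =~ s/\pM//g;                    # remove combining marks (the accents)
    $text =~ s/[^\0-\x7F]//g;             # remove whatever is still non-ASCII
    return $text;
}

# Case variations written as UTF-8 bytes (\xC3\xB1 is ñ, \xC3\x91 is Ñ).
my @variants = ("Pe\xC3\xB1a", "PE\xC3\x91A", "pe\xC3\xB1a");

# Count how many variants land on each upper-cased ASCII key.
my %groups;
$groups{ uc(crush_to_ascii($_)) }++ for @variants;

print join(", ", map { "$_ ($groups{$_})" } sort keys %groups), "\n";
```

Once the accents are gone, uc() only ever sees plain ASCII letters, so the locale problem never comes up and all three spellings share one key.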
