ISO 8859-1 characters and \w \b etc.

by Melroch (Acolyte)
on Jun 27, 2004 at 17:30 UTC ( [id://370008]=perlquestion )

Melroch has asked for the wisdom of the Perl Monks concerning the following question:

How can I get \w \b and other character range escapes to correctly identify ISO 8859-1 additional letters for European languages as alphanumeric characters?

It's very irritating not to be able to use \b and \w with these letters (which are letters to us)!

Again I speak neither English nor Perl natively, so please have forbearance!

TIA,

/Melroch

Replies are listed 'Best First'.
Re: ISO 8859-1 characters and \w \b etc.
by graff (Chancellor) on Jun 28, 2004 at 02:01 UTC
    If you have Perl 5.8, you could convert the input 8859-1 data to utf8, do the regex matching, and convert back to 8859-1 for output (assuming you don't want to just switch everything over to utf8 globally). Something like this would work:
    #!/usr/bin/perl
    use strict;
    use Encode;

    while (<>) {
        my $utf8 = decode( 'iso8859-1', $_ );
        my @words = ( $utf8 =~ /\b(\w+)\b/g );
        print join "\n", map { encode( 'iso8859-1', $_ ) } @words;
        print "\n";
    }
    The output is one "word" per line, treating accented letters as "\w", and things such as currency symbols, quotes, the inverted question mark, the non-breaking space, etc., as things that trigger "\b".

    Assuming the 8859-1 text is in a file, the example above works as follows (let's call the script "latin1-tokenizer"):

    latin1-tokenizer < latin1.txt > latin1.tkns
    That example could also be written without the encode/decode calls, using PerlIO layers instead:
    #!/usr/bin/perl
    use strict;

    open( IN, "<:encoding(iso8859-1)", $ARGV[0] )
        or die "couldn't read $ARGV[0]: $!";
    binmode STDOUT, ":encoding(iso8859-1)";
    while (<IN>) {
        my @words = ( /\b(\w+)\b/g );
        print join "\n", @words;
        print "\n";
    }

    # run it like this:
    #   tokenizer latin1.txt > latin1.tkns
    (I'm unsure about posting test data with actual latin1 characters, so I leave it to you to try it on your own data.)
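
For anyone without Latin-1 data at hand, here is a minimal sketch of the decode/match/encode approach above, run on a made-up test line. The helper name latin1_words and the sample text are illustrative assumptions, not part of the original post. The non-ASCII characters are written as \x escapes so the script itself stays pure ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes a string of ISO 8859-1 bytes and returns the
# "words" found in it, each re-encoded as ISO 8859-1 bytes.
sub latin1_words {
    my ($bytes) = @_;
    my $chars = decode( 'iso-8859-1', $bytes );    # bytes -> characters
    return map { encode( 'iso-8859-1', $_ ) } $chars =~ /\b(\w+)\b/g;
}

# Made-up test line: inverted question mark (\xBF), o-umlaut (\xF6),
# no-break space (\xA0) and e-acute (\xE9), all as Latin-1 bytes.
my $line = "\xBFsch\xF6n\xA0caf\xE9?";

binmode STDOUT;    # write the Latin-1 bytes through unchanged
print "$_\n" for latin1_words($line);
```

If the approach works as described, this should print two words, one per line, with the accented letters kept inside the words and the inverted question mark and no-break space acting as separators.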
Re: ISO 8859-1 characters and \w \b etc.
by Joost (Canon) on Jun 27, 2004 at 17:42 UTC
      I know that this node somewhat duplicates things said above, but as an ISO-8859-2 user, I would like to emphasize that you will need to set up LC_CTYPE and use locale. You should also consider setting up LC_COLLATE so that sort also uses the locale.

      You will need to have the locale installed on your system. Try setting the environment variables and running perl -v to see whether perl picks up the locale (it will complain if it doesn't).

      Having said that, locale setup is done per language and country (that's why the locale for Croatia is hr_HR and for the USA en_US). You might also use locale aliases (defined in /usr/lib/X11/locale/locale.alias).

      It might be enough just to add use locale; in your code. If you need more, an example follows (for Croatia with its funny accented characters; we use ISO-8859-2, but the principle is the same).

      #!/usr/bin/perl -w
      use strict;
      use locale;
      use POSIX qw(locale_h);

      setlocale(LC_CTYPE, 'hr_HR');
      setlocale(LC_COLLATE, 'hr_HR');

      # this file is assumed to be saved as ISO-8859-2, to match the hr_HR locale
      my $text = "foo čevapčić bar";
      print join(", ", sort split(/\W/, $text)), "\n";
      If you don't mind changing the system-wide locale, you can also set up your /etc/profile and Apache's httpd.conf with environment variables and drop setlocale from the code.
      2share!2flame...

      Thanks. Truth to say I have looked at perldoc perllocale several times and not got any wiser, I'm afraid.

      I guess what I'm really looking for is a plain English description of how to get and set locales. The workaround of using numerals instead of letters only gets you so far...

      /Melroch

        See the ENVIRONMENT section in perllocale, and maybe your local manpage for "locale". You might have to install the extra locales you want to use (my system only has the "C" and "POSIX" locales, apparently). Basically you can set a couple of environment variables, and that will determine the locale your perl program will run under. Which locales are supported is system-dependent; I can list mine using "locale -a".

        Hope this helps,
        Joost.
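
As a rough illustration of "getting and setting" a locale from inside Perl, here is a small sketch using POSIX::setlocale. The helper name try_ctype is made up for this example, and 'hr_HR' is only a sample locale name that may well not be installed on your system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Show the environment variables that decide the character-type locale.
# LC_ALL overrides LC_CTYPE, which overrides LANG.
for my $var (qw(LC_ALL LC_CTYPE LANG)) {
    printf "%-8s = %s\n", $var, defined $ENV{$var} ? $ENV{$var} : '(unset)';
}

# Hypothetical helper: try to switch LC_CTYPE to the named locale.
# setlocale() returns the new locale name on success and undef on failure
# (for example when that locale is not installed).
sub try_ctype {
    my ($name) = @_;
    return setlocale( LC_CTYPE, $name );
}

# An empty name means "take the locale from the environment".
my $from_env = try_ctype('');
print "from environment: ", defined $from_env ? $from_env : '(failed)', "\n";

# 'hr_HR' is only an example; "locale -a" shows what is installed.
my $wanted = try_ctype('hr_HR');
print "hr_HR: ", defined $wanted ? "ok ($wanted)" : "not available here", "\n";

# Called with the category alone, setlocale() reports the current setting.
print "current LC_CTYPE: ", setlocale(LC_CTYPE), "\n";
```

Checking the return value is the quickest way to find out whether a locale name is actually usable before relying on use locale for \w and \b.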

Re: ISO 8859-1 characters and \w \b etc.
by theorbtwo (Prior) on Jun 28, 2004 at 01:44 UTC

    Here's the answer, and it's a bit confusing. A perl string has a magic bit attached to it, the UTF8 bit. If it is off, your string is assumed to be in latin1. That's fairly clear. What's not clear is that when the string is a bunch of utf8 chars, ö is considered a letter (for example), but when it's latin1 characters, ö is not a letter (unless using locale).

    The solution is to make your strings utf8 strings, by using Encode.
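
The difference described above can be observed directly. This is a minimal sketch, assuming a perl with neither use locale nor the unicode_strings feature in effect; the helper names is_word_char and describe are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the one-character string matches \w.
sub is_word_char {
    my ($str) = @_;
    return $str =~ /^\w$/ ? 1 : 0;
}

# Hypothetical helper: report the UTF8 flag and the \w result for a string.
sub describe {
    my ( $label, $str ) = @_;
    printf "%-8s utf8 flag: %-3s  matches \\w: %s\n",
        $label,
        utf8::is_utf8($str) ? 'on' : 'off',
        is_word_char($str)  ? 'yes' : 'no';
}

my $bytes = "\xF6";                            # o-umlaut as one Latin-1 byte
my $chars = decode( 'iso-8859-1', $bytes );    # the same character, decoded

describe( 'bytes',   $bytes );
describe( 'decoded', $chars );
```

Under those assumptions the first line should report the flag as off with no \w match, and the second should report the flag as on with a match, even though both strings hold the same single character.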


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

Node Type: perlquestion [id://370008]
Approved by Happy-the-monk
Front-paged by bart