Unicode and Regexps: convert or am I missing something?

by newrisedesigns (Curate)
on Jun 01, 2005 at 21:52 UTC ( #462670=perlquestion )

newrisedesigns has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I have a script that gets data from Google Adsense. The data is in Unicode (UTF-16, I believe). When I try to pattern match on the data, I can only match one character. A pattern that looks for more than one character in sequence fails.

A typical line looks like:

5/18/05     184     7       3.8%    6.14    1.13

Matching \d works, but attempting to match \d{2}, \d+\/ or anything else that catches two characters in sequence fails. I take it this is because Unicode uses more than one byte per character.
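A minimal sketch of the symptom, assuming the data really is UTF-16LE and is being matched as raw bytes. The sample line is the one above (tab-separated here as a guess); the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample: the line from the post, encoded as UTF-16LE without a BOM.
my $line  = "5/18/05\t184\t7\t3.8%\t6.14\t1.13";
my $bytes = encode('UTF-16LE', $line);

# In UTF-16LE each ASCII character is followed by a NUL byte,
# so two digit bytes are never adjacent in the raw data.
my $raw_single = $bytes =~ /\d/    ? 1 : 0;   # a lone digit byte matches
my $raw_double = $bytes =~ /\d{2}/ ? 1 : 0;   # digits are separated by NULs

# After decoding, the string holds characters and the pattern behaves normally.
my $text           = decode('UTF-16LE', $bytes);
my $decoded_double = $text =~ /\d{2}/ ? 1 : 0;

printf "raw \\d: %d, raw \\d{2}: %d, decoded \\d{2}: %d\n",
    $raw_single, $raw_double, $decoded_double;
```

The interleaved NUL bytes are what break any pattern that expects two characters in a row.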

I'm only extracting data from this Unicode text, and do not need to output Unicode. Why don't the regexps work? If they're not supposed to work, how can I convert the text to ISO-8859-1/Latin1? I tried converting with iconv, but to no avail: it returned UTF-16 regardless of the arguments (I used -f UTF-16 -t UTF-8).

Thanks in advance for your help.

Replies are listed 'Best First'.
Re: Unicode and Regexps: convert or am I missing something?
by thundergnat (Deacon) on Jun 02, 2005 at 00:24 UTC

    Are you sure the text is UTF-16? If so, your best bet is probably to convert to UTF-8. Perl 5.8 and above handle utf-8 natively, and utf-8 and utf-16 have a 1-to-1 character correspondence, so there won't be any encoding issues. Use Encode to handle it easily.

    use Encode;
    my $string;
    Encode::from_to($string, 'utf-16', 'utf-8');

    The ASCII digits 0-9 are encoded identically in UTF-8 and ISO-8859-1, so the same regex should match either.

    If you want to be extra sure to find Unicode digits, you can use named property assertions, which automatically match with Unicode semantics.

    my @digits = $string =~ /\p{Digit}+/g;

    Just be aware that this regex will also find digits from other character blocks if there are any in the string.
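As a rough illustration of that caveat, the sketch below compares \p{Digit} with an explicit [0-9] class. The input string is hypothetical and mixes ASCII digits with Arabic-Indic digits.

```perl
use strict;
use warnings;

# Hypothetical input: ASCII digits followed by ARABIC-INDIC digits
# (U+0661, U+0662, U+0663).
my $string = "184 \x{0661}\x{0662}\x{0663}";

my @unicode_digits = $string =~ /\p{Digit}+/g;   # decimal digits from any script
my @ascii_digits   = $string =~ /[0-9]+/g;       # restricted to ASCII 0-9

printf "\\p{Digit} runs: %d, [0-9] runs: %d\n",
    scalar @unicode_digits, scalar @ascii_digits;
```

If only ASCII digits are expected in the report, the explicit [0-9] class is the stricter choice.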

      Thanks for your reply.

      According to the header, the returned data is UTF-16LE, which I assume stands for little-endian. I am on a Mac, so I guess I'm big-endian, which would explain why I was getting Asian glyphs instead of my Adsense results.
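A small sketch of how a byte-order mix-up could produce those glyphs, assuming the bytes are UTF-16LE but get interpreted as UTF-16BE. Only the start of the date field is used as sample data.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumed sample: the start of the date field, as UTF-16LE bytes.
my $bytes = encode('UTF-16LE', '5/18');

# Decoding with the wrong byte order swaps the two bytes of every code unit:
# '5' (0x35 0x00) is read as U+3500, which lies in a CJK ideograph block.
my $wrong = decode('UTF-16BE', $bytes);
my $right = decode('UTF-16LE', $bytes);

my @wrong_codes = map { sprintf 'U+%04X', ord } split //, $wrong;
my @right_codes = map { sprintf 'U+%04X', ord } split //, $right;

print "as UTF-16BE: @wrong_codes\n";
print "as UTF-16LE: @right_codes\n";
```

Nothing has to be flipped by hand: naming the byte order explicitly in the encoding name is enough for Encode to read the data correctly.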

      I tried the Encode method you suggested, also with the variation 'LE' after the 16 (why not, I've tried everything else, it seems) but it didn't work. The \p{Digit} does match, but fails when used in conjunction with the date field separator (/) like so: \p{Digit}\\/.

      I guess the problem comes down to endian-ness of the data returned. How do I flip flop the data so that the methods available to me (Encode:: and /usr/bin/iconv) will work for me?

        UTF-16LE is supported by the Encode module, so it should work. Did you try down-converting it to Latin-1? The less commonly used encodings don't have as many aliases, so you may need to be more careful about how the encoding name is specified.

        Encode::from_to($string, 'UTF-16LE', 'utf8');

        should be ok, as should

        Encode::from_to($string, 'UTF-16LE', 'iso-8859-1');

        You only need to single escape the forward slash in the regex. (Or use alternate delimiters.)

        my $string = '5/18/05 184 7 3.8% 6.14 1.13';
        if ($string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#) {
            print $1;
        }

        use Encode;
        my $string = Encode::decode('UTF-16LE', $data_from_google);
        $string =~ /what you want/;

        from_to is the wrong function to use. It converts between byte strings, but to work correctly with regexps you need character strings, so you need to use decode.
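A sketch of the decode approach applied to the sample line from the question. The helper name parse_report_line and the stand-in data (including its leading BOM and tab separators) are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: decode one UTF-16LE report line and split it into fields.
sub parse_report_line {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);   # bytes -> Perl character string
    $text =~ s/^\x{FEFF}//;                  # drop a leading BOM if one is present
    my ($date, @numbers) = split ' ', $text; # split on runs of whitespace
    return unless defined $date && $date =~ m{^\d+/\d+/\d+$};
    return { date => $date, numbers => \@numbers };
}

# Stand-in for the downloaded data, built from the sample line in the question.
my $data_from_google =
    encode('UTF-16LE', "\x{FEFF}5/18/05\t184\t7\t3.8%\t6.14\t1.13\n");

my $row = parse_report_line($data_from_google);
print "date: $row->{date}; fields: @{ $row->{numbers} }\n";
```

The BOM is stripped by hand here because the explicit UTF-16LE encoding leaves it in the decoded string as U+FEFF.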

                dakkar - Mobilis in mobile

        Most of my code is tested...

        Perl is strongly typed, it just has very few types (Dan)

Node Type: perlquestion [id://462670]
Approved by polettix