Re^2: Unicode and Regexps: convert or am I missing something?

by newrisedesigns (Curate)
on Jun 02, 2005 at 01:02 UTC

in reply to Re: Unicode and Regexps: convert or am I missing something?
in thread Unicode and Regexps: convert or am I missing something?

Thanks for your reply.

According to the header, the returned data is UTF-16LE, which I assume stands for little-endian. I am on a Mac, so I guess I'm big-endian, which would explain why I was getting Asian glyphs instead of my Adsense results.

I tried the Encode method you suggested, also with the variation 'LE' after the 16 (why not, I've tried everything else, it seems) but it didn't work. The \p{Digit} does match, but fails when used in conjunction with the date field separator (/) like so: \p{Digit}\\/.

I guess the problem comes down to endian-ness of the data returned. How do I flip flop the data so that the methods available to me (Encode:: and /usr/bin/iconv) will work for me?

Re^3: Unicode and Regexps: convert or am I missing something?
by thundergnat (Deacon) on Jun 02, 2005 at 01:29 UTC

    UTF-16LE is supported by the Encode module, so it should work... Did you try down-converting it to Latin-1? The less often used encodings don't have as many aliases, so you may need to be more careful about how the encoding is specified.

    Encode::from_to($string, 'UTF-16LE', 'utf8');

    should be ok, as should

    Encode::from_to($string, 'UTF-16LE', 'iso-8859-1');
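For reference, a minimal sketch of how from_to behaves: it rewrites the byte string in place and returns the length of the result in octets (undef on error). The sample input is built with encode purely for illustration, and the variable names are assumptions, not anything from the original script.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Stand-in for the downloaded data: the bytes of an ASCII-only
# string encoded as UTF-16LE (two bytes per character).
my $sample = '5/18/05';
my $octets = encode('UTF-16LE', $sample);
my $before = length $octets;

# from_to converts $octets in place; the return value is the new
# length in octets, or undef if the conversion failed.
my $after = from_to($octets, 'UTF-16LE', 'iso-8859-1');

print "before: $before octets, after: $after octets\n";
print "converted: $octets\n";
```

The result is still a byte string, which is harmless here only because every character in the sample fits in a single Latin-1 byte.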

    You only need to single escape the forward slash in the regex. (Or use alternate delimiters.)

    my $string = '5/18/05 184 7 3.8% 6.14 1.13';
    if ($string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#) {
        print $1;
    }
Re^3: Unicode and Regexps: convert or am I missing something?
by dakkar (Hermit) on Jun 02, 2005 at 10:13 UTC
    use Encode;
    my $string = Encode::decode('UTF-16LE', $data_from_google);
    $string =~ /what you want/;

    from_to is the wrong function to use. It converts between byte strings, but for regexps to work correctly you need character strings, so you need to use decode.
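A small sketch of the difference, assuming the data really is UTF-16LE. The sample row is the one used earlier in the thread, and $raw_bytes is a hypothetical stand-in for whatever the download returned. In the raw bytes every ASCII character is followed by a NUL byte, which would explain why \p{Digit} alone matches but a digit followed by a slash does not.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for the downloaded data: UTF-16LE bytes.
my $row       = '5/18/05 184 7 3.8% 6.14 1.13';
my $raw_bytes = encode('UTF-16LE', $row);

# On the raw bytes a digit is never directly followed by "/",
# because a NUL byte sits between them.
my $bytes_match = ($raw_bytes =~ m#\p{Digit}/#) ? 1 : 0;

# decode() turns the bytes into a Perl character string,
# and the regexp then sees real characters.
my $text = decode('UTF-16LE', $raw_bytes);
my ($date) = $text =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#;

print "match on raw bytes: $bytes_match\n";
print "date after decode: $date\n";
```

Here the match on the raw bytes fails, while the decoded string yields 5/18/05. Note that the byte order is named in the encoding passed to decode, so the endianness of the machine running the script should not matter.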

            dakkar - Mobilis in mobile

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

