PerlMonks  

Re^2: Unicode and Regexps: convert or am I missing something?

by newrisedesigns (Curate)
on Jun 02, 2005 at 01:02 UTC (#462711)


in reply to Re: Unicode and Regexps: convert or am I missing something?
in thread Unicode and Regexps: convert or am I missing something?

Thanks for your reply.

According to the header, the returned data is UTF-16LE, which I assume stands for little-endian. I am on a Mac, so I guess I'm big-endian, which would explain why I was getting Asian glyphs instead of my AdSense results.

I tried the Encode method you suggested, also with the variation 'LE' after the 16 (why not, I've tried everything else, it seems) but it didn't work. The \p{Digit} does match, but fails when used in conjunction with the date field separator (/) like so: \p{Digit}\\/.

I guess the problem comes down to endian-ness of the data returned. How do I flip flop the data so that the methods available to me (Encode:: and /usr/bin/iconv) will work for me?


Re^3: Unicode and Regexps: convert or am I missing something?
by thundergnat (Deacon) on Jun 02, 2005 at 01:29 UTC

    UTF-16LE is supported by the Encode module, so it should work. Did you try down-converting it to Latin-1? The less commonly used encodings don't have as many aliases, so you may need to be more careful about how the encoding name is specified.

    Encode::from_to($string, 'UTF-16LE', 'utf8');

    should be ok, as should

    Encode::from_to($string, 'UTF-16LE', 'iso-8859-1');

    You only need to single escape the forward slash in the regex. (Or use alternate delimiters.)

    my $string = '5/18/05 184 7 3.8% 6.14 1.13';
    if ($string =~ m#(\p{Digit}+/\p{Digit}+/\p{Digit}+)#) {
        print $1;
    }
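
    A minimal sketch of the conversion described above, not taken from the original post. It assumes the downloaded data looks like the sample row; the row is built here with Encode::encode as a stand-in, and $raw, $octets and $date_re are illustrative names. It also shows why the date pattern cannot match the unconverted bytes: in UTF-16LE every ASCII character is followed by a NUL byte, so a digit is never directly followed by "/".

    ```perl
    use strict;
    use warnings;
    use Encode;

    # Hypothetical stand-in for the downloaded report: one row as UTF-16LE bytes.
    my $raw = Encode::encode('UTF-16LE', '5/18/05 184 7 3.8% 6.14 1.13');

    # Braces as delimiters, so the slash needs no escaping.
    my $date_re = qr{(\p{Digit}+/\p{Digit}+/\p{Digit}+)};

    # On the raw bytes each digit is followed by a NUL, never by "/".
    my $raw_matches = $raw =~ $date_re ? 1 : 0;

    # from_to converts in place and returns the new length in octets
    # (undef on failure), so work on a copy.
    my $octets = $raw;
    my $len = Encode::from_to($octets, 'UTF-16LE', 'utf8');
    defined $len or die "conversion failed";

    my ($date) = $octets =~ $date_re;

    print "raw matched: $raw_matches\n";   # raw matched: 0
    print "date: $date\n";                 # date: 5/18/05
    ```

    Note that the result of from_to is still a byte string. That is harmless for this all-ASCII row, but it matters as soon as the data contains non-ASCII characters.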
Re^3: Unicode and Regexps: convert or am I missing something?
by dakkar (Hermit) on Jun 02, 2005 at 10:13 UTC
    use Encode;
    my $string = Encode::decode('UTF-16LE', $data_from_google);
    $string =~ /what you want/;

    from_to is the wrong function to use here. It converts between byte strings, but regexes only work correctly on character strings, so you need to use decode.
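
    To make the byte string versus character string distinction concrete, here is a small sketch under stated assumptions: the sample text (a euro sign followed by a date) and the variable names $data, $chars and $octets are made up for illustration and do not come from the thread.

    ```perl
    use strict;
    use warnings;
    use Encode;

    # Hypothetical input: a euro sign (U+20AC) plus a date, as UTF-16LE bytes.
    my $data = Encode::encode('UTF-16LE', "\x{20AC}5/18/05");

    # from_to leaves UTF-8 encoded *bytes* in the scalar (done on a copy).
    my $octets = $data;
    Encode::from_to($octets, 'UTF-16LE', 'utf8');

    # decode returns a *character* string.
    my $chars = Encode::decode('UTF-16LE', $data);

    # The euro sign is one character but three UTF-8 bytes.
    print length($chars),  "\n";   # 8
    print length($octets), "\n";   # 10

    # A regex that names the character only matches the decoded string.
    my $chars_ok  = $chars  =~ /^\x{20AC}\p{Digit}/ ? 1 : 0;
    my $octets_ok = $octets =~ /^\x{20AC}\p{Digit}/ ? 1 : 0;
    print "chars: $chars_ok, octets: $octets_ok\n";   # chars: 1, octets: 0
    ```

    When printing a decoded string that contains non-ASCII characters, the output handle should be given an encoding layer (for example with binmode), otherwise Perl warns about wide characters.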

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)
