utf8 and perl 5.6

by domm (Chaplain)
on Apr 21, 2006 at 12:03 UTC (#544864=perlquestion)
domm has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

So, I've got this legacy system running perl 5.6.something on a server I only have ftp access to. I haven't got perl 5.6 running anywhere to test stuff (or even read the docs). Now they need to parse a utf8-encoded RSS stream (which gets stored into the DB for further processing - don't ask...)

How do I decode a utf8 string in perl 5.6 to some iso-8859-* ?

Encode is not installed, utf8::decode doesn't work either. Encode requires 5.7.3, so it's not an option. Hmm, personally I would love to have them update to 5.8 but OTOH it's quite a hassle (recompile mod_perl, reinstall half of CPAN, ...)

Any hints, anyone?

-- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}

Re: utf8 and perl 5.6
by zentara (Archbishop) on Apr 21, 2006 at 12:24 UTC
    I've seen this answer before from a more expert monk than me, but I can't guarantee it:

    # On 5.6.x, it can be as simple as
    # $latin1 = pack 'C*', unpack 'U*', $utf8;
    # See the docs on pack for 5.6.1. Search for "C0".

    I'm not really a human, but I play one on earth. flash japh

      The important part being that utf8 in all its forms isn't even half-baked in 5.6. Don't bother.

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: utf8 and perl 5.6
by vkon (Deacon) on Apr 21, 2006 at 13:57 UTC
    Do something like this (an adapted excerpt from my perl prog):

    use Unicode::String qw/utf8/;
    my $name = "трудная строка";
    my $u = utf8($name);
    # ready; use it at your will:
    my $s = $u->hex;
    $s =~ s/U\+00(\w\w)/my($r,$p)=((pack 'H*',$1),$&);if($r=~m(^[\w ]$)){$r}else{$p}/eg;
Re: utf8 and perl 5.6
by fraktalisman (Hermit) on Apr 21, 2006 at 15:25 UTC
      Thanks, that seems quite useful...
      -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
Re: utf8 and perl 5.6
by doc_faustroll (Scribe) on Apr 21, 2006 at 16:03 UTC
    Looks like a feature request on a legacy app. Time for an upgrade, audit, rewrite. Or write a completely different app in 5.8 that you call for parsing.

    Wisdom would dictate pushing back against the pointy heads

    The hardest part of dev is often the political part.

      I agree, but the project has been running for nearly three years and I doubt they have a budget for this. Plus, a quick hack to get it working with 5.6 costs me considerably less time than migrating the whole app to 5.8, and I'd rather spend time with my kids in the park than sell my precious time for some money. I'm lazy, after all...

      Sometimes, the "perfect solution" isn't an option.

      -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
Re: utf8 and perl 5.6
by davidnicol (Acolyte) on Apr 21, 2006 at 22:51 UTC
    There are a lot of code snippets around that turn utf8 back and forth between wide integers -- including parts of the recent perl releases. Translate some of that into pure perl. I'm not going to start enumerating the possible ways to iterate over the bytes in a string in this reply.
Re: utf8 and perl 5.6
by graff (Chancellor) on Apr 22, 2006 at 17:28 UTC
    What do you mean by "parse a utf8-encoded RSS stream"?

    If you mean there is wide-character content (utf8-encoded) in your input, and you need to translate that into an "equivalent" single-byte encoding, then you are being too vague about the problem. Can/does the input contain data in multiple languages, and might this require you to choose one or another "iso-8859-*" depending on the language? (There are sixteen different flavors of iso-8859; it could make a big difference whether you need just one of them or more than one of them for your input.)

    Also, depending on how unicode is being used in your source data, you might need something besides iso-8859, if you need to preserve stuff like "specialized" versions of quotation marks, dashes, etc (that's cp12* territory). Converting these to their plain-ASCII equivalents is easy enough, if appropriate, and just takes a little bit of study on what the data actually contains.

    The structure of utf8 is such that it is actually pretty easy to parse using binary methods (testing, masking and shifting specific bits). ASCII characters are just ASCII characters; every non-ASCII (wide) character is two or more consecutive bytes with the high-bit set, and the boundaries between consecutive wide characters are unambiguous, based on how many high-bits are set in a given byte. There's a pretty good explanation of this in the "Unicode Encodings" section of the perlunicode man page. The main unicode.org web site is also an excellent resource.
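The bit testing, masking and shifting described above can be sketched in pure Perl roughly as follows. This is a minimal illustration, not a tested 5.6 solution: the sub name `utf8_to_latin1` is made up for this example, it assumes the input is a plain byte string, it replaces anything outside ISO-8859-1 (or any malformed sequence) with `?`, and it does not reject overlong encodings.

```perl
use strict;

# Hypothetical helper: turn UTF-8 encoded bytes into ISO-8859-1 bytes.
# Code points above 0xFF and malformed sequences become '?'.
sub utf8_to_latin1 {
    my ($bytes) = @_;
    my @b   = unpack 'C*', $bytes;   # one integer per input byte
    my $out = '';
    my $i   = 0;
    while ($i < @b) {
        my $c = $b[$i++];
        my ($cp, $more);
        if    ($c < 0x80)           { $cp = $c;        $more = 0 }  # 0xxxxxxx: ASCII
        elsif (($c & 0xE0) == 0xC0) { $cp = $c & 0x1F; $more = 1 }  # 110xxxxx
        elsif (($c & 0xF0) == 0xE0) { $cp = $c & 0x0F; $more = 2 }  # 1110xxxx
        elsif (($c & 0xF8) == 0xF0) { $cp = $c & 0x07; $more = 3 }  # 11110xxx
        else                        { $out .= '?'; next }           # stray continuation byte
        my $ok = 1;
        for (1 .. $more) {
            # every continuation byte must look like 10xxxxxx
            if ($i < @b && ($b[$i] & 0xC0) == 0x80) {
                $cp = ($cp << 6) | ($b[$i++] & 0x3F);
            }
            else {
                $ok = 0;
                last;
            }
        }
        $out .= ($ok && $cp <= 0xFF) ? chr($cp) : '?';
    }
    return $out;
}
```

For example, `utf8_to_latin1("Gr\xC3\xBC\xC3\x9Fe")` should give the Latin-1 bytes `"Gr\xFC\xDFe"`, while a three-byte sequence such as the euro sign (`\xE2\x82\xAC`) falls outside ISO-8859-1 and comes back as `?`.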

    So, if the tools on hand are insufficient to do "real" character encoding conversions, just do some research on the data to figure out what sorts of wide characters you are getting, and map out a hash table to convert those two- or three-byte patterns to whatever single-byte "equivalent" seems appropriate. If the input is likely to introduce "new" utf8 patterns over time, just come up with a method to flag wide characters that are not yet tabulated in your replacement hash, and have a procedure to do something appropriate with that information (e.g. get someone to figure out what the new character should be mapped to and add it to the replacement hash).
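A rough sketch of that replacement-hash idea, assuming byte-string input and ISO-8859-1 output. The names `%utf8_to_latin1`, `%unknown`, `lookup` and `convert_with_table` are invented for this illustration, and the table only holds a handful of German and French characters; real data would need its own survey.

```perl
use strict;

# Illustrative table: UTF-8 byte sequence => ISO-8859-1 byte.
my %utf8_to_latin1 = (
    "\xC3\xA4" => "\xE4",   # a-umlaut
    "\xC3\xB6" => "\xF6",   # o-umlaut
    "\xC3\xBC" => "\xFC",   # u-umlaut
    "\xC3\x84" => "\xC4",   # A-umlaut
    "\xC3\x96" => "\xD6",   # O-umlaut
    "\xC3\x9C" => "\xDC",   # U-umlaut
    "\xC3\x9F" => "\xDF",   # sharp s
    "\xC3\xA9" => "\xE9",   # e-acute
);

# Sequences seen in the input but missing from the table, with counts,
# so someone can review them and extend the table later.
my %unknown;

sub lookup {
    my ($seq) = @_;
    return $utf8_to_latin1{$seq} if exists $utf8_to_latin1{$seq};
    $unknown{$seq}++;
    return '?';             # placeholder for untabulated characters
}

sub convert_with_table {
    my ($text) = @_;
    # Match one well-formed 2-, 3- or 4-byte UTF-8 sequence at a time.
    $text =~ s/([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})/lookup($1)/ge;
    return $text;
}
```

After a run, the keys of `%unknown` are the byte sequences that still need a mapping; how they get reported (log file, mail, DB table) is left open here.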

    OTOH, maybe it's sufficient just to store the utf8 data "as-is" in the database -- that is, don't try to "parse" it with the legacy system -- and have some other, more up-to-date system read from the DB in order to do whatever conversion needs to be done (or just use the data as utf8 text). The DB itself should be neutral about the byte values stored in a "varchar" field -- though you may want to define this field as having the "binary" attribute (cf. mysql docs on "binary text fields").

      The input data is pure German (well, Austrian) text, so it's only Umlauts and "scharfes s" that are causing problems. Oh, and maybe a French accent or two. Output should be ISO-8859-1.

      Currently, as a first workaround, I did a quick hashlookup-regex-thingy, a bit like you suggested. But I guess I'll move to one of the CPAN modules suggested by others, as soon as I can get the sysadmin to install them (/me hates not having shell/su access to machines...)

      As writing to and reading from the DB is done by the same app, your last suggestion doesn't solve my problem.

      -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
Re: utf8 and perl 5.6
by snowhare (Friar) on Apr 23, 2006 at 18:08 UTC
