Re: utf8 and perl 5.6

by graff (Chancellor)
on Apr 22, 2006 at 17:28 UTC

in reply to utf8 and perl 5.6

What do you mean by "parse a utf8-encoded RSS stream"?

If you mean there is wide-character content (utf8-encoded) in your input, and you need to translate that into an "equivalent" single-byte encoding, then you are being too vague about the problem. Can the input contain data in multiple languages, and would that force you to choose one or another "iso-8859-*" depending on the language? (There are fifteen different flavors of iso-8859, numbered 1 through 16 with 12 unused; it could make a big difference whether you need just one of them or more than one of them for your input.)

Also, depending on how unicode is being used in your source data, you might need something besides iso-8859 if you need to preserve stuff like "specialized" versions of quotation marks, dashes, etc. (that's cp12* territory, i.e. the Windows code pages such as cp1252). Converting these to their plain-ASCII equivalents is easy enough, if appropriate, and just takes a little bit of study on what the data actually contains.

The structure of utf8 is such that it is actually pretty easy to parse using binary methods (testing, masking and shifting specific bits). ASCII characters are just ASCII characters; every non-ASCII (wide) character is two or more consecutive bytes with the high bit set, and the boundaries between consecutive wide characters are unambiguous: the first byte of each sequence has its top two bits set, with the number of leading 1-bits giving the length of the sequence, while every following byte starts with the bits 10. There's a pretty good explanation of this in the "Unicode Encodings" section of the perlunicode man page. The main web site is also an excellent resource.
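To make the bit-testing idea concrete, here is a minimal sketch that works on raw bytes only, so it should not need the Encode module. The sub names utf8_seq_length and utf8_codepoints are made up for this illustration, and the code assumes the input is reasonably well-formed utf8 (it does not reject overlong encodings, for instance).

```perl
use strict;

# Hypothetical helper: given one byte value (0..255), return how many bytes
# the utf8 sequence starting with that byte occupies, or 0 if the byte
# cannot start a sequence (continuation byte or invalid lead byte).
sub utf8_seq_length {
    my ($byte) = @_;
    return 1 if ($byte & 0x80) == 0x00;   # 0xxxxxxx: plain ASCII
    return 2 if ($byte & 0xE0) == 0xC0;   # 110xxxxx: starts a 2-byte sequence
    return 3 if ($byte & 0xF0) == 0xE0;   # 1110xxxx: starts a 3-byte sequence
    return 4 if ($byte & 0xF8) == 0xF0;   # 11110xxx: starts a 4-byte sequence
    return 0;                             # 10xxxxxx or 11111xxx
}

# Hypothetical helper: turn a string of raw utf8 bytes into a list of
# Unicode code points by masking and shifting.
sub utf8_codepoints {
    my ($bytes) = @_;
    my @b = unpack('C*', $bytes);
    my @cps;
    my $i = 0;
    while ($i < @b) {
        my $len = utf8_seq_length($b[$i]);
        die sprintf("bad lead byte 0x%02X at offset %d\n", $b[$i], $i)
            unless $len;
        die "truncated sequence at offset $i\n" if $i + $len > @b;
        # Keep only the payload bits of the first byte.
        my $cp = $len == 1 ? $b[$i] : $b[$i] & (0xFF >> ($len + 1));
        for my $j (1 .. $len - 1) {
            my $c = $b[$i + $j];
            die sprintf("bad continuation byte 0x%02X at offset %d\n", $c, $i + $j)
                unless ($c & 0xC0) == 0x80;
            # Each continuation byte contributes its low six bits.
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        push @cps, $cp;
        $i += $len;
    }
    return @cps;
}

# "Gruesse" with u-umlaut and sharp s, written as raw utf8 bytes.
print join(' ', map { sprintf 'U+%04X', $_ } utf8_codepoints("Gr\xC3\xBC\xC3\x9Fe")), "\n";
# prints: U+0047 U+0072 U+00FC U+00DF U+0065
```

Any code point below 0x100 that comes out of this can be written directly as the iso-8859-1 byte of the same value; anything above that needs a decision about what the "equivalent" should be.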

So, if the tools on hand are insufficient to do "real" character encoding conversions, just do some research on the data to figure out what sorts of wide characters you are getting, and map out a hash table to convert those two- or three-byte patterns to whatever single-byte "equivalent" seems appropriate. If the input is likely to introduce "new" utf8 patterns over time, just come up with a method to flag wide characters that are not yet tabulated in your replacement hash, and have a procedure to do something appropriate with that information (e.g. get someone to figure out what the new character should be mapped to and add it to the replacement hash).
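A rough sketch of that replacement-hash approach might look like the following. The table entries, the '?' placeholder, and the names %replace, %unmapped and utf8_to_single_byte are all assumptions made up for illustration; the real table has to come from looking at the actual data. The regex relies on the input being valid utf8, so that a lead byte plus all the continuation bytes after it is exactly one character.

```perl
use strict;

# Illustrative table only: raw utf8 byte sequences on the left, the chosen
# single-byte replacement (here iso-8859-1 or plain ASCII) on the right.
my %replace = (
    "\xC3\xA4"     => "\xE4",   # a-umlaut
    "\xC3\xB6"     => "\xF6",   # o-umlaut
    "\xC3\xBC"     => "\xFC",   # u-umlaut
    "\xC3\x84"     => "\xC4",   # A-umlaut
    "\xC3\x96"     => "\xD6",   # O-umlaut
    "\xC3\x9C"     => "\xDC",   # U-umlaut
    "\xC3\x9F"     => "\xDF",   # sharp s
    "\xE2\x80\x9C" => '"',      # U+201C left double quotation mark
    "\xE2\x80\x9D" => '"',      # U+201D right double quotation mark
    "\xE2\x80\x93" => '-',      # U+2013 en dash
);

# Sequences that turned up in the input but are not in the table yet,
# with a count of how often each was seen.
my %unmapped;

sub utf8_to_single_byte {
    my ($bytes) = @_;
    # One lead byte (11xxxxxx) followed by its continuation bytes (10xxxxxx).
    $bytes =~ s{([\xC0-\xFF][\x80-\xBF]+)}{
        my $seq = $1;
        if (exists $replace{$seq}) { $replace{$seq} }
        else { $unmapped{$seq}++; '?' }
    }ge;
    return $bytes;
}

# Report the not-yet-tabulated sequences as hex so someone can look them up.
sub report_unmapped {
    for my $seq (sort keys %unmapped) {
        my $hex = join ' ', map { sprintf '%02X', $_ } unpack('C*', $seq);
        print "unmapped utf8 sequence: $hex (seen $unmapped{$seq} times)\n";
    }
}
```

Calling utf8_to_single_byte("Gr\xC3\xBC\xC3\x9Fe \xE2\x82\xAC5") would give back "Gr\xFC\xDFe ?5" and leave the euro sign's three bytes recorded in %unmapped, which is the cue to decide on a mapping and add it to the table.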

OTOH, maybe it's sufficient just to store the utf8 data "as-is" in the database -- that is, don't try to "parse" it with the legacy system -- and have some other, more up-to-date system read from the DB in order to do whatever conversion needs to be done (or just use the data as utf8 text). The DB itself should be neutral about the byte values stored in a "varchar" field -- though you may want to define this field as having the "binary" attribute (cf. mysql docs on "binary text fields").

Re^2: utf8 and perl 5.6
by domm (Chaplain) on Apr 23, 2006 at 19:03 UTC

    The input data is pure German (well, Austrian) text, so it's only Umlauts and "scharfes s" that are causing problems. Oh, and maybe a French accent or two. Output should be ISO-8859-1.

    Currently, as a first workaround, I did a quick hashlookup-regex-thingy, a bit like you suggested. But I guess I'll move to one of the CPAN modules suggested by others, as soon as I can get the sysadmin to install them (/me hates not having shell/su access to machines...)

    As writing to and reading from the DB is done by the same app, your last suggestion doesn't solve my problem.

    -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
