Re: utf8 and perl 5.6
by zentara (Archbishop) on Apr 21, 2006 at 12:24 UTC
|
I've seen this answer before from a more expert monk than me, but I can't guarantee it:
# On 5.6.x, it can be as simple as
# $latin1 = pack 'C*', unpack 'U*', $utf8;
#See the docs on pack for 5.6.1. Search for "C0".
I'm not really a human, but I play one on earth.
flash japh
| [reply] [d/l] |
Re: utf8 and perl 5.6
by vkon (Curate) on Apr 21, 2006 at 13:57 UTC
|
Do something like this (an excerpt adapted from my Perl program):
use Unicode::String qw/utf8/;
my $name = "трудная строка";
my $u=utf8($name);
# ready; use it at your will:
my $s = $u->hex;
$s=~s/U\+00(\w\w)/my($r,$p)=((pack 'H*',$1),$&);if($r=~m(^[\w ]$)){$r}else{$p}/eg;
| [reply] [d/l] |
Re: utf8 and perl 5.6
by fraktalisman (Hermit) on Apr 21, 2006 at 15:25 UTC
|
Look at the similar discussion: Converting character encodings. With sub latin1 in the code examples you might even manage to do it in 5.6 without relying on additional modules.
| [reply] [d/l] |
|
Thanks, that seems quite useful...
--
#!/usr/bin/perl
for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
| [reply] [d/l] |
Re: utf8 and perl 5.6
by doc_faustroll (Scribe) on Apr 21, 2006 at 16:03 UTC
|
Looks like a feature request on a legacy app. Time for an upgrade, audit, rewrite. Or write a completely different app in 5.8 that you call for parsing.
Wisdom would dictate pushing back against the pointy heads.
The hardest part of dev is often the political part.
| [reply] |
|
I agree, but the project has been running for nearly three years and I doubt they have a budget for this. Plus, a quick hack to get it working with 5.6 costs me considerably less time than migrating the whole app to 5.8, and I'd rather spend time with my kids in the park than sell my precious time for some money. I'm lazy, after all...
Sometimes, the "perfect solution" isn't an option.
--
#!/usr/bin/perl
for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
| [reply] [d/l] |
Re: utf8 and perl 5.6
by graff (Chancellor) on Apr 22, 2006 at 17:28 UTC
|
What do you mean by "parse a utf8-encoded RSS stream"?
If you mean there is wide-character content (utf8-encoded) in your input, and you need to translate that into an "equivalent" single-byte encoding, then you are being too vague about the problem. Can/does the input contain data in multiple languages, and might this require you to choose one or another "iso-8859-*" depending on the language? (There are fifteen different flavors of iso-8859, numbered up to iso-8859-16; it could make a big difference whether you need just one of them or more than one of them for your input.)
Also, depending on how unicode is being used in your source data, you might need something besides iso-8859, if you need to preserve stuff like "specialized" versions of quotation marks, dashes, etc (that's cp12* territory). Converting these to their plain-ASCII equivalents is easy enough, if appropriate, and just takes a little bit of study on what the data actually contains.
The structure of utf8 is such that it is actually pretty easy to parse using binary methods (testing, masking and shifting specific bits). ASCII characters are just ASCII characters; every non-ASCII (wide) character is two or more consecutive bytes with the high-bit set, and the boundaries between consecutive wide characters are unambiguous, based on how many high-bits are set in a given byte. There's a pretty good explanation of this in the "Unicode Encodings" section of the perlunicode man page. The main unicode.org web site is also an excellent resource.
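To make the "testing, masking and shifting" concrete, here is a minimal sketch in pure Perl that should also run on 5.6, since it only touches byte values. The sub name utf8_to_codepoints is made up for illustration, and the code does not reject overlong forms or surrogates, so treat it as a starting point and not as a validating decoder.

```perl
use strict;

# Decode a string of UTF-8 bytes into a list of code points (integers)
# using only bit tests, masks and shifts.
# Hypothetical helper: no checks for overlong forms or surrogates.
sub utf8_to_codepoints {
    my ($bytes) = @_;
    my @points;
    my $i   = 0;
    my $len = length $bytes;
    while ($i < $len) {
        my $byte = ord substr($bytes, $i++, 1);
        my ($need, $cp);
        # The number of leading 1-bits in the first byte gives the length.
        if    ($byte < 0x80)           { $need = 0; $cp = $byte }          # 0xxxxxxx (ASCII)
        elsif (($byte & 0xE0) == 0xC0) { $need = 1; $cp = $byte & 0x1F }   # 110xxxxx
        elsif (($byte & 0xF0) == 0xE0) { $need = 2; $cp = $byte & 0x0F }   # 1110xxxx
        elsif (($byte & 0xF8) == 0xF0) { $need = 3; $cp = $byte & 0x07 }   # 11110xxx
        else { die sprintf("unexpected byte 0x%02X at offset %d\n", $byte, $i - 1) }
        while ($need--) {
            die "truncated sequence\n" if $i >= $len;
            my $cont = ord substr($bytes, $i++, 1);
            # Continuation bytes always look like 10xxxxxx.
            die sprintf("bad continuation byte 0x%02X\n", $cont)
                unless ($cont & 0xC0) == 0x80;
            $cp = ($cp << 6) | ($cont & 0x3F);
        }
        push @points, $cp;
    }
    return @points;
}

# Usage: anything below 0x100 has the same number in ISO-8859-1,
# everything else becomes '?' in this sketch.
my $latin1 = join '', map { $_ < 0x100 ? chr($_) : '?' }
             utf8_to_codepoints("Gr\xC3\xBC\xC3\x9Fe");
```

Because ISO-8859-1 covers exactly the code points 0x00 to 0xFF, the final map is all the "conversion" that is needed when the target is Latin-1.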
So, if the tools on hand are insufficient to do "real" character encoding conversions, just do some research on the data to figure out what sorts of wide characters you are getting, and map out a hash table to convert those two- or three-byte patterns to whatever single-byte "equivalent" seems appropriate. If the input is likely to introduce "new" utf8 patterns over time, just come up with a method to flag wide characters that are not yet tabulated in your replacement hash, and have a procedure to do something appropriate with that information (e.g. get someone to figure out what the new character should be mapped to and add it to the replacement hash).
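A rough sketch of that hash-table idea follows. The table entries, the names %utf8_to_latin1, %unmapped and convert_utf8, and the choice of '?' as a placeholder are all assumptions for illustration; the real table would come from looking at the actual data.

```perl
use strict;

# Hand-built table: UTF-8 byte sequence => single ISO-8859-1 byte.
# Illustrative starting set only, not a complete mapping.
my %utf8_to_latin1 = (
    "\xC3\x84"     => "\xC4",   # U+00C4 A with diaeresis
    "\xC3\x96"     => "\xD6",   # U+00D6 O with diaeresis
    "\xC3\x9C"     => "\xDC",   # U+00DC U with diaeresis
    "\xC3\xA4"     => "\xE4",   # U+00E4 a with diaeresis
    "\xC3\xB6"     => "\xF6",   # U+00F6 o with diaeresis
    "\xC3\xBC"     => "\xFC",   # U+00FC u with diaeresis
    "\xC3\x9F"     => "\xDF",   # U+00DF sharp s
    "\xC3\xA9"     => "\xE9",   # U+00E9 e with acute
    "\xE2\x80\x9C" => '"',      # U+201C left double quotation mark
    "\xE2\x80\x9D" => '"',      # U+201D right double quotation mark
    "\xE2\x80\x93" => '-',      # U+2013 en dash
);

# Sequences seen in the input but missing from the table, with counts.
my %unmapped;

sub convert_utf8 {
    my ($text) = @_;
    # One lead byte (0xC0-0xF7) followed by its continuation bytes (0x80-0xBF).
    $text =~ s{([\xC0-\xF7][\x80-\xBF]+)}{
        exists $utf8_to_latin1{$1}
            ? $utf8_to_latin1{$1}
            : do { $unmapped{$1}++; '?' }
    }ge;
    return $text;
}

# Report untabulated sequences as hex so someone can extend the table.
sub report_unmapped {
    return map { join(' ', map { sprintf '%02X', $_ } unpack 'C*', $_)
                 . " seen $unmapped{$_} time(s)" } sort keys %unmapped;
}
```

Calling report_unmapped after a batch gives lines such as the hex bytes of each unknown sequence and how often it occurred, which is the "flag it and let someone decide" step described above.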
OTOH, maybe it's sufficient just to store the utf8 data "as-is" in the database -- that is, don't try to "parse" it with the legacy system -- and have some other, more up-to-date system read from the DB in order to do whatever conversion needs to be done (or just use the data as utf8 text). The DB itself should be neutral about the byte values stored in a "varchar" field -- though you may want to define this field as having the "binary" attribute (cf. mysql docs on "binary text fields").
| [reply] |
|
The input data is pure German (well, Austrian) text, so it's only umlauts and "scharfes s" that are causing problems. Oh, and maybe a French accent or two. Output should be ISO-8859-1.
Currently, as a first workaround, I did a quick hashlookup-regex-thingy, a bit like you suggested. But I guess I'll move to one of the CPAN modules suggested by others, as soon as I can get the sysadmin to install them (/me hates not having shell/su access to machines...)
As writing to and reading from the DB is done by the same app, your last suggestion doesn't solve my problem.
--
#!/usr/bin/perl
for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
| [reply] [d/l] |
Re: utf8 and perl 5.6
by davidnicol (Acolyte) on Apr 21, 2006 at 22:51 UTC
|
There are a lot of code snippets around that convert utf8
to and from wide integers -- including parts
of the recent perl releases. Translate some
of that into pure perl. I'm not going to start enumerating
the possible ways to iterate over the bytes in a string in
this reply.
| [reply] |
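For the "wide integer to utf8" direction, a pure-Perl sketch might look like the following. The sub name codepoint_to_utf8 is invented here, and the code makes no attempt to reject surrogates; unpack 'C*' is used as one of the possible ways to walk over the bytes of a string.

```perl
use strict;

# Encode one code point (an integer) as a string of UTF-8 bytes.
# Hypothetical helper: surrogates are not rejected.
sub codepoint_to_utf8 {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;                       # 1 byte, plain ASCII
    return pack('C2', 0xC0 | ($cp >> 6),
                      0x80 | ($cp & 0x3F))
        if $cp < 0x800;                                  # 2 bytes
    return pack('C3', 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F))
        if $cp < 0x10000;                                # 3 bytes
    return pack('C4', 0xF0 | ($cp >> 18),
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F))
        if $cp < 0x110000;                               # 4 bytes
    die sprintf("code point 0x%X is out of range\n", $cp);
}

# Usage: ISO-8859-1 byte values are already code points, so walking the
# bytes with unpack 'C*' and encoding each one gives Latin-1 to UTF-8.
sub latin1_to_utf8 {
    my ($latin1) = @_;
    return join '', map { codepoint_to_utf8($_) } unpack 'C*', $latin1;
}
```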
Re: utf8 and perl 5.6
by snowhare (Friar) on Apr 23, 2006 at 18:08 UTC
|
| [reply] |