http://www.perlmonks.org?node_id=39205

webfiend has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a CGI to take requests from a user and poll an assortment of online resources to find a range of possible results (think of a Meta-search engine, and you're on the right track). It then proceeds to dissect the resulting HTML from each of these resources, and presents it to the user in a single, unified interface.

Most of it works quite smoothly, thanks to the magic of LWP::Parallel. There is one little snag, though. The result may span multiple languages and character sets, and I need to put everything in a character set that is capable of presenting everything from Western Latin to Shift JIS. I'm guessing that UTF8 (or UTF16) will work for that purpose.

Now that I've got the setup out of the way, it's time for the question itself:

Assuming I am able to determine a string's original encoding, how would I convert that string to UTF8 (or some other encoding)?

Any kind of information would be helpful, including pointers to modules and documentation for me to "RTFM" :)

Thanks,
webfiend

"All you need is ignorance and confidence; then success is sure." -- Mark Twain

Replies are listed 'Best First'.
Re: Character sets: converting to UTF8 with Perl 5.6?
by Fastolfe (Vicar) on Oct 31, 2000 at 02:23 UTC
    You may be interested in Unicode::MapUTF8, which bills itself as the module to do precisely this. I just did a search for "UTF8".

      Unicode::MapUTF8 is the first solution I'm trying out. I just thought I'd post a quick note for anyone else who might look at it.

      If you are installing version 1.05, 'make test' will fail on a Perl 5.6 setup without this change:

      Try changing line 306 of lib/Unicode/MapUTF8.pm from 'if (! $u) {' to 'if (! defined $u) {'
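      For anyone wondering why that one-word change matters: the thread does not say what $u holds at that point in MapUTF8.pm, so the values below are hypothetical. This is only a small sketch of how the two tests differ in plain Perl. A truthiness test also rejects 0, '0' and the empty string, while a defined test rejects only undef.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; they are not part of Unicode::MapUTF8.
sub fails_truth_test   { my ($u) = @_; return !$u         ? 1 : 0 }
sub fails_defined_test { my ($u) = @_; return !defined $u ? 1 : 0 }

# Sample values are assumptions; the real contents of $u are not given in the thread.
for my $u (undef, 0, '', '0', 'abc') {
    my $label = defined $u ? "'$u'" : 'undef';
    printf "%-7s  (! \$u) = %d   (! defined \$u) = %d\n",
        $label, fails_truth_test($u), fails_defined_test($u);
}
```

      So if a legitimate value can ever be 0 or an empty string, only the defined test lets it through.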

      That advice comes direct from the author of the module himself, and did in fact allow 'make test' to finish peacefully.

      Note: This applies specifically to version 1.05 of the Unicode::MapUTF8 module. I can't tell you for sure about any earlier or later version.

      There. I just saved somebody else an hour of confusion. My good deed for the day is outta the way...

      Update: Version 1.06 is already out (boy, but he moves fast), and the problem I mentioned is now fixed. So I guess this installation note of mine has mostly historical value now...

      "All you need is ignorance and confidence; then success is sure." -- Mark Twain
Re: Character sets: converting to UTF8 with Perl 5.6?
by lhoward (Vicar) on Oct 31, 2000 at 02:35 UTC
    In addition to Fastolfe's suggestion, you may want to check out the Lingua::Iconv module. It can convert to and from many different character sets, including UTF-8.
RE: Character sets: converting to UTF8 with Perl 5.6?
by Anonymous Monk on Nov 01, 2000 at 17:31 UTC
    Use Text::Iconv. This is much more reliable under 5.6 than the various Unicode::* modules out there, although you have to patch it with s/sv_undef/PL_sv_undef/g and ignore the test failures. (I've mailed the author about this and other issues.) The other problem with the Map8 stuff is that the character set names are case sensitive, which is a total pain in the ass when working with unknown sources of information.
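
    For reference, a minimal sketch of a conversion with Text::Iconv, assuming the module is installed and the local iconv library knows both charset names. The helper name to_utf8_bytes and the sample string are made up for illustration. Which charset names are accepted depends on the underlying iconv implementation, so treat 'ISO-8859-1' here as an assumption about your system.

```perl
use strict;
use warnings;
use Text::Iconv;

# Hypothetical helper: convert a byte string from $from_charset to UTF-8.
sub to_utf8_bytes {
    my ($bytes, $from_charset) = @_;

    # new() takes the source charset first, then the target charset.
    my $converter = Text::Iconv->new($from_charset, 'UTF-8');

    # convert() returns undef when the input cannot be converted.
    my $converted = $converter->convert($bytes);
    die "Conversion from $from_charset failed\n" unless defined $converted;

    return $converted;
}

# Sample input (an assumption): "cafe" with e-acute, encoded as Latin-1.
my $latin1 = "caf\xE9";
my $utf8   = to_utf8_bytes($latin1, 'ISO-8859-1');

printf "%d bytes in, %d bytes out\n", length $latin1, length $utf8;
```

    In the meta-search case you would call the helper once per fetched page, passing whatever charset you detected for that page.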