"be consistent"

Unicode word wrapping

by lestrrat (Deacon)
on Dec 09, 2002 at 08:14 UTC

I'm hitting a block trying to format text that is sent to me as a CGI parameter.

Here's the scenario: a user types something into a form, we process it, and then print out the HTML results, which include the text the user just submitted, verbatim. The easiest way is probably to surround that text with some auto-wrapping HTML tag, but we want to enclose the text in a <pre> AND also disallow bad formatting that exceeds a certain column width.

This is easily done with an ASCII character set, but I'm having a hell of a time trying to do this with a Japanese string encoded in Unicode. Does anyone have an idea as to how to do this?

Re: Unicode word wrapping
by seattlejohn (Deacon) on Dec 09, 2002 at 08:43 UTC
    What version of Perl are you using? 5.6.x exhibits some weird behavior with respect to Unicode. In particular, as perldoc perlunicode explains:

    Input and Output Disciplines
    There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

    So part of the problem may be this: You expect your query parameter is encoded in UTF-8 (I'm assuming), but your script just sees a sequence of extended-ASCII characters. You might be able to get around this by explicitly using pack "U",... to reconstruct UTF-8 characters from the input one at a time, but I don't recall if I ever got that technique to work reliably.

    If you're just trying to ensure that an input string doesn't exceed a particular character length, you should be able to use length($string) to get its length in characters rather than bytes. That assumes that you already have it stored internally as UTF-8, of course, and that you haven't done a use bytes.
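    As a minimal sketch of the difference between byte length and character length (assuming Perl 5.8 or later with the core Encode module, and using a hard-coded byte string as a stand-in for real CGI input):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three Japanese characters (U+65E5 U+672C U+8A9E) as raw UTF-8 bytes,
# the way they would arrive from an external source.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Decode the bytes into a Perl character string.
my $chars = decode('UTF-8', $bytes);

my $byte_count = length $bytes;   # length of the undecoded string: bytes
my $char_count = length $chars;   # length of the decoded string: characters

print "bytes: $byte_count, characters: $char_count\n";
# prints "bytes: 9, characters: 3"
```

    Until the input has been decoded, length() has no way to know the bytes are meant to be UTF-8, which is why the first count is 9.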

    Unicode support in 5.8 is supposed to be much improved, but I haven't had a chance to try it for myself yet.

      Sorry, I guess I wasn't clear.

      First, I'm using Perl 5.8

      As for the input from the CGI, it's originally in EUC-JP, and I convert it to UTF-8 when I receive it from the browser. This is because we eventually put it into XML. I want to wrap THAT UTF-8 string at a certain column.
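      One way to handle that conversion in Perl 5.8 is to decode once at the input boundary and encode once at the output boundary, so that everything in between works on characters. The following is only a sketch under that assumption; the variable names are hypothetical and the EUC-JP input is simulated rather than read from a real request:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the raw form value: EUC-JP bytes for U+65E5 U+672C U+8A9E.
my $euc_bytes = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");

# Decode once on input; $text is now a character string.
my $text = decode('euc-jp', $euc_bytes);

# ... wrap, measure, and otherwise process $text here ...

# Encode once on output, e.g. just before writing the XML.
my $utf8_bytes = encode('UTF-8', $text);

printf "EUC-JP bytes: %d, characters: %d, UTF-8 bytes: %d\n",
    length $euc_bytes, length $text, length $utf8_bytes;
```

      Wrapping the encoded $utf8_bytes instead of $text would count bytes rather than characters and could cut a character in half.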

Re: Unicode word wrapping
by rasta (Hermit) on Dec 09, 2002 at 09:24 UTC
    I'm not sure I understood the problem correctly, but I would advise generating all your pages with the following line in the HEAD
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
    or add
    Content-Type: text/html; charset=utf-8
    to the HTTP header, and produce all your pages in UTF-8.

    This will automatically switch the user agent to UTF-8, so you will have no problems with multiple code pages.

    Bear in mind, though, that specifying different charsets in the HTTP header and in the HTML can cause unexpected browser behavior.
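
    A rough sketch of keeping the two declarations in agreement, building the whole response in one place. The helper name utf8_response is hypothetical, and real code would also need to HTML-escape the user's text before embedding it:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a CGI response whose HTTP header and META tag
# declare the same charset that the body is actually encoded in.
# NOTE: $body_chars is embedded as-is; escaping is omitted for brevity.
sub utf8_response {
    my ($body_chars) = @_;
    my $html = qq{<html><head><meta http-equiv="Content-Type" }
             . qq{content="text/html; charset=utf-8"></head>}
             . qq{<body><pre>$body_chars</pre></body></html>\n};
    return "Content-Type: text/html; charset=utf-8\r\n\r\n"
         . encode('UTF-8', $html);
}

my $response = utf8_response("\x{65E5}\x{672C}\x{8A9E}");

binmode STDOUT;   # the response is already a byte string
print $response;
```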

Re: Unicode word wrapping
by theorbtwo (Prior) on Dec 09, 2002 at 14:45 UTC

    There's a well-defined method for doing this in Unicode: UAX#14: Line Breaking Properties.

Re: Unicode word wrapping
by ph0enix (Friar) on Dec 09, 2002 at 09:40 UTC

    I don't know what can be wrong...

    What about following code?

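    ...
    use utf8;
    my $max_text_len = 40;
    ...
    sub split_text {
        my $intext = shift;
        return '' unless defined $intext && length $intext;
        my @result = ('');
        my $index  = 0;
        # split(' ', ...) splits on runs of whitespace and skips leading whitespace
        foreach my $word (split(' ', $intext)) {
            if ($result[$index] eq '') {
                # first word on the line: no separator needed
                $result[$index] = $word;
            }
            elsif (length("$result[$index] $word") > $max_text_len) {
                # the word would overflow the current line: start a new one
                $result[++$index] = $word;
            }
            else {
                $result[$index] .= " $word";
            }
        }
        return join("\n", @result);
    }
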
Re: Unicode word wrapping
by dingus (Friar) on Dec 09, 2002 at 12:15 UTC
    Remember that because Japanese doesn't really have spaces between words, you can split it ANYWHERE without a problem. So assuming you detect some Japanese, your rule should simply be to insert a \n after every N characters if there isn't one already.
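
    The rule above could be sketched roughly as follows. This assumes the string has already been decoded into characters; wrap_every is a hypothetical helper name, and existing newlines are kept as they are:

```perl
use strict;
use warnings;

# Hypothetical helper: break each existing line of a character string
# into chunks of at most $width characters.
sub wrap_every {
    my ($text, $width) = @_;
    my @out;
    for my $line (split /\n/, $text, -1) {
        # An empty line must survive as an empty line.
        push @out, length($line) ? ($line =~ /(.{1,$width})/g) : '';
    }
    return join "\n", @out;
}

my $text    = "\x{3042}" x 7;        # seven copies of HIRAGANA LETTER A
my $wrapped = wrap_every($text, 3);

# Show the length of each resulting line.
print join(',', map { length } split /\n/, $wrapped), "\n";
# prints "3,3,1"
```

    This counts characters, not display columns. In a monospaced <pre>, full-width Japanese characters usually occupy two columns each, so N may need to be halved for such text. It also ignores the line breaking rules from UAX #14 mentioned elsewhere in this thread.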


