
Unicode word wrapping

by lestrrat (Deacon)
on Dec 09, 2002 at 08:14 UTC ( [id://218470]=perlquestion )

lestrrat has asked for the wisdom of the Perl Monks concerning the following question:

I'm hitting a block trying to format text that is sent to me as a CGI parameter.

Here's the scenario: a user types something into a form, we process it, and then we print out the HTML results, which include the text that the user just submitted, verbatim. The easiest way is probably to just surround that text with some auto-wrapping HTML tag, but we want to use a <pre> to enclose the text AND also disallow bad formatting that exceeds a certain column width.

This is easily done with an ASCII character set, but I'm having a hell of a time trying to do this on a Japanese string encoded in Unicode. Does anyone have an idea as to how to do this?

Replies are listed 'Best First'.
Re: Unicode word wrapping
by seattlejohn (Deacon) on Dec 09, 2002 at 08:43 UTC
    What version of Perl are you using? 5.6.x exhibits some weird behavior with respect to Unicode. In particular, as perldoc perlunicode explains:

    Input and Output Disciplines
    There is currently no easy way to mark data read from a file or other external source as being utf8. This will be one of the major areas of focus in the near future.

    So part of the problem may be this: You expect your query parameter to be encoded in UTF-8 (I'm assuming), but your script just sees a sequence of extended-ASCII characters. You might be able to get around this by explicitly using pack "U",... to reconstruct UTF-8 characters from the input one at a time, but I don't recall if I ever got that technique to work reliably.

    If you're just trying to ensure that an input string doesn't exceed a particular character length, you should be able to use length($string) to get its length in characters rather than bytes. That assumes that you already have it stored internally as UTF-8, of course, and that you haven't done a use bytes.
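To illustrate the difference between byte length and character length, here is a minimal sketch using the Encode module that ships with Perl 5.8. The sample bytes and variable names are made up for the example; it assumes the input really is valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample data (an assumption for this sketch): the raw UTF-8 bytes of the
# three-character string U+65E5 U+672C U+8A9E, as a CGI script might
# receive them.
my $bytes = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Before decoding, Perl treats each byte as its own character.
my $byte_len = length $bytes;

# decode() converts the byte string into a character string.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "bytes: $byte_len, characters: $char_len\n";
```

Any column counting or wrapping should be done on the decoded string, since only there does length() count characters.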

    Unicode support in 5.8 is supposed to be much improved, but I haven't had a chance to try it for myself yet.

            $perlmonks{seattlejohn} = 'John Clyman';

      Sorry, I guess I wasn't clear.

      First, I'm using Perl 5.8.

      As for the input from the CGI, it's originally in EUC-JP, and I convert it to UTF-8 when I receive it from the browser. This is because we eventually shove it into XML format. I want to wrap THAT UTF-8 string at a certain column.
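The conversion described here might look something like the following under Perl 5.8, using the bundled Encode module. This is only a sketch: the sample bytes and variable names are hypothetical, and the poster's actual conversion code is not shown in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample data (an assumption): the EUC-JP bytes for U+65E5 U+672C U+8A9E.
my $euc_bytes = "\xC6\xFC\xCB\xDC\xB8\xEC";

# Step 1: decode the EUC-JP bytes into a Perl character string.
my $text = decode('euc-jp', $euc_bytes);

# Step 2: do any length checks or wrapping here, on characters.
my $columns = length $text;

# Step 3: encode to UTF-8 bytes only when writing the XML/HTML out.
my $utf8_bytes = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes\n", $columns, length $utf8_bytes;
```

Keeping the wrapping step between decode and encode avoids splitting a multi-byte sequence in the middle.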

Re: Unicode word wrapping
by rasta (Hermit) on Dec 09, 2002 at 09:24 UTC
    I'm not sure I understood the problem correctly. But from my point of view, I would advise generating all your pages with the following line in the HEAD
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
    or adding
    Content-Type: text/html; charset=utf-8
    to the HTTP header, and producing all your pages in UTF-8.

    This will automatically switch the user agent to UTF-8, so you will have no problems with multiple code pages.

    Keep in mind, though, that specifying different charsets in HTTP and HTML can cause unexpected browser behavior.
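One way to keep the HTTP header and the HTML declaration from drifting apart is to build both from a single variable. The sketch below assumes a plain CGI script that prints its own headers; the variable names are hypothetical.

```perl
use strict;
use warnings;

# Single source of truth for the charset (hypothetical variable name).
my $charset = 'utf-8';

# HTTP header line, followed by the blank line that ends the headers.
my $http_header = "Content-Type: text/html; charset=$charset\r\n\r\n";

# Matching declaration for the HTML HEAD.
my $meta_tag =
    qq{<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=$charset">};

# Encode character strings as UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';

print $http_header;
print "<html><head>$meta_tag<title>Result</title></head><body></body></html>\n";
```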

    -- Yuriy Syrota
Re: Unicode word wrapping
by theorbtwo (Prior) on Dec 09, 2002 at 14:45 UTC

    There's a well-defined method for doing this in Unicode: UAX#14: Line Breaking Properties.
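UAX #14 assigns a line-breaking class to every character and defines rules over pairs of classes, which is far more than fits here. As a heavily reduced illustration of the kind of rule involved, the sketch below wraps at a fixed width but refuses to start a line with a few closing punctuation marks. The function name and the character list are assumptions made for this example; this is not an implementation of UAX #14.

```perl
use strict;
use warnings;
use utf8;    # the literals below are written as UTF-8 source text

# Hypothetical helper: wrap a decoded character string at $width
# characters, letting a line run long rather than placing one of the
# listed punctuation marks at the start of the next line.
sub wrap_no_leading_punct {
    my ($text, $width) = @_;
    my $no_start = qr/[、。」）]/;    # illustrative list only
    my @lines;
    my $line = '';
    for my $ch (split //, $text) {
        if (length($line) >= $width && $ch !~ $no_start) {
            push @lines, $line;
            $line = '';
        }
        $line .= $ch;
    }
    push @lines, $line if length $line;
    return join "\n", @lines;
}

# The full stop stays on the first line even though that makes it
# six characters long.
my $wrapped = wrap_no_leading_punct("あいうえお。かき", 5);
```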

    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

Re: Unicode word wrapping
by ph0enix (Friar) on Dec 09, 2002 at 09:40 UTC

    I don't know what can be wrong...

    What about the following code?

    ...
    use utf8;
    my $max_text_len = 40;
    ...
    sub split_text {
        my $intext = shift || return '';
        my @result = ('');    # start with one empty line to avoid undef warnings
        my $index  = 0;
        foreach my $word (split(' ', $intext)) {
            next if !length($word);
            if (length("$result[$index] $word") > $max_text_len) {
                # Adding this word would exceed the limit: start a new line.
                $index++;
                $result[$index] = $word;
            }
            else {
                # Append to the current line, with a space if it is not empty.
                $result[$index] .= ($result[$index] ? ' ' : '') . $word;
            }
        }
        return join("\n", @result);
    }
Re: Unicode word wrapping
by dingus (Friar) on Dec 09, 2002 at 12:15 UTC
    Remember that because Japanese doesn't really have spaces between words, you can split it ANYWHERE without a problem. So assuming you detect some Japanese, your rule should simply be: insert a \n after every N characters if there isn't one already.
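A minimal sketch of that rule, assuming the text has already been decoded into a Perl character string. The function name is hypothetical. Existing newlines are kept, and each line longer than N characters is cut into N-character pieces.

```perl
use strict;
use warnings;
use utf8;    # the sample string below is written as UTF-8 source text

# Hypothetical helper: break every existing line into chunks of at most
# $width characters. Expects a character string, not raw bytes.
sub wrap_every_n {
    my ($text, $width) = @_;
    my @out;
    for my $line (split /\n/, $text, -1) {
        my @chunks = $line =~ /(.{1,$width})/g;
        # Preserve empty lines that were already in the input.
        push @out, @chunks ? @chunks : '';
    }
    return join "\n", @out;
}

# Eleven characters wrapped at five per line give lines of 5, 5 and 1.
my $wrapped = wrap_every_n("あいうえおかきくけこさ", 5);
```

Note that this counts characters, not display columns. In a monospaced <pre>, full-width Japanese characters are usually rendered two columns wide, so N may need to be about half the intended column limit.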


    Enter any 47-digit prime number to continue.

Node Type: perlquestion [id://218470]
Approved by tadman