
The joys of I18N. (That's internationalisation - I, 18 characters, N - for those who have been fortunate enough not to have to worry about it.) My code will be running in an arbitrary language (we support 9 or 10, including Japanese, which I'll use for my example here). When we run subprocesses, we will sometimes run them in that language, and sometimes in LANG=C, depending on whether the output might be displayed to the user. In cases where the output may be shown to the user, we will then take the text, embed it in our own text (using Template Toolkit), and then send it to the user (via Log::Log4perl sending it to both STDOUT and a log file). Most of this is working.

However, one issue still remains: reading from those subprocesses. If the output is small enough not to flood any buffer, setting the input stream layer to :utf8 seems to work. However, if one of the many buffers involved (remote libC, remote kernel, remote sshd, remote TCP stack, switch, local TCP stack, local kernel, local ssh, local libC, AnyEvent's sysread) manages to split a UTF-8 character, there is the concern that the utf8 layer will not handle this, and the decoding could go awry when I try to interpret it and store it. The key part here is "interpret" - if I didn't care about the interim output, it would all be fine, as I could theoretically just suck it all into a single scalar and decode it at the end. However, sometimes I need to watch for key words, phrases, tokens, whatever (and, yes, this is more problematic if the text is translated, but that's not part of this question), and report back to the user some sort of progress. So I need to utf8-decode this text incrementally.

So, the question is: is there any way to be sure I have valid UTF8 characters, and to decode them, without losing partial characters? Something like:

    $buf =~ s/^((?:[\x00-\x7f]+|[\xc0-\xff][\x80-\xbf]+)*)//;
    my $newtext = $1;
    utf8::decode($newtext);
    $text_so_far .= $newtext;

The main question is whether that regex is right or, if not, what the appropriate way to determine this would be.
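
One possible alternative to a hand-written regex, sketched here on the assumption that the core Encode module is acceptable: its decode function takes a CHECK argument, and with Encode::FB_QUIET it returns whatever it could decode and leaves the unprocessed bytes (such as a split trailing character) in the source buffer. The helper name decode_chunk and the variable $pending are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: append freshly read bytes to a pending byte buffer
# and return the characters that can be decoded so far.
sub decode_chunk {
    my ($pending_ref, $bytes) = @_;
    $$pending_ref .= $bytes;
    # With FB_QUIET, decode stops at the first sequence it cannot decode
    # and overwrites the source with the bytes it did not consume.
    return decode('UTF-8', $$pending_ref, Encode::FB_QUIET());
}

my $pending     = '';
my $text_so_far = '';

# Two reads that split the three-byte character U+672C across the boundary.
for my $chunk ("\xe6\x97\xa5\xe6", "\x9c\xac\n") {
    $text_so_far .= decode_chunk(\$pending, $chunk);
}
```

One caveat with this sketch: a genuinely malformed byte also stops decoding, so it would sit at the front of $pending indefinitely. Real code would need some way to notice that the buffer has stopped shrinking and skip or report the offending byte.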

Most of the time I want this done; I'll have to signal when I'm using Storable in the subprocess and thus don't want any translation done, but that's a SMOP. And, if it makes things any easier, I think that line-oriented processing is sufficient. So, as nice as a more general solution would be, if I understand utf8 right, the "\n" character should only show up in the byte stream if it's an actual "\n" character, and not as a secondary (or later) byte in a utf-8 character. So, would looking for whole lines be sufficient (s/^([^\n]*\n)//) to ensure I don't break up any utf-8 characters, saving the rest of the bytes for the next time through the loop? The check here would be that if the last line doesn't have a terminator, I still have to handle it, but, again, that's a SMOP.
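
The line-oriented idea can be sketched as follows. In UTF-8, every byte of a multi-byte character has its high bit set, so the byte 0x0A can only ever be a real newline. This is a minimal sketch, not a definitive implementation; the helper name take_lines and the sample chunks are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: remove every complete line from the byte buffer
# and return those lines decoded into character strings.
sub take_lines {
    my ($buf_ref) = @_;
    my @lines;
    while ($$buf_ref =~ s/^([^\n]*\n)//) {
        my $line = $1;                      # copy before calling decode
        push @lines, decode('UTF-8', $line);
    }
    return @lines;
}

my $buf = '';
my @got;

# The second line's character U+65E5 is split across the two reads.
for my $chunk ("abc\n\xe6\x97", "\xa5\nrest") {
    $buf .= $chunk;
    push @got, take_lines(\$buf);
}
# Whatever is left in $buf is an unterminated line; at end of input it
# still has to be decoded and handled separately.
```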

I'm somewhat new to the whole utf8 thing, and I'm still the go-to expert on the team :-S so I want to get it right :-)


In reply to incremental reading of utf8 input handles by Tanktalus
