PerlMonks  

limiting length of utf8 string in bytes

by pbijnens (Novice)
on Dec 14, 2009 at 11:58 UTC ( #812694=perlquestion )
pbijnens has asked for the wisdom of the Perl Monks concerning the following question:

What would be a nice way to limit a string to a certain length in bytes while avoiding chopping in the middle of a multibyte Unicode character? Something like this works:
  use utf8; # the literal strings are in utf8
  binmode(STDOUT, ":utf8");
  my $maxbytes = 5;
  my $a= "יטא";  # length: 3 chars, 6 bytes
  print $a, "\n";
  {
    use bytes;
    $a = substr($a,0,$maxbytes)  if length($a) > $maxbytes;
  }
  use Encode;
  $a = decode_utf8($a,Encode::FB_QUIET);
  print $a, "\n";   # 2 chars, 4 bytes now
But I feel there should be something simpler...

Re: limiting length of utf8 string in bytes
by RMGir (Prior) on Dec 14, 2009 at 12:44 UTC
    On a utf8 string, chop appears to do 'the right thing', i.e. remove one trailing utf8 character, regardless of how many bytes it is.

    I guess you could keep chop-ping your string while its length in bytes exceeds $threshold, but that's O(excess characters), which might get painful.
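    A minimal sketch of the chop loop, assuming the byte length that matters is that of the UTF-8 encoding. The name limit_bytes_by_chop is hypothetical, not something from this thread. chop returns the character it removed, so the byte count can be updated without re-encoding the whole string on every pass:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: drop whole characters from the end of a character
# string until its UTF-8 encoding fits in $max_bytes.
sub limit_bytes_by_chop {
    my ($str, $max_bytes) = @_;

    # Encode once to learn the starting byte length.
    my $bytes = length(encode_utf8($str));

    while ($bytes > $max_bytes && length $str) {
        my $ch = chop $str;                   # removes one character
        $bytes -= length(encode_utf8($ch));   # subtract that character's byte size
    }
    return $str;
}
```

    The loop still runs once per excess character, so the O(excess characters) concern stands.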

    The other alternative is to proceed by inspection: working on the encoded bytes, examine the byte at position $threshold+1 and, working your way backwards, cut the string just before the first byte you find that is a valid UTF-8 start byte.

    That would require at most 4 loop iterations for valid utf8, I think.
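    A sketch of that backward scan, assuming the input is a character string whose encoding is valid UTF-8. The helper name limit_bytes_by_scan is made up for illustration. It uses utf8::encode to get at the bytes rather than 'use bytes', and relies on continuation bytes having the bit pattern 10xxxxxx:

```perl
use strict;
use warnings;

# Hypothetical helper: truncate to at most $max_bytes of UTF-8 without
# splitting a character. Assumes the encoded string is valid UTF-8.
sub limit_bytes_by_scan {
    my ($str, $max_bytes) = @_;
    $max_bytes = 0 if $max_bytes < 0;

    utf8::encode($str);                # characters -> UTF-8 bytes (on our copy)

    if (length($str) > $max_bytes) {
        my $cut = $max_bytes;          # index of the first byte that does not fit

        # Step back while that byte is a continuation byte (0x80..0xBF),
        # i.e. while we are in the middle of a character.
        $cut-- while $cut > 0
            && (ord(substr($str, $cut, 1)) & 0xC0) == 0x80;

        substr($str, $cut) = '';       # cut on the character boundary
    }

    utf8::decode($str);                # UTF-8 bytes -> characters
    return $str;
}
```

    For valid UTF-8 the inner loop steps back at most three times, since a character is at most four bytes long.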


    Mike
Re: limiting length of utf8 string in bytes
by ikegami (Pope) on Dec 14, 2009 at 16:59 UTC
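    sub limit_bytes {
        my ($str, $max_bytes) = @_;
        utf8::encode $str;                      # characters -> UTF-8 bytes
        if (length($str) > $max_bytes) {
            substr($str, $max_bytes+1) = '';    # keep one byte past the limit
            # Remove the last character, whether complete or truncated.
            # /s lets . also match a trailing newline byte.
            $str =~ s/(?:[\xC0-\xFF]?[\x80-\xBF]+|.)\z//s;
        }
        utf8::decode $str;                      # UTF-8 bytes -> characters
        return $str;
    }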

    Using use bytes; to encode is a bad idea, but if that's what you want, don't forget to also use utf8::upgrade.

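    sub limit_bytes {
        my ($str, $max_bytes) = @_;
        utf8::upgrade $str;                     # force the internal UTF-8 representation
        use bytes;                              # byte semantics from here to the end of the sub
        if (length($str) > $max_bytes) {
            substr($str, $max_bytes+1) = '';    # keep one byte past the limit
            # Remove the last character, whether complete or truncated.
            # /s lets . also match a trailing newline byte.
            $str =~ s/(?:[\xC0-\xFF]?[\x80-\xBF]+|.)\z//s;
        }
        return $str;
    }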
Re: limiting length of utf8 string in bytes
by ambrus (Abbot) on Dec 15, 2009 at 10:57 UTC

    There's a snippet for this in the output_message function of cbstream.rb. (This code has quite a few places where it does ugly hacks with character encodings, because it was written back when only ruby 1.8 existed. Nowadays we have ruby 1.9 which has a better system for handling strings with various encodings than perl.) It's not directly applicable here, but the principle is the same.

    This assumes $str contains the decoded string. Then, after

    $str =~ /\A(.{0,383}[\x00-\xbf]|)/s or die;
    $1 should contain at most 384 bytes and not end with an incomplete utf-8 character.

    Update: ikegami's right, the above regex is wrong. (I still believe the one in cbstream is right, but does something different.)

      That's wrong. It can chop up 3 and 4 byte chars. See my reply to the OP for the fix.
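      A small demonstration of the failure, as a sketch: the regex has the same shape as the one above, but the limit is reduced from 384 bytes to 4 so the example stays short, and the test string is an arbitrary choice. The final class [\x00-\xbf] accepts any continuation byte, including the first continuation byte of a 3-byte character, so the match can end in the middle of one:

```perl
use strict;
use warnings;

my $str = "aa\x{20AC}";     # 'a', 'a', EURO SIGN (3 bytes in UTF-8)
utf8::encode($str);         # now the 5 bytes 61 61 E2 82 AC

# Same shape as the regex above, with the limit reduced to 4 bytes.
my ($prefix) = $str =~ /\A(.{0,3}[\x00-\xbf]|)/s;

# Show the captured bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $prefix;

# utf8::decode returns false when the bytes are not well-formed UTF-8.
my $copy  = $prefix;
my $valid = utf8::decode($copy) ? 1 : 0;

print "$hex valid=$valid\n";
```

      The captured prefix stops after E2 82, leaving the euro sign one byte short.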
