PerlMonks  

limiting length of utf8 string in bytes

by pbijnens (Novice)
on Dec 14, 2009 at 11:58 UTC (#812694=perlquestion)
pbijnens has asked for the wisdom of the Perl Monks concerning the following question:

What would be a nice way to limit a string to a certain length in bytes while avoiding a cut in the middle of a multibyte Unicode character? Something like this works:
  use utf8; # the literal strings are in utf8
  binmode(STDOUT, ":utf8");
  my $maxbytes = 5;
  my $a= "יטא";  # length: 3 chars, 6 bytes
  print $a, "\n";
  {
    use bytes;
    $a = substr($a, 0, $maxbytes)  if length($a) > $maxbytes;
  }
  use Encode;
  $a = decode_utf8($a,Encode::FB_QUIET);
  print $a, "\n";   # 2 chars, 4 bytes now
But I feel there should be something simpler...
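
A minimal sketch of the same idea without `use bytes`, assuming the input is already a decoded character string. The name `limit_bytes_encode` is hypothetical; it encodes explicitly, truncates the bytes, and lets `Encode::FB_QUIET` stop decoding at the incomplete trailing sequence.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical helper: return the longest prefix of $str whose UTF-8
# encoding fits in $max_bytes bytes.
sub limit_bytes_encode {
    my ($str, $max_bytes) = @_;
    my $bytes = encode_utf8($str);              # explicit UTF-8 byte string
    return $str if length($bytes) <= $max_bytes;
    $bytes = substr($bytes, 0, $max_bytes);     # may end in a partial character
    # FB_QUIET makes decode return what it processed before the first error,
    # so an incomplete trailing sequence is dropped.
    return decode('UTF-8', $bytes, Encode::FB_QUIET());
}
```

Called as `limit_bytes_encode($a, 5)` on a string of three 2-byte characters, this should give back the first two characters.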

Re: limiting length of utf8 string in bytes
by RMGir (Parson) on Dec 14, 2009 at 12:44 UTC
    On a utf8 string, chop appears to do 'the right thing', i.e. remove one trailing utf8 character, regardless of how many bytes it is.

    I guess you could keep chop-ping your string while length>$threshold, but that's O(excess characters), which might get painful.
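
The chop loop could look roughly like the sketch below. The name `limit_bytes_chop` is hypothetical, and the byte length is taken from an explicit `encode_utf8` rather than from `use bytes`; it is computed once and then reduced by the size of each removed character.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: chop whole characters until the UTF-8 encoding fits.
sub limit_bytes_chop {
    my ($str, $max_bytes) = @_;
    my $bytes = length(encode_utf8($str));      # encoded size in bytes
    while ($bytes > $max_bytes) {
        my $last = chop $str;                   # removes and returns the last character
        $bytes -= length(encode_utf8($last));   # subtract that character's encoded size
    }
    return $str;
}
```

The loop runs once per removed character, which is the O(excess characters) cost mentioned above.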

    The other alternative is to proceed by inspection: under 'use bytes', examine the byte at the $threshold+1 position and, working your way backwards, "substr" just before the first byte you find that is a valid utf8 start byte.

    That would require at most 4 loop iterations for valid utf8, I think.
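
A sketch of that backward scan, under the assumption that the input is a decoded character string. The name `limit_bytes_scan` is hypothetical, and it uses `utf8::encode` on a copy instead of `use bytes`. A byte of the form 10xxxxxx is a continuation byte, so the cut point moves left until the first dropped byte is not one.

```perl
use strict;
use warnings;

# Hypothetical helper: cut at $max_bytes, then back up to a character boundary.
sub limit_bytes_scan {
    my ($str, $max_bytes) = @_;
    utf8::encode($str);                         # $str is a copy; now UTF-8 bytes
    if (length($str) > $max_bytes) {
        my $cut = $max_bytes;
        # Step back while the first byte to be dropped is a continuation byte.
        $cut-- while $cut > 0
            && (ord(substr($str, $cut, 1)) & 0xC0) == 0x80;
        $str = substr($str, 0, $cut);
    }
    utf8::decode($str);                         # back to characters
    return $str;
}
```

For valid UTF-8 the inner loop steps back at most three bytes, since a character has at most three continuation bytes.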


    Mike
Re: limiting length of utf8 string in bytes
by ikegami (Pope) on Dec 14, 2009 at 16:59 UTC
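    sub limit_bytes {
        my ($str, $max_bytes) = @_;
        utf8::encode $str;                      # work on the UTF-8 bytes
        if (length($str) > $max_bytes) {
            substr($str, $max_bytes+1) = '';    # keep one byte too many...
            # ...then drop the last (possibly partial) character;
            # /s lets '.' also match a trailing newline byte
            $str =~ s/(?:[\xC0-\xFF]?[\x80-\xBF]+|.)\z//s;
        }
        utf8::decode $str;                      # back to characters
        return $str;
    }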

    Using use bytes; to encode is a bad idea, but if that's what you want, don't forget to also use utf8::upgrade.

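    sub limit_bytes {
        my ($str, $max_bytes) = @_;
        utf8::upgrade $str;                     # make sure the internal buffer is UTF-8
        use bytes;                              # operate on the internal bytes from here on
        if (length($str) > $max_bytes) {
            substr($str, $max_bytes+1) = '';    # keep one byte too many...
            # ...then drop the last (possibly partial) character;
            # /s lets '.' also match a trailing newline byte
            $str =~ s/(?:[\xC0-\xFF]?[\x80-\xBF]+|.)\z//s;
        }
        return $str;
    }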
Re: limiting length of utf8 string in bytes
by ambrus (Abbot) on Dec 15, 2009 at 10:57 UTC

    There's a snippet for this in the output_message function of cbstream.rb. (This code has quite a few places where it does ugly hacks with character encodings, because it was written back when only ruby 1.8 existed. Nowadays we have ruby 1.9, which has a better system for handling strings with various encodings than perl.) It's not directly applicable here, but the principle is the same.

    Assumes $str contains the decoded string. Then, after

    $str =~ /\A(.{0,383}[\x00-\xbf]|)/s or die;
    $1 should contain at most 384 bytes and not end with an incomplete utf-8 character.

    Update: ikegami's right, the above regex is wrong. (I still believe the one in cbstream is right, but does something different.)

      That's wrong. It can chop up 3 and 4 byte chars. See my reply to the OP for the fix.
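
The failure can be reproduced with a small sketch, assuming the pattern is applied to UTF-8 encoded bytes and using a limit of 3 bytes instead of 384 to keep the input short. The sample string (an ASCII letter followed by a 3-byte character) is chosen only for illustration.

```perl
use strict;
use warnings;

my $bytes = "a\x{20AC}";      # 'a' followed by EURO SIGN
utf8::encode($bytes);         # now 4 bytes: 61 E2 82 AC

# Same shape as the regex above, scaled down: at most 3 bytes in $1.
$bytes =~ /\A(.{0,2}[\x00-\xbf]|)/s or die;
my $prefix = $1;

# Hex dump of the captured prefix.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $prefix;

# utf8::decode returns false when the bytes are not well-formed UTF-8.
my $copy  = $prefix;
my $valid = utf8::decode($copy) ? 1 : 0;

print "prefix bytes: $hex, valid UTF-8: $valid\n";
# prints "prefix bytes: 61 E2 82, valid UTF-8: 0"
```

The character class accepts the continuation byte 82 as a final byte even though another continuation byte is still due, so the capture ends in the middle of the 3-byte character.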
