Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Truncate string to limited length, throwing away unimportant characters first.

by ambrus (Abbot)
on Mar 16, 2010 at 18:32 UTC ( [id://828994]=perlquestion: print w/replies, xml ) Need Help??

ambrus has asked for the wisdom of the Perl Monks concerning the following question:

I need to make sure that a string is at most 256 characters long. For this, I'd need some perl code that checks if the string is too long, and if it is, it truncates it to at most 256 characters by throwing away whitespace characters from the end of string first, then whitespace characters from the beginning of the string, then any characters from the end of the string.

Do not remove more characters than necessary, so if the string is already at most 256 characters long, then don't remove anything, and if it's longer, then the result shall be exactly 256 characters long.

Here are some examples, assuming for simplicity that I wanted to truncate to 6 characters instead of 256 characters.
inputoutput
"  ab ""  ab "
"  ab   ""  ab  "
"  abc   ""  abc "
"  abcd   ""  abcd"
"  abcde   "" abcde"
"  abcdef   ""abcdef"
"  abcdefg   ""abcdef"

What do you think is the best way truncate an input string this way with some perl code? Below is one solution (again using 6 instead of 256), but it might not be the best one.

Update: the code below was wrong, as ikegami points out in the reply.

It should work now. The bug was that I wrote /\A(\s*)(.*)(\s*)\z/s instead of /\A(\s*)(.*)(\s*)\z/s.

use warnings; use strict; for my $i ( " ab ", " ab ", " abc ", " abcd ", " abcde ", " abcdef ", " abcdefg " ) {
$i =~ /\A(\s*)(.*?)(\s*)\z/s or die; my $o; if (length($1) + length($2) < 6) { $o = substr($i, 0, 6); } elsif (length($2) < 6) { $o = substr($1 . $2, -6); } else { $o = substr($2, 0, 6); }
printf "%-15s%s", qq("$i"), qq( => "$o"\n); } __END__

Replies are listed 'Best First'.
Re: Truncate string to limited length, throwing away unimportant characters first.
by merlyn (Sage) on Mar 17, 2010 at 02:39 UTC
    You guys are all trying too hard:
    s/\s$// || s/^\s// || s/.$// while length > 256;

    -- Randal L. Schwartz, Perl hacker

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

      I figured this code would be executed repeatedly, so performance would matter.

      For strings with very few surrounding spaces,

      length=300 Rate merlyn ikegami merlyn 8132/s -- -80% ikegami 41078/s 405% -- length=500 Rate merlyn ikegami merlyn 1660/s -- -91% ikegami 18224/s 998% -- length=1000 Rate merlyn ikegami merlyn 488/s -- -92% ikegami 6095/s 1149% --

      I wrote a more comprehensive benchmark, but I'm getting inconsistent results at the moment. I'll run on it a more stable machine tomorrow.

      use strict; use warnings; use Benchmark qw( cmpthese ); my %tests = ( ikegami11 => 's/^(.{256,}?)\s+\z/$1/s, s/^\s*(.{256}).*\z/$1/s + if length > 256;', ikegami21 => 's/(?<=.{6})\s+\z//s, s/^\s*(.{256}).*\z/$1/s + if length > 256;', ikegami31 => 's/.{6}\K\s+\z//s, s/^\s*(.{256}).*\z/$1/s + if length > 256;', ikegami12 => 's/^(.{256,}?)\s+\z/$1/s, s/^\s+(?=.{6})//s, s/(?<=^.{ +6}).*\z//s if length > 256;', ikegami22 => 's/(?<=.{6})\s+\z//s, s/^\s+(?=.{6})//s, s/(?<=^.{ +6}).*\z//s if length > 256;', ikegami32 => 's/.{6}\K\s+\z//s, s/^\s+(?=.{6})//s, s/(?<=^.{ +6}).*\z//s if length > 256;', ikegami13 => 's/^(.{256,}?)\s+\z/$1/s, s/^\s+(?=.{6})//s, s/(?<=.{6 +}).*\z//s if length > 256;', ikegami23 => 's/(?<=.{6})\s+\z//s, s/^\s+(?=.{6})//s, s/(?<=.{6 +}).*\z//s if length > 256;', ikegami33 => 's/.{6}\K\s+\z//s, s/^\s+(?=.{6})//s, s/(?<=.{6 +}).*\z//s if length > 256;', ikegami14 => 's/^(.{256,}?)\s+\z/$1/s, s/^\s+(?=.{6})//s, s/^.{6}\K +.*\z//s if length > 256;', ikegami24 => 's/(?<=.{6})\s+\z//s, s/^\s+(?=.{6})//s, s/^.{6}\K +.*\z//s if length > 256;', ikegami34 => 's/.{6}\K\s+\z//s, s/^\s+(?=.{6})//s, s/^.{6}\K +.*\z//s if length > 256;', merlyn => 's/\s$// || s/^\s// || s/.$// while length > 256;', ); $_ = 'use strict; use warnings; local $_ = our $pat;' . $_ for values(%tests); for my $len (256, 260, 300, 500) { for our $pat ( ' ' x $len, (' ' x ($len/2)) . ('x' x ($len/2)), 'x' x $len, ) { printf("length=%u, pat=\"%s...%s\"\n", $len, substr($pat, 0, 5), + substr($pat, -5)); cmpthese(-1, \%tests); print("\n"); } }
      Results. They vary *a lot* based on the input pattern.
Re: Truncate string to limited length, throwing away unimportant characters first.
by ikegami (Patriarch) on Mar 16, 2010 at 19:22 UTC

    by throwing away whitespace characters from the end of string first, then whitespace characters from the beginning of the string, then any characters from the end of the string.

    Your code removes from the beginning first. It doesn't produce the result in the table you posted.


    A regex solution:

    s/^(.{6,}?)\s+\z/$1/s; s/^\s*(.{6}).*\z/$1/s;

    Possibly faster alts for first line:

    s/(?<=.{6})\s+\z//s;
    s/.{6}\K\s+\z//s; # Req 5.10

    Possibly faster alts for second line:

    s/^\s+(?=.{6})//s; s/(?<=^.{6}).*\z//s;
    s/^\s+(?=.{6})//s; s/(?<=.{6}).*\z//s;
    s/^\s+(?=.{6})//s; s/^.{6}\K.*\z//s; # Req 5.10
Re: Truncate string to limited length, throwing away unimportant characters first.
by kennethk (Abbot) on Mar 16, 2010 at 18:42 UTC
    How about some loop control mixed with regular expressions instead of substr?

    If this is running the regular expression engine a bit much, you can refactor into three while loops, though I think this is less clear for casual inspection:

    Update: Or better yet, avoid the looping with quantifiers:

Re: Truncate string to limited length, throwing away unimportant characters first.
by BrowserUk (Patriarch) on Mar 16, 2010 at 20:22 UTC

    Three loops, but only one or two will ever iterate:

    #! perl -slw use strict; sub truncTo { my( $s, $t ) = @_; my $l = length $s; $s =~ m[^\s*(.+?)\s*$]; my( $b, $e ) = ( $-[1], $+[1] ); ++$e while $e < $l and ( $e - $b ) < $t; --$b while $b and ( $e - $b ) < $t; --$e while ( $e - $b ) > $t; return substr( $s, $b, $e-$b ); } for( " ab ", " ab ", " abc ", " abcd ", " abcde ", " abcdef ", " abcdefg ", ) { printf "%15.15s => '%s'\n", "'$_'", truncTo( $_, 6 ); } __END__ c:\test>828994.pl ' ab ' => ' ab ' ' ab ' => ' ab ' ' abc ' => 'abc ' ' abcd ' => 'abcd ' ' abcde ' => 'abcde ' ' abcdef ' => 'abcdef' ' abcdefg ' => 'abcdef'

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Practically all the tests failed. You are removing spaces from the beginning before removing spaces from the end.

        Whoops! I got the order reversed in my mind. All that's needed is to reverse the ordering of the first two loops:

        #! perl -slw use strict; sub truncTo { my( $s, $t ) = @_; my $l = length $s; $s =~ m[^\s*(.+?)\s*$]; my( $b, $e ) = ( $-[1], $+[1] ); --$b while $b and ( $e - $b ) < $t; ++$e while $e < $l and ( $e - $b ) < $t; --$e while ( $e - $b ) > $t; return substr( $s, $b, $e-$b ); } for( " ab ", " ab ", " abc ", " abcd ", " abcde ", " abcdef ", " abcdefg ", ) { printf "%15.15s => '%s'\n", "'$_'", truncTo( $_, 6 ); } __END__ c:\test>828994.pl ' ab ' => ' ab ' ' ab ' => ' ab ' ' abc ' => ' abc ' ' abcd ' => ' abcd' ' abcde ' => ' abcde' ' abcdef ' => 'abcdef' ' abcdefg ' => 'abcdef'

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Truncate string to limited length, throwing away unimportant characters first.
by rowdog (Curate) on Mar 16, 2010 at 21:00 UTC

    I decided to abuse sprintf, just for fun. I imagine that substr and ikegami's regex solutions are significantly faster with real data.

    #!/usr/bin/perl use strict; use warnings; my $max_len = 6; for my $i ( " ab ", " ab ", " abc ", " abcd ", " abcde ", " abcdef ", " abcdefg " ) { my $o = abuse_sprintf($i); printf "%-15s%s", qq("$i"), qq( => "$o"\n); } sub abuse_sprintf { my $i = shift; length $i <= $max_len and return $i; my ( $a, $b, $c ) = $i =~ /\A(\s*)(.*?)(\s*)\z/s or die "bad input"; length($b) >= $max_len and return sprintf "%.*s", $max_len, $b; length($a) + length($b) >= $max_len and return sprintf "%.*s%s", $max_len - length($b), $a, $b; return sprintf "%.*s", $max_len, $a . $b . $c; }
Re: Truncate string to limited length, throwing away unimportant characters first.
by deMize (Monk) on Mar 17, 2010 at 04:20 UTC
    I didn't see your answer before I typed this, but it's basically the same thing as your if statement, just more descriptive than it needs to be and it uses if-return structure rather than if/elsif/else, yours being more efficient.
    I think I'd rather the method you posted, but would store length($2) in a variable, since that is the only one that may be called twice, and depending on the string, it could be the longest, so it'd save the most cpu. -- this depends on how often the second if-block would be triggered
    #/usr/bin/perl use strict; use constant TRUE => 1; use constant FALSE => 0; our $debug = FALSE; print truncate_string(" abcdefg ",256); sub truncate_string{ my ($string,$trunc_length) = @_; $string =~ m/^(\s*)(.*?)(\s*)$/; my $space_count_pre = length($1); my $letter_count = length($2); my $space_text_length = $space_count_pre + $letter_count; sub debug{ print "Beginning Spaces: $space_count_pre\n"; print "Text Length: $letter_count\n"; print "Trunc Length: $trunc_length\n"; print "String: \"$string\"\n\n"; } debug() if $debug; # if the final string is already less than the current number of + alphanumeric characters if ($trunc_length <= $letter_count){$string = substr($2,0,$trunc +_length);return $string;} # if the final string is bigger than the beginning spaces + text if ($trunc_length > $space_text_length) {$string = "$1$2" . subs +tr($3,0,$trunc_length - $space_text_length);return $string;} # if the final string is bigger than text, but smaller than begi +nning spaces + text if ($trunc_length > $letter_count && $trunc_length < $space_text +_length){$string = substr($1,0,$trunc_length - $space_text_length) . +$2;return $string;} }

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://828994]
Approved by marto
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (4)
As of 2024-04-23 15:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found