Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Finding all substrings

by TheHobbit (Pilgrim)
on Apr 24, 2002 at 17:42 UTC ( #161688=perlquestion: print w/replies, xml ) Need Help??

TheHobbit has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks
I need to find the list of substing of a given string. The following sub does the job

sub substrings { my $string = shift; my @result = (); foreach my $length (1..length($string)) { foreach my $offset (0..length($string)-$length) { push @result,substr($string,$offset,$length); } } return @result; }

but I was wondering... is there a more efficent way to get this set?

Cheers
Leo TheHobbit

Replies are listed 'Best First'.
Re: Finding all substrings
by thelenm (Vicar) on Apr 24, 2002 at 17:54 UTC
    One small optimization you may do is to keep a temporary variable instead of calling length() at each iteration of each loop:
    sub substrings { my $string = shift; my @result = (); my $strlen = length $string; foreach my $length (1..$strlen) { foreach my $offset (0..$strlen-$length) { push @result, substr($string,$offset,$length); } } return @result; }
      Actually it makes it ever so slightly slower, much to my surprise :-/
      use Benchmark qw(cmpthese); sub substrings_TheHobbit { my $string = shift; my @result = (); foreach my $length (1..length($string)) { foreach my $offset (0..length($string)-$length) { push @result,substr($string,$offset,$length); } } return @result; } sub substrings_thelenm { my $string = shift; my @result = (); my $strlen = length $string; foreach my $length (1..$strlen) { foreach my $offset (0..$strlen-$length) { push @result, substr($string,$offset,$length); } } return @result; } cmpthese(-10, { TheHobbit => sub { substrings_TheHobbit("Just Another Perl Hacke +r,") }, thelenm => sub { substrings_thelenm("Just Another Perl Hacker, +") }, }); __output__ Benchmark: running TheHobbit, thelenm, each for at least 10 CPU second +s... TheHobbit: 12 wallclock secs (10.51 usr + 0.03 sys = 10.54 CPU) @ 80 +4.55/s (n=8480) thelenm: 14 wallclock secs (10.44 usr + 0.02 sys = 10.46 CPU) @ 80 +9.37/s (n=8466) Rate TheHobbit thelenm TheHobbit 805/s -- -1% thelenm 809/s 1% --
      Is there any reason why a function call would be quicker accessing a lexical variable?
      _________
      broquaint
        I bet it's because length() is special. All SVs cache their length already, so putting it in a lexical is just a slower way to cache length().

        -sam

        Some repeated benchmarking answers this . . .

        my is expensive. Removing the my and making $strlen a global eliminates the diferrence for the small test case. The bench about the same. Multiplying the length of the test string by ten and leaving $strlen global gives thelenm's sub big gains over the orginal - %6-%10. Finally, making $strlen lexical and retesting reduces those gains by %4-%8, still testing with "Just Another Perl Hacker,"x10.

        So, yes, this is definately an optimization.

        Cheers,
        Erik
      AFAIK the code between parens in the foreach statement is executed just once to obtain a list over which to iterate.
      So yours is not really an optimization.
      Update: i see now your and suaveant point as the inner loop get called multiple times: the original post was not that clear.

      $|=$_='1g2i1u1l2i4e2n0k',map{print"\7",chop;select$,,$,,$,,$_/7}m{..}g

        Very true... but the inner foreach loop gets called multiple times... so there would be a bit of savings... certainly good advice in this case... in a single for loop it would make no difference.

                        - Ant
                        - Some of my best work - (1 2 3)

Re: Finding all substrings
by suaveant (Parson) on Apr 24, 2002 at 18:05 UTC
    I guess the real question is... what are you using this for... you could be more efficient by waiting till you need a set of substrings, then generating and caching all the 3 letter substrs when you need a 3 letter substr... but that only works for certain situations...

    or you could have better space efficiency by keeping a list of offsets in an array, each index indicating the substring length... like so...

    foreach my $length (1..length($string)) { foreach my $offset (0..length($string)-$length) { # push @result,substr($string,$offset,$length); push @{$result[$length]}, $offset; } }
    Untested, but now @result should hold lists of offsets, with the primary index being the length of the substr... which uses much less space...

    really there are endless possibilities... so how can you narrow it down?

                    - Ant
                    - Some of my best work - (1 2 3)

Re: Finding all substrings
by samtregar (Abbot) on Apr 24, 2002 at 18:17 UTC
    This solution depends on two factors, how many times does the code run and how much do the length of the strings vary? If the answers are "a lot" and "not much" then here's a faster solution:
    { my @cache; # a cache of substring-finding subs sub substrings { my $string = shift; my $length = length $string; # use cached sub if we have one return $cache[$length]->($string) if exists $cache[$length]; # create sub to find substrings for this length my $sub = 'sub { $_ = shift; return ('; foreach my $length (1..length($string)) { foreach my $offset (0..length($string)-$length) { $sub .= "substr(\$_,$offset,$length),"; } } $sub .= ")};"; $cache[$length] = eval $sub; # and use it return $cache[$length]->($string); } }
    Using Benchmark.pm and a test case against random words from /usr/dict/words I found this to be 300% faster over 100,000 iterations. Not bad, but I bet we could do better!

    -sam

    PS: Has anyone noticed that <code> doesn't deal with hard tabs right? Copy-and-paste from Emacs is unpleasant.

      Slight improvement, now around 400% faster, at the expense of readability. I got rid of the parameter passing to the cached subs by using $_. Also, I unrolled the last iteration replacing a substr($string,0,length($string)) with just $string. I tried unrolling the first iteration into a split() but that was slower.
      { my @cache; sub substrings { $_ = shift; return &{$cache[length($_)]} if exists $cache[length($_)]; my $sub = 'sub { return ('; foreach my $len (1..length($_)-1) { foreach my $off (0..length($_)-$len) { $sub .= "substr(\$_,$off,$len),"; } } $sub .= "\$_)};"; $cache[length($_)] = eval $sub; return &{$cache[length($_)]}; } }
      The only trick I have left is Inline::C... I'll leave that as an exercise for the reader.

      -sam

      Says samtregar:
      I found this to be 300% faster over 100,000 iterations. Not bad, but I bet we could do better!
      Supposing that the original version took 10 seconds to run, then yours, 300% faster, completes and delivers the answer 20 seconds before you even enter the command that runs it. That is an impressive achievement indeed. I admire your optimization skills.

      Perhaps you meant that yours was three times as fast? If so, it was 66.67% faster, not 300%.

      Hope this helps.

      --
      Mark Dominus
      Perl Paraphernalia

        Um, just reading the Benchmark output man:
        Rate original new original 7710/s -- -80% new 38610/s 401% --
        If that doesn't mean 400% faster then I guess I need more education.

        -sam

        If it took 10 seconds, and was 300% faster, that would merely mean it was 3 seconds long. I think. Actually i think theyre both interchangeable
Re: Finding all substrings
by Dominus (Parson) on Apr 24, 2002 at 19:28 UTC
    I don't know if this is faster (although I suspect it is) but it certainly is cooler:
    sub substrings { my @ss; $_[0] =~ /.*?(.+?)(?{push @ss, $1})(?!)/; @ss; }

    --
    Mark Dominus
    Perl Paraphernalia

      Um, cool! But it leaks memory at an insane rate on 5.6.1. Trying to run 100,000 iterations nearly made my computer explode.

      -sam

        Says samtregar:
        Um, cool! But it leaks memory at an insane rate on 5.6.1.
        Jeff Pinyan explained this one to me. The code in the regex captures the first instance of @ss, and keeps appending to it, 1023 new elements every time through the loop. You can fix the problem by using a static variable:
        { my @ss; sub substrings { @ss = (); $_[0] =~ /.*?(.+?)(?{push @ss, $1})(?!)/; @ss; } }

        --
        Mark Dominus
        Perl Paraphernalia

Re: Finding all substrings
by erikharrison (Deacon) on Apr 24, 2002 at 18:12 UTC

    The root of optimization is finding a solid algorithm, then implementing it, then speeding the implementation. You algorithm is solid (I don't know of a faster one . . .monks?) and your implementation is simple, meaning not much room for speed gain. Offhand though, C style for loops might be slightly faster (perl does an implicit ++ and check when you call the range operator, hence the rationale that doing it directly might be faster).

    Cheers,
    Erik
      I disagree. The root of optimization is finding a solid algorithm, implementing it, tuning it and then finding a way to exploit caching to make it faster. See my solution below for an example. See Memoize for another option.

      -sam

Re: Finding all substrings
by sfink (Deacon) on Apr 24, 2002 at 20:35 UTC
    This runs a tiny bit faster for me: (5% on short strings, much less for longer strings.)
    sub substrings { my $string = shift; my @result = (); foreach my $start (0..length($string)-1) { my $substr = substr($string, $start); while (length($substr)) { push @result, $substr; chop($substr); } } return @result; }
    But forget speed. The fun solution is:
    sub substrings { local $_ = shift; my @result; do { push @result, /(?=(.+)$)/sg; chop } while (length); return @result; }
    Now if I could only figure out how to do it all in one regex... anyone?

    Update: That durn mjd got there before I even submitted this... though he used embedded code. Is there any way without it?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://161688]
Approved by DaWolf
Front-paged by DaWolf
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (2)
As of 2022-06-25 05:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My most frequent journeys are powered by:









    Results (81 votes). Check out past polls.

    Notices?