Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer

Unique-Character Substring

by japhy (Canon)
on Jan 20, 2001 at 04:46 UTC ( #53151=perlcraft: print w/ replies, xml ) Need Help??

   1: =pod
   3:   @longest = pinyan_UCS($string)
   5: This function returns the set of the longest substrings
   6: in a given string.  It seems rather efficient, even though
   7: it calls C<index()> quite a bit.  I found that using a hash
   8: to figure if I'd seen a character had adverse effects.
  10: =cut
  12: sub pinyan_UCS {
  13:   my $str = shift;
  14:   my $len = length $str;
  15:   my ($diff,$biggest) = (0,0);
  16:   my ($jump,@ahead,@matches);
  18:   for (my $i = 0; $i < $len; ) {
  19:     my $match = [ $i, $len ];
  20:     if ($len - $i >= $biggest) {
  21:       for (my $k = $i; $k < $match->[1]; $k++) {
  22:         $ahead[$k] ||= index($str, substr($str,$k,1), $k+1);
  23:         if ($ahead[$k] != -1 and $match->[1] > $ahead[$k]) {
  24:           $match->[1] = $ahead[$k];
  25:           $jump = $k;
  26:         }
  27:       }
  29:       $diff = $match->[1] - $match->[0];
  31:       if ($diff > $biggest) { ($biggest,@matches) = ($diff,$match) }
  32:       elsif ($diff == $biggest) { push @matches, $match; }
  33:     }
  34:     else { last }
  36:     $i = ++$jump;
  37:   }   
  39:   return map substr($str, $_->[0], $_->[1] - $_->[0]), @matches;
  40: }

Comment on Unique-Character Substring
Download Code
Re (tilly) 1: Unique-Character Substring
by tilly (Archbishop) on Jan 20, 2001 at 14:16 UTC
    I found your description rather confusing. You are not interested in substrings, rather in substrings that do not find repeated characters. (I take it that UCS means Unique Character Substrings?) For comparison here is another approach based on recursive splits of the string rather than repeated calls to index. I have no idea how it compares for efficiency.
    sub tilly_UCS { my $str = shift; # Try all cuts. Those that don't fall in 2 are repeats foreach my $char (split //, $str) { my @cut = split /\Q$char\E/, $str, -1; if (2 != @cut) { my @rejoined = map "$cut[$_-1]$char$cut[$_]", 1..$#cut; @rejoined = sort {length $b <=> $a} @rejoined; my @unique = tilly_UCS(shift @rejoined); foreach my $str (@rejoined) { if (length($str) < length($unique[0])) { last; # Avoid useless work, cannot improve } my @found = tilly_UCS($str); if (length($found[0]) > length($unique[0])) { @unique = @found; } elsif (length($found[0]) == length($unique[0])) { push @unique, @found; } } return @unique; } } # No repeats return $str; }
    BTW those that don't know what the third argument to shift did should look it up. You will avoid a commmon misunderstanding that can lead to bugs.

      This doesn't do the same as the original. When fed "The quick brown fox jumps over the lazy dog" it returns "quick brown quick brown quick brown quick brown" rather than the "quick brown" of the original.

      I also found it strange that you return @unique under one condition, and $str under another...

      And I assume you meant the third argument to split rather than shift?


        Oops, on both counts. I meant split, not shift.

        As for the different returns, in one case I verified that it was a unique string and so return the original, in the other I verified it was not and returned the substrings in it. However I don't keep context about locations in the original string, so I cannot tell whether a duplicate is because I found the same string through two paths. In your example through cutting on 'h' then 'e' I get to ' quick brown fox jumped ov' no matter which way I split on 'h' first.

        That is fixable, but not easily.

Re: Unique-Character Substring
by salvadors (Pilgrim) on Jan 20, 2001 at 19:59 UTC

    This can actually be done in O(N):

    sub tony_UCS { local $_ = reverse shift; my $longest = my $tohere = ""; while (my $char = chop $_) { my $idx = index($tohere, $char); if ($idx != -1) { $longest = $tohere if (length $tohere > length $longest); $tohere = substr($tohere, $idx + 1) . $char; } else { $tohere .= $char; } } $longest = $tohere if (length $tohere > length $longest); return $longest; }

    This is about 3 times faster on a short test string, and quickly goes up to 5 times faster for slightly longer strings...


      Wow... very nice. I see exactly what you did. You reversed the domain of the index() -- instead of seeing where 'x' occurs next in the string, you see if it's already occurred in the substring.

      Damn. Nice.

      japhy -- Perl and Regex Hacker
      My turn to offer corrections. :-)

      First of all index is a O(n) operation. If you build a hash first and use that it is still logarithmic (not sure if study does that) but you lose.

      Also if there are multiple substrings of the maximum length you return only one of them. Try the string "hello hello" out. Jeff finds "lo he" and "o hel". You only find "lo he".

        Yes, of course. I'd totally missed the point of returning an array! This will solve that part:

        sub tony_UCS { local $_ = reverse shift; my %longest; my $longest = 0; my $tohere = ""; while (my $char = chop $_) { my $idx = index($tohere, $char); if ($idx != -1) { push @{$longest{$longest = length $tohere}}, $tohere if (length $tohere >= $longest); $tohere = substr($tohere, $idx + 1) . $char; } else { $tohere .= $char; } } push @{$longest{$longest = length $tohere}}, $tohere if (length $tohere >= $longest); return @{$longest{$longest}}; }

        It's still considerably faster than the original, even if not O(n). :)


Re: Unique-Character Substring
by dchetlin (Friar) on Jan 21, 2001 at 09:08 UTC
    Well, heck, if we're going to do this re-posting in a different medium thing, here's what I had on FWP. Without benchmarking, I put my money on it being faster.
    #!/usr/bin/perl -w use strict; local $" = ','; while (<DATA>) { chomp; my @chars = split //; my $index = -1; my %seen = (); my $answer = [""]; while (++$index < @chars) { if (exists $seen{$chars[$index]}) { answer_set(\%seen, $answer); $index = $seen{$chars[$index]}; %seen = (); } else { $seen{$chars[$index]} = $index; } } answer_set(\%seen, $answer); print "$_: @$answer\n"; } sub answer_set { my $seen = shift; goto 'JUMP_' . ((keys %$seen <=> length $_[0][0]) + 1); JUMP_0: return; JUMP_2: $_[0] = []; JUMP_1: push @{$_[0]}, join "", sort { $seen->{$a} <=> $seen->{$b} } keys %$seen; return; } __DATA__ mississippi abcbef adbcbef that abcdefa abcdabc testing this test abcdefg


      Without benchmarking, I put my money on it being faster.

      I'd hold on to your money, if I were you. On a short string mine is about 4 times faster. On a longer string it's about 10 times faster...


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlcraft [id://53151]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (11)
As of 2014-12-22 16:00 GMT
Find Nodes?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?

    Results (121 votes), past polls